C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, Now & Future by Ameet...


DESCRIPTION

The presentation aims to highlight the challenges posed by large-scale and near real-time data processing problems. In the past, such problems were solved using conventional technologies, primarily a database and a JMS queue. However, these solutions had their limits and presented serious problems in terms of scale and redundancy. The new breed of products, a la Cassandra & Kafka, being innately distributed in their design, aims to tackle such challenges in a very elegant manner. The presentation will showcase some industry use cases of this genre and describe solutions that have been growing in sophistication.


Cassandra & Next Generation Analysis

Cassandra for a high-velocity data ingestion and real-time analysis system.

Ameet Chaubal & Fausto Inestroza

Presentation Route
•  Describe the conventional technology solution
•  Highlight its deficiencies
•  Showcase the new solution implemented using Cassandra
•  Lay out the architecture with improvements

Business Case
•  Capture messages from a high-volume e-Commerce site
•  Store them into a database
•  Perform near real-time queries for troubleshooting
•  Perform deeper analysis a la BI

Olden Days…
(Diagram: eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data warehouse → Analysis)

Business Case, Details…
Messages: 5,000 msg/sec (~250 million/day); message size: 1 KB
(Diagram: eCommerce Website → JMS Queue (decouple UI from storage; multiple sinks) → Transient Storage RDBMS (dedicated storage; triage) → Data warehouse (data analysis; business intelligence))

What's the problem?
(Diagram: SITE I and SITE II, each with its own JMS Queue and transient storage, batch-loading into a shared Data warehouse)
•  Queue replication problems
•  Message loss
•  Other applications affected in case of failover
•  Triage data isolated
•  No universal view
•  Data consolidation adds delay
•  Inability to keep up with increasing messages
•  Analysis always lagging the action
•  No low-latency queries

Problems Recap
•  High write speed: over 5,000 msg/sec
•  Extraction & load very slow: ETL from transient storage to the data warehouse takes over 4 hours
•  Analysis always lags events by hours: ETL performed in batches 4 hours apart
•  No high availability: no geo-redundancy for the transient storage
•  Data stored in disparate buckets: no universal view of data for "triage" applications/troubleshooting
•  No dashboard: no low-latency queries
•  No immediate alerts or pattern detection: no real-time analysis

Solution Blueprint
(Diagram, steps A1–A12: the online e-Commerce application publishes events to a JMS queue via a JMS publisher (write event to queue); replication consumers (Hector/Java clients 1 through n behind a load-balanced VIP) fetch from the queue and write to Cassandra over a Thrift connection pool; Cassandra + Hadoop then serves Map/Reduce, Hive queries/BI, and a real-time dashboard.)
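To make the consumer side of the blueprint concrete, here is a minimal sketch (Java) of one replication consumer: a JMS listener that drains the event queue and hands each message to a Cassandra writer. The queue wiring and the EventWriter interface are hypothetical illustrations rather than the original implementation; the Hector mutations behind the writer are sketched later in the deck.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.MessageListener;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

/**
 * Sketch of one replication consumer from the blueprint: a JMS listener that
 * drains the event queue and hands each message to a Cassandra writer.
 * EventWriter is a hypothetical seam, not the original code.
 */
public class ReplicationConsumer implements MessageListener {

    /** Hypothetical wrapper around the Cassandra (Hector) write path. */
    public interface EventWriter {
        void write(String payload);
    }

    private final EventWriter writer;

    public ReplicationConsumer(EventWriter writer) {
        this.writer = writer;
    }

    /** Attach this consumer to the event queue; the factory comes from the JMS provider in use. */
    public void start(ConnectionFactory factory, Queue eventQueue) throws Exception {
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(eventQueue);
        consumer.setMessageListener(this);      // fetch from queue
        connection.start();
    }

    @Override
    public void onMessage(Message message) {
        try {
            if (message instanceof TextMessage) {
                writer.write(((TextMessage) message).getText());   // write event into Cassandra
            }
        } catch (Exception e) {
            // A real consumer would retry or dead-letter here.
            e.printStackTrace();
        }
    }
}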

Role of the Data Model
Before we get there: what features are missing from Cassandra in comparison to a traditional RDBMS?

Shortcomings… Opportunities
•  No joins across column families
•  No analytical functions such as sum, count, …
•  Difficulty in constructing "WHERE" clause predicates across composite columns
•  Inability to order a range of keys under the RandomPartitioner

Importance of the Data Model in Cassandra
•  In lieu of JOINs, "smart" de-normalization techniques are crucial
•  Need to use the "FEATURES" of Cassandra to effectively model the business rules and business data
•  "Client" or "application" code becomes extremely important
•  "APPLICATION" + "DATABASE" => the full package

Features of Cassandra Modeling
•  "WIDE" column family
   –  Organize data in a "horizontal" as opposed to the "vertical" fashion of an RDBMS
•  Automatic sorting of columns
   –  Important to model the data in COLUMNS as opposed to rows
•  Faster access to ALL columns of a row key
   –  All columns of a row key are stored on ONE server => fast iteration/aggregation
•  Useful info in the COLUMN NAME
   –  Ground-breaking from an RDBMS perspective
   –  Enables MORE information to be packed in
   –  The COLUMN as an entity becomes MORE POWERFUL
•  COMPOSITE column names
   –  Column names can be composites, made up of multiple components
   –  Auto-sorting still works

Data Model
Wide rows with sharding
Row key = "<min>|<part#>"
Role of the partition #:
•  Each row is stored by a single server, and with 5,000 x 60 = 300,000 events per minute, that would put a large load on a single server for a minute.
•  The "partition" contraption aims to break up this huge row, remove hotspots, and spread the load across (possibly) all servers.
•  The # of partitions is some multiple of the # of servers.
•  A finite # of partitions still keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch the records for them (see the row-key sketch below).
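A minimal sketch (Java) of how such a row key could be built and enumerated. The minute format and the hash-based choice of partition are illustrative assumptions; the deck only fixes the "<min>|<part#>" shape (the sample keys on the next slide look like "2012-07-18-08-13-p-1") and states that the partition count is a multiple of the server count.

import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Sketch of the "<minute>-p-<part#>" row key described above.
 * PARTITIONS, the minute format and the hash-based partition choice are
 * illustrative assumptions, not the original implementation.
 */
public class RowKeys {

    private static final int PARTITIONS = 12;   // some multiple of the number of servers

    /** Build the row key for one event: minute bucket plus partition number. */
    public static String rowKey(long eventTimeMillis, String userId) {
        // SimpleDateFormat is not thread-safe; use one instance per thread in real code.
        String minute = new SimpleDateFormat("yyyy-MM-dd-HH-mm").format(new Date(eventTimeMillis));
        int partition = ((userId.hashCode() % PARTITIONS) + PARTITIONS) % PARTITIONS;
        return minute + "-p-" + partition;
    }

    /** Reconstruct every key for a given minute, e.g. to fetch that minute's records. */
    public static String[] keysForMinute(String minute) {
        String[] keys = new String[PARTITIONS];
        for (int p = 0; p < PARTITIONS; p++) {
            keys[p] = minute + "-p-" + p;
        }
        return keys;
    }
}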

Composite Columns
•  Composite columns
   –  The actual message is stored as part of a composite column
•  Variable-granularity grouping
   –  Minute: row key based on the minute

Min_partition (TEXT)    | DC:TimeUUID:UserID:Message (Composite) | …
2012-07-18-08-13-p-1    | Status                                 | …
…                       | …                                      | …
2012-07-19-11-21-p-3    | Status                                 | …
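A minimal sketch (Java, Hector client) of writing one row of the table above: a composite column name made of DC, TimeUUID, UserID and the message, stored under a minute/partition row key. The cluster, keyspace and column family names, and the sample payload, are hypothetical; treat this as an illustration of the model rather than the original code.

import me.prettyprint.cassandra.serializers.CompositeSerializer;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.serializers.UUIDSerializer;
import me.prettyprint.cassandra.utils.TimeUUIDUtils;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Composite;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

/**
 * Sketch: one event as a composite column (DC : TimeUUID : UserID : Message)
 * under a "<minute>-p-<part#>" row key. Keyspace ("events") and column
 * family ("EventsByMinute") are hypothetical names.
 */
public class CompositeWriteExample {

    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("event-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("events", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        // Column name packs data center, time-based UUID, user id and the message payload.
        Composite name = new Composite();
        name.addComponent("DC1", StringSerializer.get());
        name.addComponent(TimeUUIDUtils.getUniqueTimeUUIDinMillis(), UUIDSerializer.get());
        name.addComponent("user-42", StringSerializer.get());
        name.addComponent("page=/checkout latency=850ms", StringSerializer.get());

        // Row key = minute bucket + partition, as in the table above.
        String rowKey = "2012-07-18-08-13-p-1";
        mutator.insert(rowKey, "EventsByMinute",
                HFactory.createColumn(name, "Status", CompositeSerializer.get(), StringSerializer.get()));
    }
}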

Benefits
Geo-Redundancy
(Diagram: four data centers: Data Center 1 (RW), Data Center 2 (RW), Data Center 3 (RO), Data Center 4 (RO))

Data Consolidation and Extraction
•  Single view of data across multiple locations
•  Data extraction can be performed in parallel
•  Data extraction process performed in a dedicated cluster of machines

Low-Latency & Batch Applications
•  Triaging
   –  Troubleshooting customer issues within 10 minutes of occurrence
   –  Feeding a dashboard of live feed data through aggregations performed in counter CFs (see the counter sketch below)
•  Analysis
   –  Analytical and ad hoc queries to eventually replace the need for a remote data warehouse
   –  Map/Reduce via Hive, without ETL
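A minimal sketch (Java, Hector) of the counter-CF idea behind the dashboard: every event increments a per-minute counter that the dashboard can read with a single slice. The keyspace ("events"), counter column family ("DashboardCounters") and the per-minute bucketing are assumptions made for illustration.

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

/**
 * Sketch of dashboard aggregation via a counter column family: each event
 * bumps a per-minute counter. "events" and "DashboardCounters" are
 * hypothetical names.
 */
public class DashboardCounters {

    private final Mutator<String> mutator;

    public DashboardCounters(String hostPort) {
        Cluster cluster = HFactory.getOrCreateCluster("event-cluster", hostPort);
        Keyspace keyspace = HFactory.createKeyspace("events", cluster);
        this.mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    }

    /** Count one event of the given type in its minute bucket (row = minute, column = type). */
    public void recordEvent(String minuteBucket, String eventType) {
        mutator.incrementCounter(minuteBucket, "DashboardCounters", eventType, 1L);
    }
}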

Opportunities Remaining
•  Near real-time pattern detection and response
•  Message loss in the JMS queue
•  JMS queue replication
•  Reducing the impact of queue failover on other applications

Further Improvements…

HOW ???

Accenture Cloud Platform
(Diagram: platform services including Recommender as a Service, …, and Network Analytics Services, built on a Big Data Platform)

Drivers
•  Consumer devices
•  Video usage

Issues
•  Operational costs
•  Understanding service quality degradation
•  Inefficient capacity planning

(Pipeline stages: INGEST, PROCESS, STORE, ANALYZE, VISUALIZE)

WHY STORM?

What do we need?
•  Scalability: data types, size, velocity
•  Reliability: mission critical data
•  Fault-tolerance
•  Processing, computation, etc.: time series / pattern analysis
•  Multiple use cases

How do we get this from Storm?
•  Scalability → Parallelization
•  Reliability → Processing guarantees
•  Fault-tolerance → Robust fail-over strategies
•  Processing, computation, etc. → Low-level primitives

PRIMITIVES
•  Stream: a sequence of tuples (Tuple, Tuple, Tuple, …), e.g. request info (IP, user-agent, etc.)
•  Spout: e.g. pulls messages from a distributed queue
•  Bolt: e.g. sessionization, speed calculation
•  Topology: e.g. suboptimal network speed, geospatial analysis

Integration with Cassandra

Cassandra
•  Optimal for time series data
•  Near-linearly scalable
•  Low read/write latency
•  Scales in conjunction with Storm

Custom Bolt
•  Uses the Hector API to access Cassandra
•  Creates dynamic columns per request
•  Stores relevant network data
(A minimal bolt sketch follows.)
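A minimal sketch (Java) of such a bolt, assuming each tuple carries "ip" and "speed" fields and that one dynamic column is written per request. The class name, field names and the keyspace/column family ("network", "SpeedByIp") are hypothetical; only the Storm bolt contract and the Hector calls are standard.

import java.util.Map;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

/**
 * Sketch of a custom bolt that writes each tuple into Cassandra with Hector,
 * one dynamic column per request. Field names ("ip", "speed") and the
 * keyspace/column family ("network", "SpeedByIp") are hypothetical.
 */
public class CassandraWriterBolt extends BaseBasicBolt {

    private transient Mutator<String> mutator;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        // One Hector connection per bolt instance (executor).
        Keyspace keyspace = HFactory.createKeyspace("network",
                HFactory.getOrCreateCluster("event-cluster", "localhost:9160"));
        mutator = HFactory.createMutator(keyspace, StringSerializer.get());
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String ip = tuple.getStringByField("ip");
        String speed = tuple.getStringByField("speed");
        // Dynamic column: name = request timestamp, value = measured speed.
        mutator.insert(ip, "SpeedByIp",
                HFactory.createStringColumn(String.valueOf(System.currentTimeMillis()), speed));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing emitted downstream.
    }
}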

SUBOPTIMAL NETWORK SPEED TOPOLOGY: AN EXAMPLE
(Diagram: Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra → Cassandra; tuples for ip 1 flow through the pipeline)

Parallelism
(Diagram: the same pipeline, Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra → Cassandra, with tuples for ip 1 and ip 2 flowing through the stages in parallel)
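A minimal sketch (Java, Storm's TopologyBuilder) of how this pipeline and its parallelism could be wired. The bolt classes other than the CassandraWriterBolt sketched earlier, the component ids and the parallelism hints are hypothetical; grouping on the "ip" field is one way to ensure all tuples for a given IP reach the same bolt instance, in line with the picture above.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

/**
 * Sketch of wiring the suboptimal-network-speed pipeline with TopologyBuilder.
 * PreProcessBolt, SessionizeBolt, SpeedPerSessionBolt, SpeedPerIpBolt and
 * SubOptimalSpeedBolt are hypothetical stand-ins for the stages in the
 * diagram; CassandraWriterBolt is the sketch shown earlier.
 */
public class SpeedTopology {

    public static void run(IRichSpout kafkaSpout) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The spout pulls raw request messages from the distributed queue (Kafka).
        builder.setSpout("kafka-spout", kafkaSpout, 2);

        builder.setBolt("pre-process", new PreProcessBolt(), 4)
               .shuffleGrouping("kafka-spout");

        // fieldsGrouping on "ip" routes all tuples for a given IP to the same
        // bolt instance, so per-IP state (sessions, running speed) stays local.
        builder.setBolt("sessionize", new SessionizeBolt(), 4)
               .fieldsGrouping("pre-process", new Fields("ip"));
        builder.setBolt("speed-per-session", new SpeedPerSessionBolt(), 4)
               .fieldsGrouping("sessionize", new Fields("ip"));
        builder.setBolt("speed-per-ip", new SpeedPerIpBolt(), 4)
               .fieldsGrouping("speed-per-session", new Fields("ip"));
        builder.setBolt("sub-optimal", new SubOptimalSpeedBolt(), 2)
               .shuffleGrouping("speed-per-ip");
        builder.setBolt("cassandra-writer", new CassandraWriterBolt(), 2)
               .shuffleGrouping("sub-optimal");

        Config conf = new Config();
        conf.setNumWorkers(2);
        // Local test run; StormSubmitter.submitTopology would be used on a cluster.
        new LocalCluster().submitTopology("nw-speed", conf, builder.createTopology());
    }
}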

Branching and Joins
(Diagram: two streams, each fed by a Kafka Spout. Stream 1: Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP. Stream 2: Speed by Location. A Join step combines them, e.g. Tuple (ip 1) + Tuple (NY) → Tuple (ip 1/NY), then Compare Speed → Store in Cassandra → Cassandra)
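A minimal wiring sketch (Java) for the branch-and-join step, assuming both upstream streams carry a shared "location" field that the join keys on. JoinBolt, CompareSpeedBolt and the upstream component ids are hypothetical; the Storm-specific point is that grouping both inputs on the same field routes matching tuples to the same join-bolt instance.

import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

/**
 * Sketch of the branch-and-join wiring: the join bolt subscribes to both the
 * per-IP speed stream and the per-location speed stream, grouped on a shared
 * "location" field. JoinBolt, CompareSpeedBolt and the upstream component ids
 * are hypothetical.
 */
public class JoinWiring {

    public static void wire(TopologyBuilder builder) {
        builder.setBolt("join", new JoinBolt(), 4)
               .fieldsGrouping("speed-per-ip", new Fields("location"))
               .fieldsGrouping("speed-by-location", new Fields("location"));
        builder.setBolt("compare-speed", new CompareSpeedBolt(), 2)
               .shuffleGrouping("join");
        builder.setBolt("cassandra-writer", new CassandraWriterBolt(), 2)
               .shuffleGrouping("compare-speed");
    }
}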

Lessons Learned

•  Rebalance Topology

•  Tweak parallelism in bolt

•  Isolation of Topologies

•  Use TimeUUIDUtils (see the sketch below)

•  Log4j level set to INFO by default
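A minimal sketch (Java) of the TimeUUIDUtils lesson: generate version-1 (time-based) UUIDs through Hector's helper rather than by hand, so the composite column names built from them stay unique under load and sort in time order. The surrounding class is illustrative only.

import java.util.UUID;

import me.prettyprint.cassandra.utils.TimeUUIDUtils;

/**
 * Illustration of Hector's TimeUUIDUtils: unique, time-ordered UUIDs for
 * composite column names.
 */
public class TimeUuidExample {

    public static void main(String[] args) {
        // Unique even when many events arrive within the same millisecond.
        UUID columnId = TimeUUIDUtils.getUniqueTimeUUIDinMillis();

        // Deterministic variant for a known timestamp, e.g. to build the
        // start/end of a column slice for one minute.
        UUID sliceStart = TimeUUIDUtils.getTimeUUID(System.currentTimeMillis());

        System.out.println(columnId + " / " + sliceStart);
    }
}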

Thank You

Q & A