Cassandra & Next Generation Analysis
Cassandra for a high-velocity data ingestion and real-time analysis system.
Ameet Chaubal & Fausto Inestroza
Presentation Route
• Describe conventional technology solution
• Highlight deficiencies
• Showcase new solution implemented using Cassandra
• Lay out architecture with improvements
Business Case
• Capture messages from a high-volume e-Commerce site
• Store them into a database
• Perform near real-time queries for troubleshooting
• Perform deeper analysis à la BI
Olden Days…
eCommerce Website → JMS Queue → Transient Storage (RDBMS) → Data Warehouse (Analysis)
Business Case, Details…
Message rate: 5,000 msg/sec (~250 million/day); message size: 1 KB
eCommerce Website → JMS Queue (decouples UI from storage; multiple sinks) → Transient Storage, RDBMS (dedicated storage; triage) → Data Warehouse (data analysis; business intelligence)
What’s the problem?
[Diagram: SITE I and SITE II each with their own JMS queue and transient storage; a batch load feeds a shared data warehouse.]
• Queue replication problems
• Message loss
• Other applications affected in case of failover
• Triage data isolated
• No universal view
• Data consolidation adds delay
• Inability to keep up with increasing messages
• Analysis always lagging the action
• No low-latency queries
Problems Recap
• High write speed needed: over 5,000 msg/sec
• Extraction & load very slow: ETL from transient storage to the data warehouse takes over 4 hours
• Analysis always lags events by hours: ETL performed in batches 4 hours apart
• No high availability: no geo-redundancy for transient storage
• Data stored in disparate buckets: no universal view of data for triage/troubleshooting applications
• No dashboard: no low-latency queries
• No immediate alerts or pattern detection: no real-time analysis
Solution Blueprint
Online e-Commerce Application → JMS Publisher → Event JMS queue (write event to queue)
Replication Consumers (fetch from queue): Hector / Java Client 1, 2, … n → Load Balancer (VIP) → Thrift Connection Pool → Cassandra
Cassandra + Hadoop: Map/Reduce → Hive Queries / BI
Real-Time Dashboard
Role of the Data Model
Before we get there: what features are missing from Cassandra in comparison to a traditional RDBMS?
Shortcomings… Opportunities
• No joins across column families
• No analytical functions such as SUM, COUNT, …
• Difficulty in constructing WHERE-clause predicates across composite columns
• Inability to order a range of keys under the RandomPartitioner
Importance of the Data Model in Cassandra
• In lieu of JOINs, “smart” de-normalization techniques are crucial
• Need to use the features of Cassandra to effectively model the business rules and business data
• “Client” or “application” code becomes extremely important
• Application + database => the full package
Features of Cassandra Modeling
• “Wide” column family
  – Organize data “horizontally” as opposed to the “vertical” fashion of an RDBMS
• Automatic sorting of columns
  – Important to model the data in columns as opposed to rows
• Fast access to all columns of a row key
  – All columns of a row key are stored on one server => fast iteration/aggregation
• Useful information in the column name
  – Ground-breaking from an RDBMS perspective
  – Enables more information to be packed in
  – The column itself becomes a more powerful entity
• Composite column names
  – Column names can be composites, made up of multiple components
  – Auto-sorting still works
Data Model: Wide Rows with Sharding
Row key = “<min>|<part#>”
Role of the partition #:
• Each row is stored by a single server; at 5,000 × 60 = 300,000 events per minute, an unsharded row would put a large load on one server for that minute.
• The “partition” contraption breaks up this huge row, removes hotspots, and spreads the load across (potentially) all servers.
• The number of partitions is some multiple of the number of servers.
• A finite number of partitions still keeps the row key meaningful, i.e. we can construct the keys for a certain minute and fetch the records for them.
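The sharding scheme above can be sketched in plain Java. The partition count and key format are illustrative (the deck only specifies “<min>|<part#>”); the real system would build these keys for its Hector writes and reads:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the "<min>|<part#>" sharded row-key scheme; names are illustrative. */
public class ShardedRowKey {
    // Assumed: some multiple of the server count, per the slide.
    static final int NUM_PARTITIONS = 12;

    /** Writer side: spread one minute's events across NUM_PARTITIONS row keys. */
    static String rowKeyFor(String minuteBucket, String userId) {
        int part = Math.floorMod(userId.hashCode(), NUM_PARTITIONS);
        return minuteBucket + "|" + part;
    }

    /** Reader side: the finite partition count lets us enumerate every key for a minute. */
    static List<String> keysForMinute(String minuteBucket) {
        List<String> keys = new ArrayList<>();
        for (int p = 0; p < NUM_PARTITIONS; p++) keys.add(minuteBucket + "|" + p);
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(rowKeyFor("2012-07-18-08-13", "user-42"));
        System.out.println(keysForMinute("2012-07-18-08-13").size()); // 12
    }
}
```

Because every writer hashes the same way, one minute's load fans out over 12 rows (and thus up to 12 servers), while a reader can still reconstruct all 12 keys for that minute.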
Composite Columns
• Actual message stored as part of the composite column
• Variable-granularity grouping
  – Minute: row key based on the minute

Min_partition (TEXT)  | DC:TimeUUID:UserID:Message (Composite) | …
2012-07-18-08-13-p-1  | Status                                 | …
…                     | …                                      | …
2012-07-19-11-21-p-3  | Status                                 | …
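The automatic sorting of composite column names can be simulated locally with a sorted map. This is a sketch, not Hector's Composite API: the component values are made-up stand-ins, and a string key with sortable components stands in for Cassandra's server-side comparator:

```java
import java.util.TreeMap;

/** Simulates sorted composite column names (DC:TimeUUID:UserID:Message) with a
 *  TreeMap; in Cassandra the comparator orders columns on the server. */
public class CompositeColumnSketch {
    /** One wide row; keys mimic composite names with sortable components. */
    static TreeMap<String, String> sampleRow() {
        TreeMap<String, String> row = new TreeMap<>();
        // Illustrative components: dc : zero-padded-time : userId : message.
        row.put("dc1:000120:u42:login", "Status");
        row.put("dc1:000115:u17:click", "Status");
        row.put("dc1:000118:u42:view",  "Status");
        return row;
    }

    public static void main(String[] args) {
        // Iterating the row yields columns in time order automatically.
        sampleRow().keySet().forEach(System.out::println);
    }
}
```

Reading the row back returns the events already time-ordered, which is what makes packing the message into the column name pay off.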
Benefits: Geo-Redundancy
Data Center 1 (RW) · Data Center 2 (RW) · Data Center 3 (RO) · Data Center 4 (RO)
Data Consolidation and Extraction
• Single view of data across multiple locations
• Data extraction can be performed in parallel
• Data extraction process runs on a dedicated cluster of machines
Low-Latency & Batch Applications
• Triaging
  – Troubleshooting customer issues within 10 minutes of occurrence
  – Feeding a dashboard of live data through aggregations performed in counter CFs
• Analysis
  – Analytical and ad hoc queries to eventually replace the need for a remote data warehouse
  – Map/Reduce via Hive without ETL
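The counter-CF idea behind the dashboard can be sketched locally: each ingested event increments a counter keyed by minute bucket and event type, and the dashboard reads the pre-aggregated counts. Names here are hypothetical; in the real system the increments go to a Cassandra counter column family rather than an in-memory map:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

/** Local sketch of counter-CF aggregation feeding a live dashboard. */
public class DashboardCounters {
    private final ConcurrentMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Called once per ingested event; cheap enough for thousands of msg/sec. */
    void record(String minuteBucket, String eventType) {
        counters.computeIfAbsent(minuteBucket + "|" + eventType, k -> new LongAdder())
                .increment();
    }

    /** Dashboard read: the aggregate is already materialized, no scan needed. */
    long count(String minuteBucket, String eventType) {
        LongAdder a = counters.get(minuteBucket + "|" + eventType);
        return a == null ? 0 : a.sum();
    }

    public static void main(String[] args) {
        DashboardCounters dc = new DashboardCounters();
        dc.record("2012-07-18-08-13", "error");
        dc.record("2012-07-18-08-13", "error");
        dc.record("2012-07-18-08-14", "error");
        System.out.println(dc.count("2012-07-18-08-13", "error")); // 2
    }
}
```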
Opportunities Remaining
• Near real-time pattern detection and response
• Message loss in the JMS queue
• JMS queue replication
• Reducing the impact of queue failover on other applications
Further Improvements…
HOW ???
Accenture Cloud Platform
Recommender as a Service
…
Network Analytics Services
Big Data Platform
Drivers
• Consumer devices
• Video usage
Issues
• Operational costs
• Understanding service-quality degradation
• Inefficient capacity planning
INGEST PROCESS
VISUALIZE
ANALYZE
STORE
WHY STORM?

What do we need?
• Scalability: data types, size, velocity
• Reliability: mission-critical data
• Fault-tolerance: multiple use cases
• Processing, computation, etc.: time-series / pattern analysis

How do we get this from Storm?
• Scalability → parallelization
• Reliability → processing guarantees
• Fault-tolerance → robust fail-over strategies
• Processing, computation, etc. → low-level primitives
Primitives
• Stream: an unbounded sequence of tuples (Tuple, Tuple, Tuple, …)
• Spout: the source of a stream, e.g. pulling messages from a distributed queue and emitting request info (IP, user-agent, etc.)
• Bolt: a processing step, e.g. sessionization, speed calculation, sub-optimal network speed, geospatial analysis
• Topology: the graph of spouts and bolts wired together
Integration with Cassandra
• Cassandra: optimal for time-series data; near-linearly scalable; low read/write latency; scales in conjunction with Storm
• Custom bolt: uses the Hector API to access Cassandra; creates dynamic columns per request; stores the relevant network data
Suboptimal Network Speed Topology: An Example
Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra
[Diagram: a stream of tuples for ip 1 flows through each stage into Cassandra.]
Parallelism
Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP → Identify Sub-Optimal Speed → Store in Cassandra
[Diagram: the same topology with multiple instances of each bolt; interleaved tuples for ip 1 and ip 2 flow through parallel copies of the pipeline.]
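The routing that makes a per-IP bolt like “Update Speed per IP” correct under parallelism is a fields grouping: Storm hashes the grouping field so all tuples with the same IP reach the same bolt instance. A minimal stdlib sketch of that routing rule (Storm does this internally; the task count is an assumed parallelism setting):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of a fields grouping on "ip": same IP always maps to the same task. */
public class FieldsGroupingSketch {
    static int taskFor(String ip, int numTasks) {
        return Math.floorMod(ip.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4; // assumed parallelism of the per-IP bolt
        List<String> tuples = List.of("ip1", "ip2", "ip1", "ip1", "ip2");
        Map<Integer, List<String>> byTask = new TreeMap<>();
        for (String ip : tuples)
            byTask.computeIfAbsent(taskFor(ip, numTasks), t -> new ArrayList<>()).add(ip);
        // Each bucket holds tuples for a single IP, so per-IP state stays local.
        System.out.println(byTask);
    }
}
```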
Branching and Joins
Stream 1: Kafka Spout → Pre-process → Sessionize → Calculate N/W Speed per Session → Update Speed per IP
Stream 2: Kafka Spout → Speed by Location
Join → Compare Speed → Store in Cassandra
[Diagram: Tuple (ip 1) from Stream 1 joins Tuple (NY) from Stream 2 to produce Tuple (ip 1/NY).]
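A join bolt like the one above typically buffers tuples from each stream and emits a combined tuple once both sides for a key have arrived. This is a minimal local sketch with hypothetical field names, not the actual bolt from the deck:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a two-stream join bolt keyed by IP. */
public class JoinBoltSketch {
    private final Map<String, Double> speedByIp = new HashMap<>();    // from Stream 1
    private final Map<String, String> locationByIp = new HashMap<>(); // from Stream 2

    /** Stream 1 tuple: (ip, measured speed). Returns the joined tuple, or null. */
    String onSpeed(String ip, double mbps) {
        speedByIp.put(ip, mbps);
        return tryJoin(ip);
    }

    /** Stream 2 tuple: (ip, location). Returns the joined tuple, or null. */
    String onLocation(String ip, String location) {
        locationByIp.put(ip, location);
        return tryJoin(ip);
    }

    /** Emits "ip/location=speed" once both sides have arrived. */
    private String tryJoin(String ip) {
        if (speedByIp.containsKey(ip) && locationByIp.containsKey(ip))
            return ip + "/" + locationByIp.get(ip) + "=" + speedByIp.get(ip);
        return null;
    }

    public static void main(String[] args) {
        JoinBoltSketch join = new JoinBoltSketch();
        System.out.println(join.onSpeed("ip1", 3.2));     // null: location not seen yet
        System.out.println(join.onLocation("ip1", "NY")); // joined tuple for ip1/NY
    }
}
```

A production join would also expire buffered state (e.g. by time window) so unmatched keys don't accumulate.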
Lessons Learned
• Rebalance Topology
• Tweak parallelism in bolt
• Isolation of Topologies
• Use TimeUUIDUtils
• Log4j level set to INFO by default
Thank You
Q & A