Analytics on Apache Cassandra,
an Operational Distributed Database
Victor Coustenoble, Solutions [email protected] @vizanalytics
Paris Tech Talks Meetup
March 24th 2015
Apache Cassandra™
• Massively scalable, Open Source, NoSQL, distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of the Amazon Dynamo and Google BigTable designs
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High Performance
• Multi Data Center
• Time Series
• Tunable Consistency
• Simple to Operate
• CQL language
• OpsCenter / DevCenter
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
High Availability and Strong Consistency!
• A single node failure shouldn’t cause requests to fail.
• Replication Factor + Consistency Level = Success
• This example:
• RF = 3
• CL = QUORUM (a majority of replicas: floor(RF / 2) + 1 = 2 of 3)
©2014 DataStax Confidential. Do not distribute without consent. 3
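[Diagram: five-node ring (Node 1 to Node 5) with RF = 3. A write at CL=QUORUM is sent in parallel to the three replicas: Node 1 (1st copy), Node 2 (2nd copy) and Node 3 (3rd copy). Acks arrive after 5 μs, 12 μs and 12 μs. As soon as a majority of the replicas (2 of 3) has acknowledged, the request is a success.]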
CL(Read) + CL(Write) > RF => Strong Consistency
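The rule above can be checked with a few lines of arithmetic. This is a minimal sketch, not Cassandra code: the helper names `quorum` and `is_strongly_consistent` are hypothetical, and consistency levels are represented simply as the number of replicas that must respond.

```python
def quorum(rf: int) -> int:
    """Replicas that must respond at CL=QUORUM: a strict majority of RF."""
    return rf // 2 + 1


def is_strongly_consistent(read_cl: int, write_cl: int, rf: int) -> bool:
    """True when every read set overlaps every write set: R + W > RF."""
    return read_cl + write_cl > rf


rf = 3
q = quorum(rf)  # 2 of 3 replicas

# QUORUM reads + QUORUM writes: 2 + 2 > 3, so reads see the latest write.
print(q, is_strongly_consistent(q, q, rf))

# ONE read + ONE write: 1 + 1 <= 3, so a read may hit a stale replica.
print(is_strongly_consistent(1, 1, rf))
```

With RF = 3, QUORUM on both sides satisfies the inequality, while ONE on both sides does not; writing at ALL and reading at ONE (3 + 1 > 3) would also satisfy it, at the cost of write availability.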
Real-Time / Operational Big Data Use Cases
Recommendation Engine
Internet of Things
Fraud Detection
Risk Analysis
Buyer Behaviour Analytics
Telematics, Logistics
Business Intelligence
Infrastructure Monitoring
…
How to do analytics on Cassandra data?
Remember …
Cassandra = no JOIN, no GROUP BY, filtering on the primary key (PK) only
Cassandra needs a distributed processing framework
Data model independent queries
Cross-table operations (JOIN, UNION, etc.)
Complex analytics (e.g. machine learning)
Data transformation, aggregation, etc.
Stream processing
And much more …
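To make the missing JOIN and GROUP BY concrete, here is a small sketch of the work a processing layer has to do on rows read from two separate tables. It is plain Python, not Cassandra or Spark API; the `users` and `orders` rows and the function `revenue_by_country` are hypothetical illustrations.

```python
from collections import defaultdict

# Hypothetical rows, as they might come back from two separate tables.
users = [
    {"user_id": 1, "country": "FR"},
    {"user_id": 2, "country": "FR"},
    {"user_id": 3, "country": "DE"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 1, "amount": 5.0},
    {"order_id": 12, "user_id": 3, "amount": 12.5},
]


def revenue_by_country(users, orders):
    """JOIN orders to users on user_id, then GROUP BY country and SUM amount."""
    # Build side of a hash join: user_id -> country.
    country_of = {u["user_id"]: u["country"] for u in users}
    totals = defaultdict(float)
    for order in orders:  # probe side
        country = country_of.get(order["user_id"])
        if country is not None:  # inner join: skip orders with no matching user
            totals[country] += order["amount"]
    return dict(totals)


print(revenue_by_country(users, orders))
```

On a single machine this is trivial; the point of a distributed framework is to run the same join and aggregation in parallel across the nodes that hold the data.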
Analytics on Cassandra
There are 4 ways to do analytics on Cassandra data:
• Integrated Search (Solr)
• Integrated Batch Analytics (Hadoop integrated) on Cassandra
• External Batch Analytics (External Hadoop; certified with Cloudera, HortonWorks)
• Integrated Near Real-Time Analytics (Spark)
• Virtual data centers, each optimised as required: different workloads, hardware, availability, etc.
• Cassandra will replicate the data for you – no ETL is necessary
• Each Cassandra node is started with Solr, Hadoop or Spark enabled
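Replication between such data centers is declared per keyspace. The sketch below builds a CQL `CREATE KEYSPACE` statement using `NetworkTopologyStrategy`; the helper `create_keyspace_cql`, the keyspace name `shop` and the data center names `Cassandra` and `Analytics` are assumptions for illustration and must match the names reported by your own cluster.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement with one replica count per data center."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    parts += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    replication = "{" + ", ".join(parts) + "}"
    return f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = {replication};"


# Hypothetical layout: 3 replicas in the operational DC, 2 in the analytics DC.
print(create_keyspace_cql("shop", {"Cassandra": 3, "Analytics": 2}))
```

Every write to the keyspace is then replicated to both data centers by Cassandra itself, which is what removes the need for a separate ETL step.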
[Diagram: Transactions and Analytics data centers, linked by Cassandra replication]
Enterprise Search
• Built-in enterprise search on Cassandra data via Solr integration
• Facets, Filtering, Geospatial search, Text Analysis, etc.
• Near real-time search operations
• Search queries from CQL and REST/Solr
• Standalone Solr shortcomings addressed by the integration:
• No bottleneck. Client can read/write to any Solr node.
• Search index partitioning and replication for scalability and availability.
• Multi-DC support
• Data durability (Solr lacks write-ahead log, data can be lost)
[Diagram: customer-facing nodes and Search nodes, linked by Cassandra replication]
Batch Analytics - Hadoop
• Integrated Hadoop 1.0.4
• CFS (Cassandra File System), no HDFS
• No single point of failure
• No Hadoop complexity – every node is built the same
• Hive / Pig / Sqoop / Mahout
[Diagram: customer-facing nodes and Hadoop nodes, linked by Cassandra replication]
External Batch Analytics - BYOH
Bring Your Own Hadoop
[Diagram labels: External Hadoop, Resource Manager, Hive request]
• Hadoop 2.0.x support
• Cassandra Node as a Data Node
• Example: Hive submits jobs to the Job tracker, which assigns tasks to Task trackers installed on the Cassandra (C*) nodes
• Certified with Cloudera, HortonWorks
[Diagram label: Cassandra nodes]
Real-Time Analytics - Spark
• Tight integration with Cassandra
• Distributed Processing
• “In-memory Map/Reduce”, multi-threaded, best suited to iterative workloads
• GraphX, MLlib, Spark SQL, Shark (Hive-like SQL)
• Spark Streaming - Real-time
• DataStax / Databricks partnership
• Up to 10x – 100x faster than MapReduce
[Diagram: customer-facing nodes and Spark nodes, linked by Cassandra replication]
« Big Data » SDK
Real-time Big Data
Data Enrichment
Batch Processing
Machine Learning
Pre-computed aggregates
Data
NO ETL
Spark Use Cases
Load data from various sources
Analytics (join, aggregate, transform, …)
Sanitize, validate, normalize data
Schema migration, data conversion
Hot / Cold Data in a DataStax architecture
© 2014 DataStax, All Rights Reserved. Company Confidential
Hot Data
Online Operational Application
Cold Data
Offline Application
DataStax Cassandra Enterprise
DataStax Enterprise vs. Hadoop
NoSQL Matters Paris
Tracks from Duy Hai Doan – Cassandra technical advocate
@doanduyhai
• Day 1 (Thursday 26), 13:45 – 17:45
Training: Introduction to Apache Cassandra, CQL and Data Modelling
• Day 2 (Friday 27), 16:30 – 17:15
Conference: Real-time analytics with Cassandra and Spark
Cassandra Days
Thanks
We power the big data apps that transform business.