DataStax - Analytics on Apache Cassandra - Paris Tech Talks meetup

Analytics on Apache Cassandra, an Operational Distributed Database

Victor Coustenoble Solutions [email protected] @vizanalytics

Paris Tech Talks Meetup

March 24th 2015

Apache Cassandra™

• Massively scalable, Open Source, NoSQL, distributed database built for modern, mission-critical online applications

• Written in Java; a hybrid of Amazon Dynamo and Google BigTable

• Masterless with no single point of failure

• Distributed and data center aware

• 100% uptime

• Predictable scaling

• High Performance

• Multi Data Center

• Time Series

• Tunable Consistency

• Simple to Operate

• CQL language

• OpsCenter / DevCenter

BigTable: http://research.google.com/archive/bigtable-osdi06.pdf

Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

High Availability and Strong Consistency!

• A single node failure shouldn't cause a request to fail.

• Replication Factor + Consistency Level = Success

• This example:

• RF = 3

• CL = QUORUM (a majority of the replicas, here 2 of 3)

©2014 DataStax Confidential. Do not distribute without consent.

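[Diagram: five nodes. A write at CL=QUORUM is sent in parallel to the replicas on Node 1 (1st copy), Node 2 (2nd copy) and Node 3 (3rd copy). The acks arrive after 5 μs, 12 μs and 12 μs.]

• Once a quorum of the replicas (2 of 3) has acknowledged, the request is a success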

CL(Read) + CL(Write) > RF => Strong Consistency
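
The rule above is simple arithmetic, and a short sketch makes it concrete. This is an illustration only, not part of the original slides: the function names `quorum` and `is_strongly_consistent` are made up here, and consistency levels are expressed as the number of replicas that must respond.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas in a quorum: a strict majority of RF."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_cl: int, write_cl: int, replication_factor: int) -> bool:
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_cl + write_cl > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)  # 2 replicas out of 3
    # QUORUM reads combined with QUORUM writes
    print(q, is_strongly_consistent(q, q, rf))
    # ONE reads combined with ONE writes
    print(is_strongly_consistent(1, 1, rf))
```

With RF = 3, QUORUM reads plus QUORUM writes give 2 + 2 > 3, so a read always touches at least one replica that holds the latest write. Reading and writing at ONE gives 1 + 1, which is not greater than 3, so a read may miss the latest write.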

Real-Time / Operational Big Data Use Cases

Recommendation Engine

Internet of Things

Fraud Detection

Risk Analysis

Buyer Behaviour Analytics

Telematics, Logistics

Business Intelligence

Infrastructure Monitoring

…

How to do analytics on Cassandra data?

Remember …

Cassandra = no JOIN, no GROUP BY, filter on primary key (PK) only

Cassandra needs a distributed processing framework

Data model independent queries

Cross-table operations (JOIN, UNION, etc.)

Complex analytics (e.g. machine learning)

Data transformation, aggregation, etc.

Stream processing

Much more …
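
To make the "cross-table operations" point concrete, here is a minimal sketch of the work a processing layer has to do on top of Cassandra: a JOIN followed by a GROUP BY. It is plain single-process Python, not a distributed framework, and the `users` and `orders` rows and their field names are hypothetical sample data, not taken from the talk.

```python
from collections import defaultdict

# Hypothetical rows, as they might be read from two separate Cassandra tables.
users = [
    {"user_id": 1, "country": "FR"},
    {"user_id": 2, "country": "UK"},
    {"user_id": 3, "country": "FR"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 2, "amount": 5.0},
    {"order_id": 12, "user_id": 3, "amount": 7.5},
    {"order_id": 13, "user_id": 1, "amount": 2.5},
]


def revenue_by_country(users, orders):
    """JOIN orders to users on user_id, then GROUP BY country and SUM(amount)."""
    # Build side of a hash join: user_id -> country
    country_of = {u["user_id"]: u["country"] for u in users}
    totals = defaultdict(float)
    for order in orders:
        country = country_of.get(order["user_id"])
        if country is not None:  # inner join: skip orders without a matching user
            totals[country] += order["amount"]
    return dict(totals)


if __name__ == "__main__":
    print(revenue_by_country(users, orders))
```

A distributed framework performs the same two steps, but partitions both tables across the cluster and runs the join and the aggregation in parallel.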

Analytics on Cassandra

There are 4 ways to do analytics on Cassandra data:

• Integrated Search (Solr)

• Integrated Batch Analytics (Hadoop integrated) on Cassandra

• External Batch Analytics (External Hadoop; certified with Cloudera, HortonWorks)

• Integrated Near Real-Time Analytics (Spark)

• Multiple virtual data centers, each optimised as required: different workloads, hardware, availability, etc.

• Cassandra will replicate the data for you – no ETL is necessary

• Cassandra node started with Solr, Hadoop or Spark
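
As a sketch of how per-data-center replication is declared, the helper below renders a CQL `CREATE KEYSPACE` statement that uses `NetworkTopologyStrategy`. The helper `create_keyspace_cql`, the keyspace name `shop` and the data center names `Transactions` and `Analytics` are assumptions chosen for illustration; real data center names depend on how the cluster is configured.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Render a CREATE KEYSPACE statement with one replication factor per data center."""
    options = ["'class': 'NetworkTopologyStrategy'"]
    options += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    body = ", ".join(options)
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {" + body + "};"
    )


if __name__ == "__main__":
    # Hypothetical layout: 3 copies in the operational DC, 2 in the analytics DC.
    print(create_keyspace_cql("shop", {"Transactions": 3, "Analytics": 2}))
```

Once such a keyspace exists, writes accepted in one data center are replicated to the other by Cassandra itself, which is why no separate ETL step is needed.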

[Diagram: Cassandra replication between a Transactions data center and an Analytics data center]

Enterprise Search

• Built-in enterprise search on Cassandra data via Solr integration

• Facets, Filtering, Geospatial search, Text Analysis, etc.

• Near real-time search operations

• Search queries from CQL and REST/Solr

• Solr shortcomings addressed by the integration:

• No bottleneck. Client can read/write to any Solr node.

• Search index partitioning and replication for scalability and availability.

• Multi-DC support

• Data durability (Solr lacks write-ahead log, data can be lost)

[Diagram: Cassandra replication between customer-facing nodes and Search nodes]

Batch Analytics - Hadoop

• Integrated Hadoop 1.0.4

• CFS (Cassandra File System), no HDFS

• No single point of failure

• No Hadoop complexity – every node is built the same

• Hive / Pig / Sqoop / Mahout
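
For readers new to the batch model that Hive and Pig jobs are compiled to, here is a toy sketch of the three MapReduce phases. It is plain single-process Python and does not use the Hadoop API; the function names and the `sensor` field are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one input record: here (sensor, 1)."""
    yield record["sensor"], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key: here a simple count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory list of records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    readings = [{"sensor": "a"}, {"sensor": "b"}, {"sensor": "a"}]
    print(run_job(readings))
```

In a real cluster the map and reduce tasks run on many nodes at once, and intermediate results are written to disk between phases.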


[Diagram: Cassandra replication between customer-facing nodes and Hadoop nodes]

External Batch Analytics - BYOH

Bring Your Own Hadoop

[Diagram labels: External Hadoop, Resource Manager, Hive request]

• Hadoop 2.0.x support

• Cassandra Node as a Data Node

• Example: Hive submits jobs to the Job tracker, which assigns tasks to the Task trackers installed on the Cassandra (C*) nodes

• Certified with Cloudera, HortonWorks

[Diagram label: Cassandra nodes]

Real-Time Analytics - Spark

• Tight integration with Cassandra

• Distributed Processing

• “In-memory Map/Reduce”, multi-thread, best for iterations

• GraphX, MLlib, Spark SQL, Shark (Hive-like SQL)

• Spark Streaming - Real-time

• DataStax / Databricks partnership

• 10x to 100x the speed of MapReduce
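
The "best for iterations" point can be illustrated with a toy example: an iterative job that keeps its working set in memory reads the data once, whereas a job that goes back to storage on every pass reads it once per iteration. The sketch below is plain Python and does not use the Spark API; the `Dataset` class and `iterative_mean` function are hypothetical.

```python
class Dataset:
    """Toy dataset that counts how often it is loaded from 'storage'."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.loads = 0

    def read(self, cached: bool):
        """Return the data, reloading it unless a cached copy may be used."""
        if cached and self._cache is not None:
            return self._cache
        self.loads += 1
        data = list(self._loader())
        if cached:
            self._cache = data
        return data


def iterative_mean(dataset, iterations: int, cached: bool) -> float:
    """Approach the mean of the data with repeated gradient steps, one pass each."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.read(cached)
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


if __name__ == "__main__":
    in_memory = Dataset(lambda: [2.0, 4.0, 6.0])
    reloaded = Dataset(lambda: [2.0, 4.0, 6.0])
    iterative_mean(in_memory, 20, cached=True)
    iterative_mean(reloaded, 20, cached=False)
    print(in_memory.loads, reloaded.loads)
```

Both runs converge to the same estimate; only the number of loads differs (one against twenty), and that saving is what in-memory processing offers to iterative algorithms such as machine learning training.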


[Diagram: Cassandra replication between customer-facing nodes and Spark nodes]

« Big Data » SDK

Real-time Big Data


Data Enrichment

Batch Processing

Machine Learning

Pre-computed aggregates

Data

NO ETL

Spark Use Cases

Load data from various sources

Analytics (join, aggregate, transform, …)

Sanitize, validate, normalize data

Schema migration, data conversion
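
As an illustration of the "sanitize, validate, normalize" use case, here is a minimal sketch of a per-record cleaning function of the kind that could be applied to every row in a batch job. It is plain Python, and the record fields `email` and `amount` and the validation rules are hypothetical.

```python
def normalize_record(raw: dict):
    """Return a cleaned record, or None when validation fails."""
    # Sanitize: trim whitespace and lower-case the email
    email = str(raw.get("email", "")).strip().lower()
    if "@" not in email:
        return None  # validation: not an email address
    try:
        # Normalize: accept a decimal comma and round to two decimals
        amount = round(float(str(raw.get("amount", "")).replace(",", ".")), 2)
    except ValueError:
        return None  # validation: amount is not a number
    if amount < 0:
        return None  # validation: negative amounts are rejected
    return {"email": email, "amount": amount}


def clean(records):
    """Apply normalize_record to every record and drop the rejected ones."""
    return [r for r in map(normalize_record, records) if r is not None]


if __name__ == "__main__":
    rows = [
        {"email": " Bob@Example.COM ", "amount": "12,50"},
        {"email": "nope", "amount": "1"},
        {"email": "a@b.c", "amount": "x"},
    ]
    print(clean(rows))
```

Because the function handles one record at a time and keeps no shared state, the same logic can be applied in parallel across partitions of a table.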

Hot / Cold Data in a DataStax architecture

© 2014 DataStax, All Rights Reserved. Company Confidential

Hot Data

Online Operational Application

Cold Data

Offline Application

DataStax Cassandra Enterprise

DataStax Enterprise vs. Hadoop


NoSQL Matters Paris

Sessions by Duy Hai Doan, Cassandra technical advocate

@doanduyhai

• Day 1 (Thursday 26) 13:45 – 17:45

Training: Introduction to Apache Cassandra, CQL and Data Modelling

• Day 2 (Friday 27) 16:30 – 17:15

Conference: Real time analytics with Cassandra and Spark

Cassandra Days

Thanks

We power the big data apps that transform business.

