Upload
stratio
View
966
Download
0
Embed Size (px)
DESCRIPTION
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain a better result than using Spark over HDFS because Cassandra´s philosophy is much closer to RDD's philosophy than what HDFS is. The goal with Cassandra is to have a system that mines all the information stored in C* in a much more efficient way than having the information stored in HDFS. Cassandra data storage and Spark data mining power: an unrivalled mix.
Citation preview
Hadoop?
Cassandra?
Spark?
Stratio Deep
An efficient data mining solution
“Two and two are four?
Sometimes… Sometimes they are five.”
G. Orwell
#StratioBD
Goals
• Why do you need Cassandra?• What is the problem?• Why do you need Spark?• How do they work together?
#StratioBD
Cassandra
#StratioBD
• Based on DynamoDB…• Replication, Key/Value, P2P• And based on Big Table…• Column oriented
ROBUST FAST EFFICENT
NO BOTTLENECK REPLICATE
DDECENTRALIZED
Another Databas
e?
Why?
One User – Lot of data
Case A
#StratioBD
Many User – Few data
Case B
#StratioBD
Many user – Lot of data
Case C
#StratioBD
Crawler app
#StratioBD
Cassandra, I choose you
100M
Indexedpages
3kreads
Query time
< 1s
But…
Marketingwalks in
New query
“I need to find all the reference to the domain
ACME. I need the answer by Friday.”
#StratioBD
Problem
Cassandra is not well suited to resolved this
type of queries
You need to design the schema with the query
in mind
#StratioBD
ChallengeAccepted
What options do we have?
• Run Hive Query on top of C*• Write an ETL script and load data into
another DB• Clone the cluster
#StratioBD
What options do we have?
Run Hive Query on top of C*
Write ETL scripts and load into another DB
Clone the cluster
#StratioBD
And now… what can we do?
“We can't solve problems by using the same kind of thinking
we used when we created them”
#StratioBD
Albert Einstein
• Alternative to MapReduce• A low latency cluster computing system• For very large datasets• Create by UC Berkeley AMP Lab in 2010.• May be 100 times faster than MapReduce for:
Interactive algorithms. Interactive data mining
Spark
#StratioBD
Logistic regression inSpark vs Hadoop
SOURCE | http://spark.incubator.apache.org/
#StratioBD
WHO USES SPARK?
Spark and Cassandra
Integration points
#StratioBD
Cassandra’s HDFS abstraction layer
Advantantages:• Easily integrates with legacy systems.
Drawbacks:• Very high-level: no access to low level Cassandra’s features.• Questionable performance.
INTEGRATION POINTS: HDFS OVER CASSANDRA
#StratioBD
Cassandra’s Hadoop Interface• Thrift protocol• CQL3 (our implementation)
Uses the novel Cassandra’s
CqlPagingInputFormat
INTEGRATION POINTS: HDFS OVER CASSANDRA
#StratioBD
• Supports CQL3 features• Respects data locality • Good compromise between performance / implementation complexity
CQL3 Integration
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
#StratioBD
CQL3 Integration (II)
Provides a Java friendly API:
• Developers map Column Families to custom serializable
POJOs
• StratioDeep wraps the complexity of performing Spark
calculations directly over the user provided POJOs.
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3
#StratioBD
Demo
Drawbacks:
• Still not preforming as well as we’d like
Uses Cassandra’s Hadoop Interface• No analyst-friendly interface:
No SQL-like query features
CQL3 Integration (III)
INTEGRATION POINTS: CASSANDRA’S HADOOP INTERFACE – CQL3#StratioBD
Bring the integration to another level:
• Dump Cassandra’s Hadoop Interface• Direct access to Cassandra’s SSTable(s) files.• Extend Cassandra’s CQL3 to make use of Spark’s
distributed data processing power
Future extensions
What are we currently working on?
#StratioBD
#StratioBD
Conclusion
THANKS