The Cassandra database is an excellent choice when you need scalability and high availability without compromising performance. Cassandra's linear scalability, proven fault tolerance, and tunable consistency, combined with its optimization for write traffic, make it an attractive choice for structured logging of application and transactional events. But using a columnar store like Cassandra for analytical needs poses its own problems, problems we solved by careful construction of column families combined with diplomatic use of Hadoop. This tutorial focuses on building a similar system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects are:
• The splendors and miseries of NoSQL
• Apache Cassandra use cases
• Difficulties of using Map/Reduce directly in Cassandra
• Amazon cloud solutions: Elastic MapReduce and S3
• "Real-enough" time analysis
Polyglot Persistence in the Real World
Anton Yazovskiy Thumbtack Technology
• Software Engineer at Thumbtack Technology
• an active user of various NoSQL solutions
• consulting with a focus on scalability
• a significant part of my work is advising people on which solutions to use and why
• big fan of BigData and clouds
• NoSQL – not a silver bullet
• Choices that we make
• Cassandra: operational workload
• Cassandra: analytical workload
• The best of both worlds
• Some benchmarks
• Conclusions
• well-known ways to scale: scale in/out, scale by function, data denormalization
• really works
• each has disadvantages
• mostly a manual process (NewSQL)
(image: http://qsec.deviantart.com)
• solves exactly this kind of problem
• rapid application development
• aggregate
• schema flexibility
• auto-scale-out
• auto-failover
• amount of data it is able to handle
• shared-nothing architecture, no SPOF
• performance

• the splendors and miseries of the aggregate
• the CAP theorem dilemma
[Diagram: the CAP triangle: Consistency, Availability, Partition Tolerance]

[Diagram, shown twice: analytical workloads pull toward Consistency and Performance; operational workloads pull toward Availability and Reliability]

"I want it all"
Apache Cassandra (open-sourced by Facebook in 2008)
• elastic scalability & linear performance *
• dynamic schema
• very high write throughput
• tunable per-request consistency
• fault-tolerant design
• multiple datacenter and cloud readiness
• CAS (compare-and-set) transaction support *
* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra
• Large data sets on commodity hardware
• Tradeoff between speed and reliability
• Heavy-write workloads
• Time-series data
[Diagram: Cassandra sits on the Operational side, strong on Reliability and Performance, far from the Analytical side]
Small demo after this slide
[Diagram: a table of (TIMESTAMP, FIELD 1, …) rows such as 12344567 → DATA, distributed across servers: SERVER 1 holds 12326346, 13124124, 13237457; SERVER 2 holds 13627236]
• range queries across the cluster are expensive …
• … unless you shard by timestamp, but then that shard becomes a bottleneck for a heavy-write workload
select * from table where timestamp > 12344567 and timestamp < 13237457
• all columns are sorted by name
• row – aggregate item (never sharded)
Column Family:
row key 1 | column 1: value 1.1 | column 2: value 1.2 | column 3: value 1.3 | … | column N: value 1.N
row key 2 | column 1: value 2.1 | column 2: value 2.2 | … | column M: value 2.M
Super columns are discouraged and omitted here
• get key
• get slice
• get range
+ combinations of these queries
+ composite columns
SERVER 1:
row key 1 | timestamp, timestamp, timestamp, timestamp
row key 2 | timestamp, timestamp, timestamp
row key 3 | timestamp
row key 4 | timestamp, timestamp, timestamp, timestamp

SERVER 2:
row key 5 | timestamp, timestamp

(each row, with all its timestamp columns, lives on a single server)
get_slice(row_key, from, to, count)

get_slice("row key 1", from: "timestamp 1", null, 11)   // first page
get_slice("row key 1", from: "timestamp 11", null, 11)  // next page
get_slice("row key 1", null, to: "timestamp 11", 11)    // prev. page
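One way to turn those calls into a paging loop; a minimal sketch, assuming a hypothetical client exposing the get_slice(row_key, from, to, count) call above, a Column type with a getName() accessor, and a process() consumer (none of these are from the original slides). Fetching pageSize + 1 columns lets the extra column serve as the cursor for the next page.

int pageSize = 10;
String from = null;                                   // null: start from the beginning of the row
while (true) {
    List<Column> page = client.get_slice("row key 1", from, null, pageSize + 1);
    process(page.subList(0, Math.min(pageSize, page.size())));
    if (page.size() <= pageSize) {
        break;                                        // no boundary column: this was the last page
    }
    from = page.get(pageSize).getName();              // boundary column starts the next page
}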
Time-range with filter:
• "get all events for User J from N to M"
• "get all success events for User J from N to M"
• "get all events for all users from N to M"
events::success::User_123 | timestamp 1: value 1
events::success           | timestamp 1: value 1
events::User_123          | timestamp 1: value 1
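A minimal sketch of how such row keys might be composed and queried, assuming the same hypothetical client and get_slice signature as above; the "::" separator is the convention from the slides, everything else (variable names, the 1000 column limit) is illustrative.

// one pre-materialized row per filter combination, written at event time
String byUserAndStatus = "events::" + status + "::" + userId;  // e.g. events::success::User_123
String byStatus        = "events::" + status;                  // e.g. events::success
String byUser          = "events::" + userId;                  // e.g. events::User_123

// "get all success events for User J from N to M" is then one contiguous slice:
List<Column> events = client.get_slice(byUserAndStatus,
        String.valueOf(fromTs), String.valueOf(toTs), 1000);

The cost is writing each event once per row; the payoff is that every query in the list above becomes a single slice on a single server.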
Counters:
• "get # of events for User J grouped by hour"
• "get # of events for User J grouped by day"
events::success::User_123 | 1380400000: 14  | 1380403600: 42
events::User_123          | 1380400000: 842 | 1380403600: 1024
(group by day – the same, but in a different column family for TTL support; a write-path sketch follows below)
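A sketch of the write path for such counter rows. The incrementCounter(rowKey, columnName, delta) call is a hypothetical stand-in for whatever client the application uses; the point is the arithmetic that truncates an event timestamp to its hour bucket, which becomes the column name.

long eventTs = System.currentTimeMillis() / 1000;   // event time, seconds since epoch
long hourBucket = (eventTs / 3600) * 3600;          // truncate to the start of the hour
client.incrementCounter("events::" + userId, String.valueOf(hourBucket), 1);
client.incrementCounter("events::" + status + "::" + userId, String.valueOf(hourBucket), 1);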
• the row key should consist of a combination of fields with a high cardinality of values: name, id, etc.
• boolean values are a bad option
• composite columns are a good option for this
• a timestamp may help to spread historical data (see the sketch after this list)
• otherwise, scalability will not be linear
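A sketch of that last point, assuming the same "::" key convention: folding a coarse time bucket into the row key keeps any single row bounded and spreads history across the cluster.

long dayBucket = eventTs / 86400;                        // days since epoch
String rowKey = "events::" + userId + "::" + dayBucket;  // a fresh row per user per day
// without the bucket, one user's entire history piles into a single row on one replica set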
In theory, possible in real time:
• averages, 3-dimensional filters, group by, etc.

But:
• the data model is hard to tune
• lack of aggregation options
• aggregation over historical data
“I want interactive reports”
Cassandra
“Reports could be a little bit out of date, but I want to control this delay value”
Auto update somehow
• impact on the production system, or …
• … a higher total cost of ownership (for a separate analytics cluster)
• difficulties with scalability
• hard to support with multiple clusters
http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr
http://aws.amazon.com
• Hadoop tech stack
• Automatic deployment
• Management API
• Transient clusters
• Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back
Create a cluster and run a job:

JobFlowInstancesConfig instances = new JobFlowInstancesConfig();
instances.setHadoopVersion("1.0.3");              // placeholder version
instances.setInstanceCount(dataNodeCount + 1);    // +1 for the master node
instances.setMasterInstanceType("m1.medium");     // placeholder instance types
instances.setSlaveInstanceType("m1.medium");

RunJobFlowRequest req = new RunJobFlowRequest(name, instances);
req.setSteps(Arrays.asList(new StepConfig(name, jar)));  // jar: a HadoopJarStepConfig

AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);
emr.runJobFlow(req);
Execute a job on a running cluster:

StepConfig stepConfig = new StepConfig(name, jar);

AddJobFlowStepsRequest addReq = new AddJobFlowStepsRequest();
addReq.setJobFlowId(jobFlowId);              // id of the long-running cluster
addReq.setSteps(Arrays.asList(stepConfig));

AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);
emr.addJobFlowSteps(addReq);
• cluster lifecycle: Long-Running or Transient
• cold start = ~20 min
• tradeoff: cluster cost vs. availability
• compression and Combiner tuning can speed up jobs considerably
• problems common to all big data processing tools: monitoring, testability, and debugging (MRUnit, local Hadoop, smaller data sets)
// keep both stores consistent: journal the Cassandra write,
// confirm it only after the SQL transaction succeeds
long txId = -1;
try {
    txId = cassandra.persist(entity);
    sql.insert(some);
    sql.update(someElse);
    cassandra.commit(txId);
    sql.commit();
} catch (Exception e) {
    sql.rollback();
    cassandra.rollback(txId);
}
insert into CHANGES (key, committed, data)
values ('tx_id-58e0a7d7-eebc', 'false', ..);

update CHANGES set committed = 'true'
where key = 'tx_id-58e0a7d7-eebc';

delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc';
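A minimal sketch of how the commit and rollback calls from the try/catch block might map onto the CHANGES journal above. The execute call is an assumed stand-in for whatever CQL driver the application uses, and txKey is the journal row key derived from the transaction id; neither is the author's actual code.

void commit(String txKey) {
    // flip the journal entry: readers may now trust the data
    execute("update CHANGES set committed = 'true' where key = ?", txKey);
}

void rollback(String txKey) {
    // drop the uncommitted journal entry entirely
    execute("delete from CHANGES where key = ?", txKey);
}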
Non-production setup:
• 3 nodes (Cassandra)
• m1.medium EC2 instances
• 1 data center
• 1 app instance
In numbers
Real-time metrics update (sync):
• average latency: 60 msec
• processes > 2,000 events per second
• generates > 1,000 reports per second

Real-time metrics update (async):
• processes > 15,000 events per second

Uploading to AWS S3: slow, but multi-threading helps *
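A sketch of the "multi-threading helps" point using the AWS SDK for Java v1; the bucket name, key prefix, thread count, and file list are assumptions, not values from the benchmark.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

AmazonS3 s3 = new AmazonS3Client();                 // credentials from the default provider chain
ExecutorService pool = Executors.newFixedThreadPool(8);
for (File f : files) {
    // each putObject call is one blocking HTTP upload, so parallelism pays off
    pool.submit(() -> s3.putObject("my-bucket", "events/" + f.getName(), f));
}
pool.shutdown();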
It is more than enough, but what if …
• distributed systems force you to make decisions
• systems like Cassandra let you trade speed for consistency
• the CAP theorem is oversimplified: you have many more options
• polyglot persistence can make this world a better place
• do not try to hammer every nail with the same hammer
• Cassandra – great for time-series data and heavy-write workloads …
• … but use cases should be clearly defined
• Amazon S3 is great: simple, slow, but predictable storage
• Amazon EMR: the integration with S3 is great and the API is very good, but …
• … it isn't a magic trick and requires knowledge of Hadoop and skill to use effectively
/**
 * [email protected]
 * @yazovsky
 * www.linkedin.com/in/yazovsky
 */

/**
 * http://www.thumbtack.net
 * http://thumbtack.net/whitepapers
 */