The Cassandra database is an excellent choice when you need scalability and high availability without compromising performance. Cassandra's linear scalability, proven fault tolerance, and tunable consistency, combined with its optimization for write traffic, make it an attractive choice for structured logging of application and transactional events. But using a columnar store like Cassandra for analytical needs poses its own problems, problems we solved by careful construction of column families combined with diplomatic use of Hadoop. This tutorial focuses on building a similar system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects are:
• The splendors and miseries of NoSQL
• Apache Cassandra use cases
• Difficulties of using Map/Reduce directly in Cassandra
• Amazon cloud solutions: Elastic MapReduce and S3
• "Real-enough" time analysis
Polyglot Persistence in the Real World
Anton Yazovskiy Thumbtack Technology
• Software Engineer at Thumbtack Technology
• an active user of various NoSQL solutions
• consulting with a focus on scalability
• a significant part of my work is advising people on which solutions to use and why
• big fan of BigData and clouds
• NoSQL – not a silver bullet
• Choices that we make
• Cassandra: operational workload
• Cassandra: analytical workload
• The best of both worlds
• Some benchmarks
• Conclusions
• well-known ways to scale: scale in/out, scale by function, data denormalization
• really works
• each has disadvantages
• mostly a manual process (NewSQL)
(image: http://qsec.deviantart.com)
• solves exactly this kind of problem
• rapid application development
• aggregate
• schema flexibility
• auto-scale-out
• auto-failover
• amount of data it is able to handle
• shared-nothing architecture, no SPOF
• performance

• the splendors and miseries of the aggregate
• the CAP theorem dilemma
[Diagram: the CAP triangle: Consistency, Availability, Partition Tolerance]

[Diagram, shown twice: analytical workloads pull toward Consistency and Performance; operational workloads pull toward Availability and Reliability]

"I want it all"
Apache Cassandra (open-sourced by Facebook in 2008)
• elastic scalability & linear performance *
• dynamic schema
• very high write throughput
• tunable per-request consistency
• fault-tolerant design
• multiple datacenter and cloud readiness
• CAS (compare-and-set) transaction support *
* http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-cassandra
• Large data sets on commodity hardware
• Tradeoff between speed and reliability
• Heavy-write workloads
• Time-series data
[Diagram: Cassandra sits on the Operational side, strong on Reliability and Performance, far from the Analytical side]
Small demo after this slide
[Diagram: a table of (TIMESTAMP, FIELD 1, …) rows such as 12344567 → DATA, distributed across servers: SERVER 1 holds 12326346, 13124124, 13237457; SERVER 2 holds 13627236]
• range queries across the cluster are expensive …
• … unless you shard by timestamp, but then that shard becomes a bottleneck for a heavy-write workload
select * from table where timestamp > 12344567 and timestamp < 13237457
• all columns are sorted by name
• row – aggregate item (never sharded)
Column Family:
row key 1 | column 1: value 1.1 | column 2: value 1.2 | column 3: value 1.3 | … | column N: value 1.N
row key 2 | column 1: value 2.1 | column 2: value 2.2 | … | column M: value 2.M
Super columns are discouraged and omitted here
• get key
• get slice
• get range
+ combinations of these queries
+ composite columns
SERVER 1:
row key 1 | timestamp, timestamp, timestamp, timestamp
row key 2 | timestamp, timestamp, timestamp
row key 3 | timestamp
row key 4 | timestamp, timestamp, timestamp, timestamp

SERVER 2:
row key 5 | timestamp, timestamp

(each row, with all its timestamp columns, lives on a single server)
get_slice(row_key, from, to, count)

get_slice("row key 1", from: "timestamp 1", null, 11)   // first page
get_slice("row key 1", from: "timestamp 11", null, 11)  // next page
get_slice("row key 1", null, to: "timestamp 11", 11)    // prev. page
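One way to turn those calls into a paging loop; a minimal sketch, assuming a hypothetical client exposing the get_slice(row_key, from, to, count) call above, a Column type with a getName() accessor, and a process() consumer (none of these are from the original slides). Fetching pageSize + 1 columns lets the extra column serve as the cursor for the next page.

int pageSize = 10;
String from = null;                                   // null: start from the beginning of the row
while (true) {
    List<Column> page = client.get_slice("row key 1", from, null, pageSize + 1);
    process(page.subList(0, Math.min(pageSize, page.size())));
    if (page.size() <= pageSize) {
        break;                                        // no boundary column: this was the last page
    }
    from = page.get(pageSize).getName();              // boundary column starts the next page
}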
Time-range with filter:
• "get all events for User J from N to M"
• "get all success events for User J from N to M"
• "get all events for all users from N to M"
events::success::User_123 | timestamp 1: value 1
events::success           | timestamp 1: value 1
events::User_123          | timestamp 1: value 1
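A minimal sketch of how such row keys might be composed and queried, assuming the same hypothetical client and get_slice signature as above; the "::" separator is the convention from the slides, everything else (variable names, the 1000 column limit) is illustrative.

// one pre-materialized row per filter combination, written at event time
String byUserAndStatus = "events::" + status + "::" + userId;  // e.g. events::success::User_123
String byStatus        = "events::" + status;                  // e.g. events::success
String byUser          = "events::" + userId;                  // e.g. events::User_123

// "get all success events for User J from N to M" is then one contiguous slice:
List<Column> events = client.get_slice(byUserAndStatus,
        String.valueOf(fromTs), String.valueOf(toTs), 1000);

The cost is writing each event once per row; the payoff is that every query in the list above becomes a single slice on a single server.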
Counters:
• "get # of events for User J grouped by hour"
• "get # of events for User J grouped by day"
events::success::User_123 | 1380400000: 14  | 1380403600: 42
events::User_123          | 1380400000: 842 | 1380403600: 1024
(group by day – the same, but in a different column family for TTL support; a write-path sketch follows below)
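A sketch of the write path for such counter rows. The incrementCounter(rowKey, columnName, delta) call is a hypothetical stand-in for whatever client the application uses; the point is the arithmetic that truncates an event timestamp to its hour bucket, which becomes the column name.

long eventTs = System.currentTimeMillis() / 1000;   // event time, seconds since epoch
long hourBucket = (eventTs / 3600) * 3600;          // truncate to the start of the hour
client.incrementCounter("events::" + userId, String.valueOf(hourBucket), 1);
client.incrementCounter("events::" + status + "::" + userId, String.valueOf(hourBucket), 1);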
• the row key should consist of a combination of fields with a high cardinality of values: name, id, etc.
• boolean values are a bad option
• composite columns are a good option for this
• a timestamp may help to spread historical data (see the sketch after this list)
• otherwise, scalability will not be linear
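A sketch of that last point, assuming the same "::" key convention: folding a coarse time bucket into the row key keeps any single row bounded and spreads history across the cluster.

long dayBucket = eventTs / 86400;                        // days since epoch
String rowKey = "events::" + userId + "::" + dayBucket;  // a fresh row per user per day
// without the bucket, one user's entire history piles into a single row on one replica set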
In theory, possible in real time:
• averages, 3-dimensional filters, group by, etc.

But:
• the data model is hard to tune
• lack of aggregation options
• aggregation over historical data
“I want interactive reports”
Cassandra
“Reports could be a little bit out of date, but I want to control this delay value”
Auto update somehow
• impact on the production system, or …
• … a higher total cost of ownership (for a separate analytics cluster)
• difficulties with scalability
• hard to support with multiple clusters
http://www.datastax.com/docs/0.7/map_reduce/hadoop_mr
http://aws.amazon.com
• Hadoop tech stack
• Automatic deployment
• Management API
• Transient clusters
• Amazon S3 as data storage *

* copy from S3 to EMR HDFS and back
Create a cluster and run a job:

JobFlowInstancesConfig instances = new JobFlowInstancesConfig();
instances.setHadoopVersion("1.0.3");              // placeholder version
instances.setInstanceCount(dataNodeCount + 1);    // +1 for the master node
instances.setMasterInstanceType("m1.medium");     // placeholder instance types
instances.setSlaveInstanceType("m1.medium");

RunJobFlowRequest req = new RunJobFlowRequest(name, instances);
req.setSteps(Arrays.asList(new StepConfig(name, jar)));  // jar: a HadoopJarStepConfig

AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);
emr.runJobFlow(req);
Execute a job on a running cluster:

StepConfig stepConfig = new StepConfig(name, jar);

AddJobFlowStepsRequest addReq = new AddJobFlowStepsRequest();
addReq.setJobFlowId(jobFlowId);              // id of the long-running cluster
addReq.setSteps(Arrays.asList(stepConfig));

AmazonElasticMapReduce emr = new AmazonElasticMapReduceClient(credentials);
emr.addJobFlowSteps(addReq);
• cluster lifecycle: Long-Running or Transient
• cold start = ~20 min
• tradeoff: cluster cost vs. availability
• compression and Combiner tuning can speed up jobs considerably
• problems common to all big data processing tools: monitoring, testability, and debugging (MRUnit, local Hadoop, smaller data sets)
// keep both stores consistent: journal the Cassandra write,
// confirm it only after the SQL transaction succeeds
long txId = -1;
try {
    txId = cassandra.persist(entity);
    sql.insert(some);
    sql.update(someElse);
    cassandra.commit(txId);
    sql.commit();
} catch (Exception e) {
    sql.rollback();
    cassandra.rollback(txId);
}
insert into CHANGES (key, committed, data)
values ('tx_id-58e0a7d7-eebc', 'false', ..);

update CHANGES set committed = 'true'
where key = 'tx_id-58e0a7d7-eebc';

delete from CHANGES
where key = 'tx_id-58e0a7d7-eebc';
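A minimal sketch of how the commit and rollback calls from the try/catch block might map onto the CHANGES journal above. The execute call is an assumed stand-in for whatever CQL driver the application uses, and txKey is the journal row key derived from the transaction id; neither is the author's actual code.

void commit(String txKey) {
    // flip the journal entry: readers may now trust the data
    execute("update CHANGES set committed = 'true' where key = ?", txKey);
}

void rollback(String txKey) {
    // drop the uncommitted journal entry entirely
    execute("delete from CHANGES where key = ?", txKey);
}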
Non-production setup:
• 3 nodes (Cassandra)
• m1.medium EC2 instances
• 1 data center
• 1 app instance
In numbers
Real-time metrics update (sync):
• average latency: 60 msec
• processes > 2,000 events per second
• generates > 1,000 reports per second

Real-time metrics update (async):
• processes > 15,000 events per second

Uploading to AWS S3: slow, but multi-threading helps *
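A sketch of the "multi-threading helps" point using the AWS SDK for Java v1; the bucket name, key prefix, thread count, and file list are assumptions, not values from the benchmark.

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

AmazonS3 s3 = new AmazonS3Client();                 // credentials from the default provider chain
ExecutorService pool = Executors.newFixedThreadPool(8);
for (File f : files) {
    // each putObject call is one blocking HTTP upload, so parallelism pays off
    pool.submit(() -> s3.putObject("my-bucket", "events/" + f.getName(), f));
}
pool.shutdown();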
It is more than enough, but what if …
• distributed systems force you to make decisions
• systems like Cassandra let you trade speed for consistency
• the CAP theorem is oversimplified: you have many more options
• polyglot persistence can make this world a better place
• do not try to hammer every nail with the same hammer
• Cassandra – great for time-series data and heavy-write workloads …
• … but use cases should be clearly defined
• Amazon S3 is great: simple, slow, but predictable storage
• Amazon EMR: the integration with S3 is great and the API is very good, but …
• … it isn't a magic trick and requires knowledge of Hadoop and skill to use effectively
/**
 * [email protected]
 * @yazovsky
 * www.linkedin.com/in/yazovsky
 */

/**
 * http://www.thumbtack.net
 * http://thumbtack.net/whitepapers
 */