
Austin Cassandra Users 6/19: Apache Cassandra at Vast


DESCRIPTION

For our June meetup, we'll have our local friends at www.vast.com presenting some of their current use cases for Cassandra. Additionally, Vast will be talking about a non-blocking Scala client that they have developed in house.


Page 1: Austin Cassandra Users 6/19: Apache Cassandra at Vast

June 19, 2014

Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications

Page 2: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Introduction

• Don’t want this to be a data modeling talk

• We aren't experts - we are learning as we go

• Hopefully this will be useful to both you and us

• Informal - questions as we go

• We will share our experiences so far moving to Cassandra

• We are working on a bunch of existing and new projects

• We'll talk about 2 1/2 of them

• Some dev stuff, some ops stuff

• Some thoughts for the future

• Athena Scala Driver

Page 3: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Who is Vast?

• Vast operates white-label, performance-based marketplaces for publishers, and delivers big data mobile applications for automotive and real estate sales professionals

• “Big Data for Big Purchases”

• Marketplaces

• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo

• Hundreds of smaller partner sites

• Analytics

• Strong team of scarily smart data scientists

• Integrating analytics everywhere

Page 4: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Big Data

• HDFS - 1100TB

• Amazon S3 - 275TB

• Amazon Glacier - 150TB

• DynamoDB - 12TB

• Vertica - 2TB

• Cassandra - 1.5TB

• SOLR/Lucene - 400GB

• Zookeeper

• MySQL

• Postgres

• Redis

• CouchDB

Page 5: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Data Flow

• Flows between different data store types (many include historical data too)

• Systems of Record (SOR)

• Both root nodes and leaf nodes

• Derived data stores (mostly MVCC) for:

• Real time customer facing queries

• Real time analytics

• Alerting

• Offline analytics

• Reporting

• Debugging

• Mixture of dumps and deltas

• We have derived SORs

• Cached smaller subsets of records/fields for a specific purpose

• SORs in multiple data centers - some derived SORs shared

• Data flow is a graph, not a tree - there is feedback

Page 6: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Goals

• Reduce latency to <15 minutes for customer-facing data

• Reduce copying and duplication of data

• Network/storage/time costs

• More streaming & deltas, fewer dumps and derived SORs

• Want a multi-purpose, multi-tenant central store

• Something rock solid

• Something that can handle lots of data fast

• Something that can do random access and bulk operations

• Use for all data store types on previous slide

• (Over?)build it; they will come

• Consolidate the rest onto

• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene

Page 7: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Why Cassandra?

• Regarded as rock solid

• No single point of failure

• Active development & open source Java

• Good fit for the type of data we wanted to store

• Ease of configuration; all nodes are the same

• Easily tunable consistency at application level

• Easy control of sharding at application level

• Drivers for all our languages (we're mostly JVM but also node)

• Data locality with other tools

• Good cross data center support

Page 8: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Evolution

• July 2013 (alpha on C* 1.1)

• September 2013 (MTC-1 on C* 2.0.0)

• First use case (a nasty one) - we’ll talk about it later

• Stress/Destructive testing

• Found and helped fix a few bugs along the way

• Learned a lot about tuning and operations

• Half the nodes down at one point

• Corrupted SSTables on one node

• We’ve been cautious

• Started with internal-facing use only (doesn’t need 100% uptime)

• Moved to external-facing use, but with the ability to fall back off C* in minutes

• Getting braver

• C* is now the only SOR and real-time customer-facing store for some cases

• We have on occasion custom built C* with cherry-picked patches

Page 9: Austin Cassandra Users 6/19: Apache Cassandra at Vast

HW Specs MTC-1

• Remember, we want to build for the C* future

• 6 nodes

• 16x cores (Sandy Bridge)

• 256G RAM

• Lots of disk cache and mem-mapped NIO buffers

• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)

• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)

• RAID1 OS drives

• 4x gigabit ethernet

Page 10: Austin Cassandra Users 6/19: Apache Cassandra at Vast

SW Specs MTC-1

• CentOS 6.5

• Cassandra 2.0.5

• JDK 1.7.0_60-b19

• 8 gig young generation / 6.4 gig eden (SurvivorRatio=8, so eden is 8/10 of the young gen)

• 16 gig old generation

• Parallel new collector

• CMS collector

• Sounds like overkill but we are multi-tenant and have spiky loads

Page 11: Austin Cassandra Users 6/19: Apache Cassandra at Vast

General

• LOCAL_QUORUM for reads and writes

• Use LZ4 compression

• Use key cache (not row cache)

• Some SizeTiered, some Leveled CompactionStrategy (example DDL below)

• Drivers

• Athena (Scala / binary)

• Astyanax 1.56.48 (Java / thrift)

• node-cassandra-cql (Node / binary)
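
As a concrete illustration of the table-level settings above (LZ4 compression, key cache only, size-tiered or leveled compaction), here is a rough sketch of a C* 2.0-style DDL as it might be kept in Scala code; the table name and columns are hypothetical, and LOCAL_QUORUM is chosen per request in the driver rather than in the schema.

object ExampleSchema {
  // Hypothetical table; the WITH options mirror the settings listed above (C* 2.0 syntax)
  val createTable: String =
    """CREATE TABLE example_by_key (
      |  key text,
      |  ts timeuuid,
      |  payload blob,
      |  PRIMARY KEY (key, ts)
      |) WITH compression = {'sstable_compression': 'LZ4Compressor'}
      |  AND compaction = {'class': 'LeveledCompactionStrategy'}
      |  AND caching = 'keys_only';""".stripMargin
}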

Page 12: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Problem

• 40 million records (including duplicates per VIN) in HDFS

• Map/Reduce to 7 million SOLR XML updates in HDFS

• Not deltas today because of map/reduce-like business rules

• Export the SOLR XML from HDFS to the local FS

• Re-index via SOLR

• 40 gig SOLR index - at least 3 slaves

• OK-ish every few hours, but not every 15 minutes

• Even though we made a very fast parallel indexer

• The % of stored data read per indexing run is getting smaller

Page 13: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Solution

• Indexing in Hadoop

• SOLR (Lucene) segments created (no stored fields)

• Job option for fallback to stored fields in SOLR index

• Stored fields go to C* as JSON, directly from Hadoop

• Astyanax - 1MB batches - LOCAL_QUORUM

• Periodically create a new table (CF) with a full-data baseline (clustering) column

• 200MB/s to 3 replicas, continuously for one to two minutes

• 40,000 partition keys/s (one per record)

• Periodically add a new (clustering) column to the table with deltas from the latest dump (see the sketch below)

• Delta data size is 100x smaller and hits many fewer partition keys

• Keep multiple recent tables for rollback (for bad data more than for recovery)

• 2 gig SOLR index (20x smaller)
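
A rough sketch of the baseline-plus-delta bookkeeping described above, assuming each full load gets a fresh timestamp-named table and each delta adds a newer (clustering) column generation that readers only use once it is fully written; the helper names here are hypothetical.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object IndexGenerations {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // A new table per full baseline load, named by load time
  // (compare the timestamp-style table name on the next slide)
  def baselineTableName(now: LocalDateTime): String =
    s"stored_fields_${now.format(fmt)}"

  // Deltas add a newer clustering-column generation to the current table;
  // readers pick the newest generation that is complete (MVCC), which also
  // makes rollback to a recent table straightforward
  def latestCompleteGeneration(generations: Seq[Long],
                               isComplete: Long => Boolean): Option[Long] =
    generations.sorted(Ordering[Long].reverse).find(isComplete)
}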

Page 14: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Solution

• Very bare bones - not even any metadata :-(

• Thrift style

• Note we use blob

• Everything is UTF-8 (see the encoding sketch after the table definition)

• Avro - Utf8

• Hadoop - Text

• Astyanax - ByteBuffer

• Most JVM drivers try to convert text to String

CREATE TABLE "20140618084015_20140618_081920_1403072360" (! key text,! column1 blob,! value blob,! PRIMARY KEY (key, column1)!) WITH COMPACT STORAGE;

Page 15: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Solution

• Stored fields cached in SOLR JVM (verification/warm up tests)

• MVCC to prevent read-from-future

• Single clustering-key limit for the SOLR core

• Reads fall back from LOCAL_QUORUM to LOCAL_ONE (sketch below)

• Better to return something, even a subset of results

• Never happened in production though

• Issues

• Don’t recreate the table/CF until C* 2.1

• Early 2.0.x and Astyanax don’t like schema changes

• Create new tables with CQL3 via Astyanax

• Monitoring is harder since we now use a UUID for the table name

• Full (non-delta) index write rate strains GC and causes some hinting

• C* remains rock solid

• We can constrain by mapper/reducer count, and will probably add a ZooKeeper mutex
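
A minimal sketch of the consistency fallback mentioned above, assuming a Future-based store where the consistency level is chosen per query; the StoredFieldsStore trait is hypothetical and is not any particular driver’s API.

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical per-query consistency levels; not a real driver API
sealed trait Consistency
case object LocalQuorum extends Consistency
case object LocalOne extends Consistency

trait StoredFieldsStore {
  def fetch(keys: Seq[String], cl: Consistency): Future[Map[String, String]]
}

object FallbackReads {
  // Prefer LOCAL_QUORUM; on failure retry once at LOCAL_ONE so the search API
  // can still return something, even if it is only a subset of the results
  def readWithFallback(store: StoredFieldsStore, keys: Seq[String])
                      (implicit ec: ExecutionContext): Future[Map[String, String]] =
    store.fetch(keys, LocalQuorum).recoverWith {
      case _: Exception => store.fetch(keys, LocalOne)
    }
}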

Page 16: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1.5 - RESA

• Newer version of the real estate pipeline

• Fully streaming delta pipeline (RabbitMQ)

• Field level SOLR index updates (include latest timestamp)

• C* row with JSON delta for that timestamp

• History is used in customer-facing features (reconstruction sketch below)

• Note this is really the same table as the Thrift one

CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
)
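
The per-timestamp rows above can be folded back into a current view of a listing, which is roughly how the history can feed customer-facing features; in this sketch a delta is modeled as a simple field map instead of the JSON text actually stored, and the names are hypothetical.

// A delta is really JSON text keyed by created_date; a Map stands in for it here
case class Delta(createdDate: Long, fields: Map[String, String])

object DeltaHistory {
  // Later deltas overwrite earlier fields, yielding the current view
  def currentView(deltas: Seq[Delta]): Map[String, String] =
    deltas.sortBy(_.createdDate)
      .foldLeft(Map.empty[String, String])((acc, d) => acc ++ d.fields)

  // The same history answers questions like “how did this field change over time?”
  def fieldHistory(deltas: Seq[Delta], field: String): Seq[(Long, String)] =
    deltas.sortBy(_.createdDate)
      .flatMap(d => d.fields.get(field).map(v => (d.createdDate, v)))
}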

Page 17: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 2 - Feed Management - Problem

• Thousands of feeds of different size and frequency

• Incoming feeds must be “polished”

• Geocoding must be done

• Images must be made available in S3

• Need to reprocess individual feeds

• Full output records are munged from asynchronously updated parts

• Previously a huge HDFS job

• 300M inputs for 70M full output records

• Records need all data to be “ready” for full output

• Silly, because most of the work is redundant from the previous run

• The only partitioning help comes from brittle HDFS directory structures

Page 18: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 2 - Feed Management - Solution

• Scala & Akka & Athena (large throughput - high parallelism)

• Compound partition key (2^n shards per feed)

• Spreads data - limits partition “row” length

• Read an entire feed without a key scan - small IN clause (sharding sketch below)

• Random access writes

• Any sub-field may be updated asynchronously

• Munged record emitted to HDFS whenever “ready”

CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
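
A sketch of the compound-partition-key sharding described above: each record lands in one of 2^n shards of its feed, and a whole feed is read back with a small IN clause over the shard ids. The shard count and helper names are hypothetical, and a real query would bind parameters rather than interpolate them.

import java.util.UUID

object FeedSharding {
  // Hypothetical: 2^4 = 16 shards per feed; spreads writes across partitions
  // and keeps any one partition “row” from growing without bound
  val shardBits = 4
  val shardCount: Int = 1 << shardBits

  def shardFor(recordId: UUID): Int =
    Math.floorMod(recordId.hashCode, shardCount)

  // Reading an entire feed needs no key scan, just a small IN clause
  def readFeedCql(feedName: String): String =
    s"""SELECT * FROM feed_state
       | WHERE feed_name = '$feedName'
       |   AND feed_record_id_shard IN (${(0 until shardCount).mkString(", ")})""".stripMargin
}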

Page 19: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Monitoring

• OpsCenter

• log4j/syslog/graylog

• Email alerts

• nagios/zabbix

• Graphite (autogenerated graph pages)

• Machine stats via collectl, JVM stats from codahale

• Cassandra stats from codahale

• Suspect a possible issue with Hadoop using the same coordinator nodes

• GC logs

• VisualVM

Page 20: Austin Cassandra Users 6/19: Apache Cassandra at Vast

General Issues / Lessons Learned

• GC issues

• Old generation fragmentation causes eventual promotion failure

• Usually of 1MB Memtable “slabs” - these can be off-heap in C* 2.1 :-)

• Thrift API with bulk load probably not helping, but fragmentation is inevitable

• Some slow initial mark and remark STW pauses

• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)

• As said, we aim to be multi-tenant

• Avoid client stupidity, but otherwise accommodate any client behavior

• GC now well tuned

• 1 compacting GC per day at off times, very rare 1 sec pauses per day, a handful of >0.5 sec pauses per day

• Cassandra and its own dog food

• Can’t wait for hints to be a commit-log-style regular file (C* 3.0)

• Compactions in progress table

• OpsCenter rollups - turned off for the search API tables

Page 21: Austin Cassandra Users 6/19: Apache Cassandra at Vast

General Issues / Lessons Learned

• Don’t repair things that don’t need them

• We also run -pr -par repair on each node

• Beware when not following the rules

• We were knowingly running on potentially buggy minor versions

• If you don’t know what you’re doing, you will likely screw up

• Fortunately for us, C* has always kept running fine

• It is usually pretty easy to fix with some googling

• Deleting data is counter-intuitively often a good fix!

Page 22: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Future

• Upgrade 2.0.x to use static columns

• User defined types :-)

• De-duplicate data into shared storage in C*

• Analytics via data locality

• Hadoop, Pig, Spark/Scalding, R

• More cross data center

• More tuning

• Full streaming pipeline with C* as side state store

Page 23: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Athena

• Why would we do such an obviously crazy thing?

• Need to support async, reactive applications across different problem domains

• Real-time API used by several disparate clients (iOS, Node.js, …)

• Ground-up implementation of the CQL 2.0 binary protocol

• Scala 2.10/2.11

• Akka 2.3.x

• Fully async, non-blocking API (usage sketch below)

• Has obvious advantages but requires a different paradigm

• Implemented as an extension for Akka-IO

• Low-level, actor-based abstraction

• Cluster, Host and Connection actors

• Reasonably stable

• High-level streaming Session API
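
To give a feel for the “different paradigm” of a fully async, non-blocking client, here is a hypothetical Future-based usage sketch; it is not Athena’s actual Session API (see the repository on the next slide for that).

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical async session: every call returns immediately with a Future
case class Row(columns: Map[String, String])

trait AsyncSession {
  def execute(cql: String, params: Any*): Future[Seq[Row]]
}

object AsyncUsage {
  // No thread blocks waiting on Cassandra; results are composed with map/flatMap
  def latestDeltaJson(session: AsyncSession, id: String)
                     (implicit ec: ExecutionContext): Future[Option[String]] =
    session
      .execute(
        "SELECT delta_json FROM for_sale WHERE id = ? ORDER BY created_date DESC LIMIT 1",
        id)
      .map(_.headOption.flatMap(_.columns.get("delta_json")))
}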

Page 24: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Athena

• Next steps

• Move off of Play Iteratees and onto Akka Reactive Streams

• Token based routing

• Client API very much in flux - suggestions are welcome!


• https://github.com/vast-engineering/athena

• Release of the first beta milestone to the Sonatype Maven repository is imminent

• Pull requests welcome!

Page 25: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Appendix

Page 26: Austin Cassandra Users 6/19: Apache Cassandra at Vast

GC Settings

-Xms24576M -Xmx24576M -Xmn8192M -Xss228k
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB -XX:+UseCondCardMark
-XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways
-XX:+HeapDumpOnOutOfMemoryError
-XX:+CMSPrintEdenSurvivorChunks -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1

Page 27: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications