
Austin Cassandra Users 6/19: Apache Cassandra at Vast


DESCRIPTION

For our June meetup, we'll have our local friends at www.vast.com presenting some of their current use cases for Cassandra. Additionally, Vast will be talking about a non-blocking Scala client that they have developed in house.


Page 1: Austin Cassandra Users 6/19: Apache Cassandra at Vast

June 19, 2014

Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications

Page 2: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Introduction

• Don’t want this to be a data modeling talk

• We aren't experts - we are learning as we go

• Hopefully this will be useful to both you and us

• Informal - questions as we go

• We will share our experiences so far moving to Cassandra

• We are working on a bunch of existing and new projects

• We'll talk about 2 1/2 of them

• Some dev stuff, some ops stuff

• Some thoughts for the future

• Athena Scala Driver

Page 3: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Who is Vast?

• Vast operates white-label, performance-based marketplaces for publishers, and delivers big data mobile applications for automotive and real estate sales professionals

• “Big Data for Big Purchases”

• Marketplaces

• Large partner sites, including AOL, CARFAX, TrueCar, Realogy, USAA, Yahoo

• Hundreds of smaller partner sites

• Analytics

• Strong team of scarily smart data scientists

• Integrating analytics everywhere

Page 4: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Big Data

• HDFS - 1100TB

• Amazon S3 - 275TB

• Amazon Glacier - 150TB

• DynamoDB - 12TB

• Vertica - 2TB

• Cassandra - 1.5TB

• SOLR/Lucene - 400GB

• Zookeeper

• MySQL

• Postgres

• Redis

• CouchDB

Page 5: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Data Flow

• Flows between different data store types (many include historical data too)

• Systems of Record (SOR)

• Both root nodes and leaf nodes

• Derived data stores (mostly MVCC) for:

• Real time customer facing queries

• Real time analytics

• Alerting

• Offline analytics

• Reporting

• Debugging

• Mixture of dumps and deltas

• We have derived SORs

• Cached smaller subsets of records/fields for a specific purpose

• SORs in multiple data centers - some derived SORs shared

• Data flow is a graph, not a tree - there is feedback

Page 6: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Goals

• Reduce latency to <15 minutes for customer-facing data

• Reduce copying and duplication of data

• Network/storage/time costs

• More streaming & deltas, fewer dumps and derived SORs

• Want a multi-purpose, multi-tenant central store

• Something rock solid

• Something that can handle lots of data fast

• Something that can do random access and bulk operations

• Use for all data store types on previous slide

• (Over?)build it; they will come

• Consolidate the rest onto

• HDFS, Vertica, Postgres, S3, Glacier, SOLR/Lucene

Page 7: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Why Cassandra?

• Regarded as rock solid

• No single point of failure

• Active development & open source Java

• Good fit for the type of data we wanted to store

• Ease of configuration; all nodes are the same

• Easily tunable consistency at application level

• Easy control of sharding at application level

• Drivers for all our languages (we're mostly JVM but also node)

• Data locality with other tools

• Good cross data center support

Page 8: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Evolution

• July 2013 (alpha on C* 1.1)

• September 2013 (MTC-1 on C* 2.0.0)

• First use case (a nasty one) - we’ll talk about it later

• Stress/Destructive testing

• Found and helped fix a few bugs along the way

• Learned a lot about tuning and operations

• Half the nodes down at one point

• Corrupted SSTables on one node

• We’ve been cautious

• Started with internal-facing use only (doesn’t need 100% uptime)

• Moved to external-facing use, but with the ability to fall back off C* in minutes

• Getting braver

• C* is now the only SOR and real-time customer-facing store for some cases

• We have on occasion custom built C* with cherry-picked patches

Page 9: Austin Cassandra Users 6/19: Apache Cassandra at Vast

HW Specs MTC-1

• Remember, we want to build for the C* future

• 6 nodes

• 16x cores (Sandy Bridge)

• 256G RAM

• Lots of disk cache and mem-mapped NIO buffers

• 7x 1.2TB 10K RPM JBOD (4.2ms latency, 200MB/sec sequential each)

• 1x SSD commit volume (~100K IOPS, 550MB/sec sequential)

• RAID1 OS drives

• 4x gigabit ethernet

Page 10: Austin Cassandra Users 6/19: Apache Cassandra at Vast

SW Specs MTC-1

• CentOS 6.5

• Cassandra 2.0.5

• JDK 1.7.0_60-b19

• 8 gig young generation / 6.4 gig eden (SurvivorRatio=8, so eden is 8/10 of the young gen)

• 16 gig old generation

• Parallel new collector

• CMS collector

• Sounds like overkill but we are multi-tenant and have spiky loads

Page 11: Austin Cassandra Users 6/19: Apache Cassandra at Vast

General

• LOCAL_QUORUM for reads and writes

• Use LZ4 compression

• Use key cache (not row cache)

• Some SizeTiered, some Leveled CompactionStrategy (example DDL below)

• Drivers

• Athena (Scala / binary)

• Astyanax 1.56.48 (Java / thrift)

• node-cassandra-cql (Node / binary)
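
As a concrete illustration of the table-level settings above (LZ4 compression, key cache only, size-tiered or leveled compaction), here is a rough sketch of a C* 2.0-style DDL as it might be kept in Scala code; the table name and columns are hypothetical, and LOCAL_QUORUM is chosen per request in the driver rather than in the schema.

object ExampleSchema {
  // Hypothetical table; the WITH options mirror the settings listed above (C* 2.0 syntax)
  val createTable: String =
    """CREATE TABLE example_by_key (
      |  key text,
      |  ts timeuuid,
      |  payload blob,
      |  PRIMARY KEY (key, ts)
      |) WITH compression = {'sstable_compression': 'LZ4Compressor'}
      |  AND compaction = {'class': 'LeveledCompactionStrategy'}
      |  AND caching = 'keys_only';""".stripMargin
}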

Page 12: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Problem

• 40 million records (including duplicates per VIN) in HDFS

• Map/Reduce to 7 million SOLR XML updates in HDFS

• Not deltas today because of map/reduce-like business rules

• Export the SOLR XML from HDFS to the local FS

• Re-index via SOLR

• 40 gig SOLR index - at least 3 slaves

• OK-ish every few hours, but not every 15 minutes

• Even though we made a very fast parallel indexer

• The % of stored data read per indexing run is getting smaller

Page 13: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Solution

• Indexing in Hadoop

• SOLR (Lucene) segments created (no stored fields)

• Job option for fallback to stored fields in SOLR index

• Stored fields go to C* as JSON, directly from Hadoop

• Astyanax - 1MB batches - LOCAL_QUORUM

• Periodically create a new table (CF) with a full-data baseline (clustering) column

• 200MB/s to 3 replicas, continuously for one to two minutes

• 40,000 partition keys/s (one per record)

• Periodically add a new (clustering) column to the table with deltas from the latest dump (see the sketch below)

• Delta data size is 100x smaller and hits many fewer partition keys

• Keep multiple recent tables for rollback (for bad data more than for recovery)

• 2 gig SOLR index (20x smaller)
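
A rough sketch of the baseline-plus-delta bookkeeping described above, assuming each full load gets a fresh timestamp-named table and each delta adds a newer (clustering) column generation that readers only use once it is fully written; the helper names here are hypothetical.

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object IndexGenerations {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // A new table per full baseline load, named by load time
  // (compare the timestamp-style table name on the next slide)
  def baselineTableName(now: LocalDateTime): String =
    s"stored_fields_${now.format(fmt)}"

  // Deltas add a newer clustering-column generation to the current table;
  // readers pick the newest generation that is complete (MVCC), which also
  // makes rollback to a recent table straightforward
  def latestCompleteGeneration(generations: Seq[Long],
                               isComplete: Long => Boolean): Option[Long] =
    generations.sorted(Ordering[Long].reverse).find(isComplete)
}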

Page 14: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Solution

• Very bare bones - not even any metadata :-(

• Thrift style

• Note we use blob

• Everything is UTF-8 (see the encoding sketch after the table definition)

• Avro - Utf8

• Hadoop - Text

• Astyanax - ByteBuffer

• Most JVM drivers try to convert text to String

CREATE TABLE "20140618084015_20140618_081920_1403072360" (! key text,! column1 blob,! value blob,! PRIMARY KEY (key, column1)!) WITH COMPACT STORAGE;

Page 15: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1 - Search API - Solution

• Stored fields cached in SOLR JVM (verification/warm up tests)

• MVCC to prevent read-from-future

• Single clustering-key limit for the SOLR core

• Reads fall back from LOCAL_QUORUM to LOCAL_ONE (sketch below)

• Better to return something, even a subset of results

• Never happened in production though

• Issues

• Don’t recreate the table/CF until C* 2.1

• Early 2.0.x and Astyanax don’t like schema changes

• Create new tables with CQL3 via Astyanax

• Monitoring is harder since we now use a UUID for the table name

• Full (non-delta) index write rate strains GC and causes some hinting

• C* remains rock solid

• We can constrain by mapper/reducer count, and will probably add a ZooKeeper mutex
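
A minimal sketch of the consistency fallback mentioned above, assuming a Future-based store where the consistency level is chosen per query; the StoredFieldsStore trait is hypothetical and is not any particular driver’s API.

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical per-query consistency levels; not a real driver API
sealed trait Consistency
case object LocalQuorum extends Consistency
case object LocalOne extends Consistency

trait StoredFieldsStore {
  def fetch(keys: Seq[String], cl: Consistency): Future[Map[String, String]]
}

object FallbackReads {
  // Prefer LOCAL_QUORUM; on failure retry once at LOCAL_ONE so the search API
  // can still return something, even if it is only a subset of the results
  def readWithFallback(store: StoredFieldsStore, keys: Seq[String])
                      (implicit ec: ExecutionContext): Future[Map[String, String]] =
    store.fetch(keys, LocalQuorum).recoverWith {
      case _: Exception => store.fetch(keys, LocalOne)
    }
}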

Page 16: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 1.5 - RESA

• Newer version of the real estate pipeline

• Fully streaming delta pipeline (RabbitMQ)

• Field level SOLR index updates (include latest timestamp)

• C* row with JSON delta for that timestamp

• History is used in customer-facing features (reconstruction sketch below)

• Note this is really the same table as the Thrift one

CREATE TABLE for_sale (
  id text,
  created_date timestamp,
  delta_json text,
  PRIMARY KEY (id, created_date)
)
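
The per-timestamp rows above can be folded back into a current view of a listing, which is roughly how the history can feed customer-facing features; in this sketch a delta is modeled as a simple field map instead of the JSON text actually stored, and the names are hypothetical.

// A delta is really JSON text keyed by created_date; a Map stands in for it here
case class Delta(createdDate: Long, fields: Map[String, String])

object DeltaHistory {
  // Later deltas overwrite earlier fields, yielding the current view
  def currentView(deltas: Seq[Delta]): Map[String, String] =
    deltas.sortBy(_.createdDate)
      .foldLeft(Map.empty[String, String])((acc, d) => acc ++ d.fields)

  // The same history answers questions like “how did this field change over time?”
  def fieldHistory(deltas: Seq[Delta], field: String): Seq[(Long, String)] =
    deltas.sortBy(_.createdDate)
      .flatMap(d => d.fields.get(field).map(v => (d.createdDate, v)))
}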

Page 17: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 2 - Feed Management - Problem

• Thousands of feeds of different size and frequency

• Incoming feeds must be “polished”

• Geocoding must be done

• Images must be made available in S3

• Need to reprocess individual feeds

• Full output records are munged from asynchronously updated parts

• Previously a huge HDFS job

• 300M inputs for 70M full output records

• Records need all data to be “ready” for full output

• Silly, because most of the work is redundant from the previous run

• The only partitioning help comes from brittle HDFS directory structures

Page 18: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Use Case 2 - Feed Management - Solution

• Scala & Akka & Athena (large throughput - high parallelism)

• Compound partition key (2^n shards per feed)

• Spreads data - limits partition “row” length

• Read an entire feed without a key scan - small IN clause (sharding sketch below)

• Random access writes

• Any sub-field may be updated asynchronously

• Munged record emitted to HDFS whenever “ready”

CREATE TABLE feed_state (
  feed_name text,
  feed_record_id_shard int,
  record_id uuid,
  raw_record text,
  polished_data text,
  geocode_data text,
  image_status text,
  ...
  PRIMARY KEY ((feed_name, feed_record_id_shard), record_id)
)
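
A sketch of the compound-partition-key sharding described above: each record lands in one of 2^n shards of its feed, and a whole feed is read back with a small IN clause over the shard ids. The shard count and helper names are hypothetical, and a real query would bind parameters rather than interpolate them.

import java.util.UUID

object FeedSharding {
  // Hypothetical: 2^4 = 16 shards per feed; spreads writes across partitions
  // and keeps any one partition “row” from growing without bound
  val shardBits = 4
  val shardCount: Int = 1 << shardBits

  def shardFor(recordId: UUID): Int =
    Math.floorMod(recordId.hashCode, shardCount)

  // Reading an entire feed needs no key scan, just a small IN clause
  def readFeedCql(feedName: String): String =
    s"""SELECT * FROM feed_state
       | WHERE feed_name = '$feedName'
       |   AND feed_record_id_shard IN (${(0 until shardCount).mkString(", ")})""".stripMargin
}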

Page 19: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Monitoring

• OpsCenter

• log4j/syslog/graylog

• Email alerts

• nagios/zabbix

• Graphite (autogenerated graph pages)

• Machine stats via collectl, JVM stats from codahale

• Cassandra stats from codahale

• Suspect a possible issue with Hadoop using the same coordinator nodes

• GC logs

• VisualVM

Page 20: Austin Cassandra Users 6/19: Apache Cassandra at Vast

General Issues / Lessons Learned

• GC issues

• Old generation fragmentation causes eventual promotion failure

• Usually of 1MB Memtable “slabs” - these can be off-heap in C* 2.1 :-)

• Thrift API with bulk load probably not helping, but fragmentation is inevitable

• Some slow initial mark and remark STW pauses

• We do have a big young gen - New -XX:+ flags in 1.7.0_60 :-)

• As said, we aim to be multi-tenant

• Avoid client stupidity, but otherwise accommodate any client behavior

• GC now well tuned

• 1 compacting GC per day at off times, very rare 1 sec pauses per day, a handful of >0.5 sec pauses per day

• Cassandra and its own dog food

• Can’t wait for hints to be a commit-log-style regular file (C* 3.0)

• Compactions in progress table

• OpsCenter rollups - turned off for the search API tables

Page 21: Austin Cassandra Users 6/19: Apache Cassandra at Vast

General Issues / Lessons Learned

• Don’t repair things that don’t need them

• We also run -pr -par repair on each node

• Beware when not following the rules

• We were knowingly running on potentially buggy minor versions

• If you don’t know what you’re doing, you will likely screw up

• Fortunately for us, C* has always kept running fine

• It is usually pretty easy to fix with some googling

• Deleting data is counter-intuitively often a good fix!

Page 22: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Future

• Upgrade 2.0.x to use static columns

• User defined types :-)

• De-duplicate data into shared storage in C*

• Analytics via data locality

• Hadoop, Pig, Spark/Scalding, R

• More cross data center

• More tuning

• Full streaming pipeline with C* as side state store

Page 23: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Athena

• Why would we do such an obviously crazy thing?

• Need to support async, reactive applications across different problem domains

• Real-time API used by several disparate clients (iOS, Node.js, …)

• Ground-up implementation of the CQL 2.0 binary protocol

• Scala 2.10/2.11

• Akka 2.3.x

• Fully async, non-blocking API (usage sketch below)

• Has obvious advantages but requires a different paradigm

• Implemented as an extension for Akka-IO

• Low-level, actor-based abstraction

• Cluster, Host and Connection actors

• Reasonably stable

• High-level streaming Session API
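
To give a feel for the “different paradigm” of a fully async, non-blocking client, here is a hypothetical Future-based usage sketch; it is not Athena’s actual Session API (see the repository on the next slide for that).

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical async session: every call returns immediately with a Future
case class Row(columns: Map[String, String])

trait AsyncSession {
  def execute(cql: String, params: Any*): Future[Seq[Row]]
}

object AsyncUsage {
  // No thread blocks waiting on Cassandra; results are composed with map/flatMap
  def latestDeltaJson(session: AsyncSession, id: String)
                     (implicit ec: ExecutionContext): Future[Option[String]] =
    session
      .execute(
        "SELECT delta_json FROM for_sale WHERE id = ? ORDER BY created_date DESC LIMIT 1",
        id)
      .map(_.headOption.flatMap(_.columns.get("delta_json")))
}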

Page 24: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Athena

• Next steps

• Move off of Play Iteratees and onto Akka Reactive Streams

• Token based routing

• Client API very much in flux - suggestions are welcome!


• https://github.com/vast-engineering/athena

• Release of the first beta milestone to the Sonatype Maven repository is imminent

• Pull requests welcome!

Page 25: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Appendix

Page 26: Austin Cassandra Users 6/19: Apache Cassandra at Vast

GC Settings

-Xms24576M -Xmx24576M -Xmn8192M -Xss228k
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB -XX:+UseCondCardMark
-XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways
-XX:+HeapDumpOnOutOfMemoryError
-XX:+CMSPrintEdenSurvivorChunks -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure
-XX:PrintFLSStatistics=1

Page 27: Austin Cassandra Users 6/19: Apache Cassandra at Vast

Cassandra at Vast
Graham Sanderson - CTO, David Pratt - Director of Applications