SHIFT.com Migrating from MongoDB to Cassandra by: Blake Eggleston & Jon Haddad

Cassandra meetup slides - Oct 15 Santa Monica Coloft

Slides from our presentation at the Santa Monica Coloft on our Migration from MongoDB to Cassandra.

What is SHIFT.com?

Shift is a platform that enables marketers to communicate across organizations and departments in one single place.

It’s also an open application platform with a set of applications built on top of it that can communicate with one another.

Initial Stack

● Python
  ○ Flask
  ○ Celery
● MongoDB
  ○ mongoengine
● Neo4j / Titan
  ○ Bulbs
  ○ thunderdome
● Redis
● AWS
  ○ m1.xlarge for mongo

Current Stack

● Python
  ○ still flask
  ○ still celery
  ○ gevent (it rocks)
● Cassandra
  ○ 1.2.6
  ○ cqlengine
● ElasticSearch
● Redis
  ○ jondis
● AWS
  ○ m1.xlarge

Why did we move to Cassandra?

● Operational Benefits
  ○ Adding and removing nodes is much easier than with Mongo’s shards
● Control over our Data on Disk (LSMT)
● Love CQL3
● Long term scalability
  ○ Scales Linearly
  ○ Multi DC Support Baked in

Migration Goals

● Zero downtime
  ○ We wanted to roll out Cassandra without any service interruptions
● No loss of performance
  ○ By carefully structuring our schema we were able to match MongoDB’s performance.

Migration Strategy

Benefits of CQL3

● Easy to understand if you’re coming from an RDBMS
● Collections
  ○ sets, lists, maps
● Batch Queries
● Clustering Keys
  ○ Handles ordering of logical rows
  ○ Saved us from maintaining a column name management scheme and allowed us to focus on our data
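To make the clustering key idea concrete, here is a minimal in-memory sketch in plain Python. It is not Cassandra code: `ClusteredTable`, `insert`, and `slice` are hypothetical names made up for illustration. It assumes one partition key and one clustering key, and shows how logical rows that share a partition key stay sorted by clustering key, so a range of them can be read as one contiguous slice.

```python
import bisect
from collections import defaultdict


class ClusteredTable:
    """Toy model of a table with a partition key and one clustering key."""

    def __init__(self):
        # partition key -> clustering keys, kept in sorted order
        self._keys = defaultdict(list)
        # (partition key, clustering key) -> column values of one logical row
        self._rows = {}

    def insert(self, partition_key, clustering_key, **columns):
        # Writes are upserts: an existing logical row is overwritten.
        if (partition_key, clustering_key) not in self._rows:
            bisect.insort(self._keys[partition_key], clustering_key)
        self._rows[(partition_key, clustering_key)] = columns

    def slice(self, partition_key, start=None, end=None):
        # Return logical rows with start <= clustering key <= end, in order.
        keys = self._keys.get(partition_key, [])
        lo = 0 if start is None else bisect.bisect_left(keys, start)
        hi = len(keys) if end is None else bisect.bisect_right(keys, end)
        return [(k, self._rows[(partition_key, k)]) for k in keys[lo:hi]]


table = ClusteredTable()
table.insert("user-1", 3, body="third")
table.insert("user-1", 1, body="first")
table.insert("user-1", 2, body="second")
ordered_keys = [key for key, _ in table.slice("user-1")]
```

Rows come back in clustering-key order regardless of insertion order, which is the property that removes the need for a hand-rolled column naming scheme.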

Physical vs Logical Row

Single Row

Clustered Row

Data Modelling Patterns

● considerations: working with Mongo’s dbrefs and optimizing layout on disk

● structured tables as materialized views of the queries we planned on using

● moving multiple documents into a single physical row

● creating supporting index tables for looking up logical rows
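The last two patterns can be sketched with plain dictionaries standing in for tables. This is only an illustration under assumptions: `MessageTables`, the thread/message naming, and both table names are hypothetical and do not come from the SHIFT schema. The main table groups many logical rows under one key, and the supporting index table is written at the same time so a single logical row can be found without knowing which physical row holds it.

```python
class MessageTables:
    """Toy stand-in for a main table plus a supporting index table."""

    def __init__(self):
        # Main table: one physical row per thread, many logical rows inside.
        self.messages_by_thread = {}
        # Supporting index table: message_id -> thread_id.
        self.thread_by_message = {}

    def save(self, thread_id, message_id, body):
        # Every write updates both tables so the index never lags the data.
        self.messages_by_thread.setdefault(thread_id, {})[message_id] = body
        self.thread_by_message[message_id] = thread_id

    def get_message(self, message_id):
        # Two lookups: index table first, then the physical row it points to.
        thread_id = self.thread_by_message.get(message_id)
        if thread_id is None:
            return None
        return self.messages_by_thread[thread_id][message_id]

    def get_thread(self, thread_id):
        # One lookup returns every logical row stored in the physical row.
        return dict(self.messages_by_thread.get(thread_id, {}))


tables = MessageTables()
tables.save("thread-1", "msg-1", "hello")
tables.save("thread-1", "msg-2", "world")
```

The trade-off is an extra write per message in exchange for reads that never need to scan.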

Time Series: Message Stream

● Users have tens of thousands of messages
● Each user’s message stream is specific to them, like a Twitter feed
● This is Cassandra’s strength - Time Series
● Considered Redis, but it is a poor fit for multi-DC

create table news_feed (
    user_id uuid,
    message_id timeuuid,
    message text,  -- column type assumed; the slide omitted it
    primary key (user_id, message_id)
);
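As a rough illustration of why a timeuuid works as the clustering key here, the following standard-library sketch mimics the table in memory. `NewsFeed`, `add`, `latest`, and `timeuuid_to_datetime` are hypothetical helpers, not part of any Cassandra driver. It relies on the fact that a version 1 UUID (what CQL calls a timeuuid) embeds a timestamp counted in 100 ns ticks since 1582-10-15, so sorting on that timestamp yields a per-user, time-ordered stream.

```python
import uuid
from collections import defaultdict
from datetime import datetime, timezone

# 100 ns ticks between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timeuuid_to_datetime(time_uuid):
    """Recover the creation time embedded in a version 1 UUID."""
    unix_ticks = time_uuid.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_ticks / 1e7, tz=timezone.utc)


class NewsFeed:
    """In-memory stand-in for the news_feed table (illustration only)."""

    def __init__(self):
        # user_id (partition key) -> {message_id (clustering key): message}
        self._partitions = defaultdict(dict)

    def add(self, user_id, message):
        message_id = uuid.uuid1()  # version 1 UUID, i.e. a timeuuid
        self._partitions[user_id][message_id] = message
        return message_id

    def latest(self, user_id, limit=10):
        # Newest first, ordered by the timestamp inside each timeuuid.
        rows = self._partitions.get(user_id, {})
        newest_first = sorted(rows, key=lambda m: m.time, reverse=True)
        return [(m, rows[m]) for m in newest_first[:limit]]


feed = NewsFeed()
user = uuid.uuid4()
ids = [feed.add(user, f"message {i}") for i in range(5)]
recent = feed.latest(user, limit=3)
```

In Cassandra the ordering is maintained on disk at write time rather than by sorting on read, so a real table can return the newest messages for a user without the sort this sketch performs.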

cqlengine

● cqlengine.org
● the Python CQL3 object-row mapper
● exposes CQL3 tables as Python classes
● maps columns to properties
● builds CQL queries

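# imports the slide left out
from cqlengine import columns
from cqlengine.models import Model

# model definition
class ExampleModel(Model):
    example_id = columns.UUID(primary_key=True)
    example_type = columns.Integer(index=True)
    created_at = columns.DateTime()
    description = columns.Text(required=False)

# example query: rows whose indexed example_type column equals 1
ExampleModel.objects(example_type=1)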
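To give a feel for what "builds CQL queries" means, here is a deliberately tiny toy mapper in plain Python. It is not cqlengine and does not use its API: `Column`, `ToyModel`, `create_table_cql`, and `select_cql` are hypothetical names invented for this sketch. It only shows the general idea of turning class attributes into a table definition and keyword filters into a parameterized SELECT.

```python
class Column:
    """Describes one column: its CQL type and whether it is a key."""

    def __init__(self, cql_type, primary_key=False):
        self.cql_type = cql_type
        self.primary_key = primary_key


class ToyModel:
    table_name = None  # set by subclasses

    @classmethod
    def columns(cls):
        # Class attributes that are Column instances, in definition order.
        return {n: c for n, c in vars(cls).items() if isinstance(c, Column)}

    @classmethod
    def create_table_cql(cls):
        cols = cls.columns()
        defs = ", ".join(f"{n} {c.cql_type}" for n, c in cols.items())
        keys = ", ".join(n for n, c in cols.items() if c.primary_key)
        return f"CREATE TABLE {cls.table_name} ({defs}, PRIMARY KEY ({keys}))"

    @classmethod
    def select_cql(cls, **filters):
        unknown = set(filters) - set(cls.columns())
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        query = f"SELECT * FROM {cls.table_name}"
        if filters:
            query += " WHERE " + " AND ".join(f"{n} = %s" for n in filters)
        # Values are returned separately so a driver can bind them safely.
        return query, list(filters.values())


class Example(ToyModel):
    table_name = "example"
    example_id = Column("uuid", primary_key=True)
    example_type = Column("int")


ddl = Example.create_table_cql()
query, params = Example.select_cql(example_type=1)
```

A real mapper also handles instances, validation, type conversion, and talking to the cluster; none of that is modelled here.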

Improvements from moving to C*

● Operationally we’ve had zero problems
● Outstanding Performance
● Easy to build new features
● Community has been amazing (mailing list and #cassandra)

misc tips

● leveled compaction - good for read heavy workloads

● use secondary indexes sparingly, understand how they work and when to use them

● to reiterate, think about how you’re going to query your data

● use ElasticSearch / Solr for ad hoc queries