Scaling for Humongous amounts of data with MongoDB

Alvin Richards
Technical Director, EMEA
alvin@10gen.com
@jonnyeight
alvinonmongodb.com
http://bit.ly/OT71M4
From here...
http://bit.ly/Oxcsis
...to here...
...without one of these.
http://bit.ly/cnP77L
Warning!
• This is a technical talk
• But MongoDB is very simple!
Solving real world data problems with MongoDB
• Effective schema design for scaling
  • Linking versus embedding
  • Bucketing
  • Time series
• Implications of sharding keys with alternatives
• Read scaling through replication
• Challenges of eventual consistency
• Founded in 2007
  • Dwight Merriman, Eliot Horowitz
• $73M+ in funding
  • Flybridge, Sequoia, Union Square, NEA
• Worldwide Expanding Team
  • 170+ employees
  • NY, CA, UK and Australia
A quick word from MongoDB sponsors, 10gen
• Set the direction & contribute code to MongoDB
• Foster community & ecosystem
• Provide MongoDB cloud services
• Provide MongoDB support services
Since the dawn of the RDBMS
• Main memory: 1970 – Intel 1103, 1 Kbit; 2012 – 4GB of RAM costs $25.99
• Mass storage: 1970 – IBM 3330 Model 1, 100 MB; 2012 – 3TB SuperSpeed USB for $129
• Microprocessor: 1970 – nearly; the 4004 was being developed (4 bits, 92,000 instructions per second); 2012 – Westmere-EX has 10 cores, 30MB L3 cache, runs at 2.4GHz
More recent changes
• Faster: a decade ago, buy a bigger server; now, buy more servers
• Faster storage: a decade ago, a SAN with more spindles; now, SSD
• More reliable storage: a decade ago, a more expensive SAN; now, more copies of local storage
• Deployed in: a decade ago, your data center; now, the cloud – private or public
• Large data set: a decade ago, millions of rows; now, billions to trillions of rows
• Development: a decade ago, waterfall; now, iterative
http://bit.ly/Qmg8YD
Is Scaleout Mission Impossible?
• What about the CAP Theorem?
  • Brewer's theorem: Consistency, Availability, Partition Tolerance
  • It says that if a distributed system is partitioned, you can't update everywhere and still have consistency
  • So, either allow inconsistency or limit where updates can be applied
What MongoDB solves
• Cost-effectively operationalize abundant data (clickstreams, logs, tweets, ...)
• Relaxed transactional semantics enable easy scale out
• Auto-sharding for scale down and scale up
• Applications store complex data that is easier to model as documents
• Schemaless DB enables faster development cycles
Design Goal of MongoDB

[Chart: scalability & performance versus depth of functionality. memcached and key/value stores offer high scalability with little functionality; an RDBMS offers full functionality with limited scalability. MongoDB aims to combine scalability & performance with depth of functionality – agility, flexibility, cost.]
Schema Design at Scale
Design Schema for Twitter
• Model each user's activity stream
• Users
  • Name, email address, display name
• Tweets
  • Text
  • Who
  • Timestamp
Solution A
Two Collections - Normalized

// users - one doc per user
{
  _id: "alvin",
  email: "alvin@10gen.com",
  display: "jonnyeight"
}

// tweets - one doc per user per tweet
{
  user: "bob",
  tweet: "20111209-1231",
  text: "Best Tweet Ever!",
  ts: ISODate("2011-09-18T09:56:06.298Z")
}
Solution B
Embedded - Array of Objects

// users - one doc per user with all tweets
{
  _id: "alvin",
  email: "alvin@10gen.com",
  display: "jonnyeight",
  tweets: [
    {
      user: "bob",
      tweet: "20111209-1231",
      text: "Best Tweet Ever!",
      ts: ISODate("2011-09-18T09:56:06.298Z")
    }
  ]
}
Embedding
• Great for read performance
• One seek to load entire object
• One roundtrip to database
• Object grows over time when adding child objects
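Appending a child object is a single in-place update. A minimal sketch against the embedded schema of Solution B, assuming the embedded documents live in the tweets collection as in the find examples that follow (field values are illustrative):

// Push a new tweet onto the embedded array
> db.tweets.update(
    { _id: "alvin" },
    { $push: { tweets: { user: "bob",
                         tweet: "20111209-1232",
                         text: "Another tweet",
                         ts: new Date() } } } )

Note that as the array grows, the document can outgrow its allocated space and be moved on disk.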
Linking or Embedding?
Linking can make some queries easy
// Find latest 10 tweets for "alvin"
> db.tweets.find( { user: "alvin" } )
    .sort( { ts: -1 } )
    .limit(10)
But what effect does this have on the system?
[Diagrams: MongoDB memory-maps each collection and index into the process's virtual address space. The total mapped size is your virtual memory size; the portion currently held in physical RAM is your resident memory size; everything else must be paged in from disk.]

Physical RAM access = ~100 ns
Disk access = ~10,000,000 ns
db.tweets.find( { user: "alvin" } )
  .sort( { ts: -1 } )
  .limit(10)

[Diagram: each matching tweet lives at a different location on disk, so each one can trigger its own page fault.]

Linking = Many Random Reads + Seeks
db.tweets.find( { _id: "alvin" } )

[Diagram: the whole embedded document is fetched with one contiguous read from disk.]

Embedding = Large Sequential Read
Problems
• Large sequential reads
  • Good: disks are great at sequential reads
  • Bad: may read too much data
• Many random reads
  • Good: ease of querying
  • Bad: disks are poor at random reads (SSD?)
Solution C
Buckets

// tweets - one doc per user per day
> db.tweets.findOne()
{
  _id: "alvin-2011/12/09",
  email: "alvin@10gen.com",
  tweets: [
    { user: "Bob",
      tweet: "20111209-1231",
      text: "Best Tweet Ever!" },
    { author: "Joe",
      date: "May 27 2011",
      text: "Stuck in traffic (again)" }
  ]
}
Solution C
Last 10 Tweets

// Get the latest bucket, slice the last 10 tweets
db.tweets.find( { _id: /^alvin/ },
                { tweets: { $slice: -10 } } )
  .sort( { _id: -1 } )
  .limit(1)
db.tweets.find( { _id: /^alvin/ },
                { tweets: { $slice: -10 } } )
  .sort( { _id: -1 } )
  .limit(1)

[Diagram: only the latest bucket is fetched, with one small contiguous read.]

Bucket = Small Sequential Read
Sharding - Goals
• Data location transparent to your code
• Data distribution is automatic
• Data re-distribution is automatic
• Aggregate system resources horizontally
• No code changes
Sharding - Range distribution

sh.shardCollection( "test.tweets", { _id: 1 }, false )

[Diagram: three empty shards – shard01, shard02, shard03.]
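sh.shardCollection assumes sharding is already enabled for the database; a one-line sketch, assuming the database is named "test":

> sh.enableSharding("test")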
Sharding - Range distribution

[Diagram: key ranges spread across the shards – shard01: a-i, shard02: j-r, shard03: s-z.]
Sharding - Splits

[Diagram: shard02's range has split into ja-jz and k-r.]
Sharding - Splits

[Diagram: shard02 splits further into ja-ji, ji-js, js-jw and jz-r.]
Sharding - Auto Balancing

[Diagram: the balancer migrates chunks js-jw and jz-r off shard02, spreading the chunks evenly across the three shards.]
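The splits and migrations can be watched from the shell; sh.status() prints the shards and the chunk ranges each one holds:

> sh.status()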
How does sharding affect Schema Design?
• Sharding key choice
• Access patterns (query versus write)
Sharding Key

{ photo_id: ????, data: <binary> }

• What's the right key?
  • auto increment
  • MD5( data )
  • month() + MD5( data )
Right balanced access – Time Based
• ObjectId
• Auto Increment
• Only have to keep a small portion of the index in RAM
• Right shard is "hot"

Random access – Hash
• Have to keep the entire index in RAM
• All shards are "warm"

Segmented access – Month + Hash
• Have to keep some of the index in RAM
• Some shards are "warm"
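For the photo example above, a segmented "month + hash" key could be computed at insert time. A sketch using the shell's built-in hex_md5() helper (the collection name and payload are illustrative):

// Month prefix keeps recent inserts together; the hash spreads them within the month
> var photo = { data: "...photo bytes..." };
> photo._id = "2012/08" + "-" + hex_md5(photo.data);
> db.photos.save(photo);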
Solution A
Shard by a single identifier

{
  _id: "alvin",              // shard key
  email: "alvin@10gen.com",
  display: "jonnyeight",
  li: "alvin.j.richards",
  tweets: [ ... ]
}

• Shard on { _id: 1 }
• Lookup by _id routed to 1 node
• Index on { email: 1 }
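A sketch of the corresponding commands, assuming the collection is test.tweets as earlier:

> sh.shardCollection( "test.tweets", { _id: 1 } )
> db.tweets.ensureIndex( { email: 1 } )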
Sharding - Routed Query

find( { _id: "alvin" } )

[Diagram: mongos routes the query to the single shard whose range covers "alvin" – shard01 (a-i).]
Sharding - Scatter Gather

find( { email: "alvin@10gen.com" } )

[Diagram: email is not part of the shard key, so the query is broadcast to all three shards and the results are gathered.]
Multiple Identities
• A user can have multiple identities
  • twitter name
  • email address
  • etc.
• What is the best sharding key & schema design?
Solution B
Shard by multiple identifiers

// identities - one doc per identity
{ type: "_id", val: "alvin",            info: "1200-42" }
{ type: "em",  val: "alvin@10gen.com",  info: "1200-42" }
{ type: "li",  val: "alvin.j.richards", info: "1200-42" }

// tweets - keyed by the internal id
{ _id: "1200-42", tweets: [ ... ] }

• Shard identities on { type: 1, val: 1 }
• Lookup by type & val routed to 1 node
• Can create a unique index on type & val
• Shard tweets on { _id: 1 }
• Lookup info on _id routed to 1 node
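A sketch of the corresponding commands (collection names follow the example above):

> sh.shardCollection( "test.identities", { type: 1, val: 1 } )
> db.identities.ensureIndex( { type: 1, val: 1 }, { unique: true } )  // unique is allowed: it is the shard key
> sh.shardCollection( "test.tweets", { _id: 1 } )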
Sharding - Routed Query

[Diagram: identities chunks spread across the shards by { type, val } – (em, a-q), (em, r-z), (_id, a-z), (li, a-c), (li, d-r), (li, s-z); tweets chunks spread by _id – "Min"-"1100", "1100"-"1200", "1200"-"Max".]

find( { type: "em", val: "alvin@10gen.com" } )
find( { _id: "1200-42" } )

[Diagram: each lookup is routed to the single shard holding the matching chunk.]
Sharding - Caching

[Diagram: one shard holding all the chunks – 300 GB of data against 96 GB of RAM, a 3:1 data/memory ratio.]
Aggregate Horizontal Resources

[Diagram: the same 300 GB split across three shards – 100 GB each, with 96 GB of RAM per shard, roughly a 1:1 data/memory ratio.]
Auto Sharding - Summary
• Fully consistent
• Application code unaware of data location
• Zero code changes
• Shard by Compound Key, Tag, Hash (2.4)
• Add capacity (see the sketch below)
  • On-line
  • When needed
  • Zero downtime
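Adding capacity is one command; a sketch with a hypothetical replica-set shard:

> sh.addShard( "shard04/db10.example.net:27017" )

The balancer then migrates chunks onto the new shard in the background.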
Time Series Data
• Records votes by
  • Day, Hour, Minute
• Show time series of votes cast
Solution A
Time Series

// Time series buckets, hour and minute sub-docs
{
  _id: "20111209-1231",
  ts: ISODate("2011-12-09T00:00:00.000Z"),
  daily: 67,
  hourly: { 0: 3, 1: 14, 2: 19, ... 23: 72 },
  minute: { 0: 0, 1: 4, 2: 6, ... 1439: 0 }
}
// Add one to the last minute before midnight
> db.votes.update(
    { _id: "20111209-1231",
      ts: ISODate("2011-12-09T00:00:00.000Z") },
    { $inc: { daily: 1, "hourly.23": 1, "minute.1439": 1 } } )
BSON Storage
• A BSON document is a sequence of key/value pairs
• NOT a hash map
• Optimized to scan quickly

[Diagram: the minute field stores one flat sequence of keys – 0, 1, 2, 3, ... 1439.]

What is the cost of updating the minute before midnight? The scan has to walk every key that precedes it.

BSON Storage
• Can skip sub-documents

How could this change the schema?

[Diagram: minutes grouped into hour sub-documents – hour 0 holds minutes 0-59, hour 1 holds 60-119, ... hour 23 holds 1380-1439 – so an update can skip 23 sub-documents instead of scanning every key.]
Solution B
Time Series

// Time series buckets, each hour a sub-document
{
  _id: "20111209-1231",
  ts: ISODate("2011-12-09T00:00:00.000Z"),
  daily: 67,
  minute: {
    0:  { 0: 0, 1: 7, ... 59: 2 },
    ...
    23: { 0: 15, ... 59: 6 }
  }
}
// Add one to the last minute before midnight
> db.votes.update(
    { _id: "20111209-1231",
      ts: ISODate("2011-12-09T00:00:00.000Z") },
    { $inc: { daily: 1, "minute.23.59": 1 } } )
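Reading a day's series back is a single keyed fetch; a sketch that projects just the daily total and the final hour:

> db.votes.findOne( { _id: "20111209-1231" },
                    { daily: 1, "minute.23": 1 } )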
Replica Sets
• Data Protection
  • Multiple copies of the data
  • Spread across Data Centers, AZs
• High Availability
  • Automated Failover
  • Automated Recovery
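A replica set is created by starting the members and initiating a config; a minimal sketch with hypothetical hostnames:

> rs.initiate( { _id: "rs0",
                 members: [ { _id: 0, host: "db0.example.net:27017" },
                            { _id: 1, host: "db1.example.net:27017" },
                            { _id: 2, host: "db2.example.net:27017" } ] } )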
Replica Sets

[Diagram: the App writes to the Primary and can read from the Primary or either Secondary; replication is asynchronous.]

[Diagram: the Primary fails – the remaining members automatically elect a new Primary.]

[Diagram: the new Primary serves reads and writes while the failed member recovers.]

[Diagram: the recovered member rejoins the set as a Secondary.]
Replica Sets - Summary
• Data Protection
• High Availability
• Scaling eventually consistent reads
• Source to feed other systems
  • Backups
  • Indexes (Solr etc.)
Types of Durability with MongoDB
• Fire and forget
• Wait for error
• Wait for fsync
• Wait for journal sync
• Wait for replication
Least durability - Don't use!

[Diagram: the Driver sends the write; the Primary applies it in memory and no acknowledgement is requested.]
More durability

[Diagram: the Driver sends the write with getLastError and w:2; the Primary applies it in memory, replicates it to a Secondary, and only then acknowledges.]
Durability Summary

Less durable → More durable
• Default "Fire & Forget" – in memory only
• w=1 – acknowledged by the Primary (memory)
• w=1, j=true – written to the journal (comparable to an RDBMS)
• w="majority" / w=n – replicated to Secondaries
• w="myTag" – replicated to a tagged member set, e.g. in another data center
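In the shell, each level maps onto getLastError options; a sketch (issued after a write on the same connection):

> db.runCommand( { getLastError: 1 } )                  // w=1: Primary applied in memory
> db.runCommand( { getLastError: 1, j: true } )         // wait for journal sync
> db.runCommand( { getLastError: 1, w: "majority" } )   // wait for a majority of members
> db.runCommand( { getLastError: 1, w: "myTag" } )      // wait for a tagged member set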
Eventual Consistency
Using Replicas for Reads

slaveOk()
• driver will send read requests to Secondaries
• driver will always send writes to the Primary

Java examples
• DB.slaveOk()
• Collection.slaveOk()
• find(q).addOption(Bytes.QUERYOPTION_SLAVEOK);
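The 2.x mongo shell has an equivalent helper:

> rs.slaveOk()   // allow this shell's reads to go to Secondaries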
Understanding Eventual Consistency

[Diagram: Application #1 inserts v1 on the Primary and then updates it to v2. Reads from the Primary return v2; reads from the Secondary may return v1, or find v1 not yet present, until replication catches up.]

[Diagram: Application #2 reads from the Secondary and can observe nothing, then v1, then v2 as each write replicates – reads from Secondaries are eventually consistent.]
Product & Roadmap
The Evolution of MongoDB
2.2Aug ‘12
2.4 winter ‘12
2.0Sept ‘11
1.8March ‘11
Journaling
Sharding and Replica set enhancements
Spherical geo search
Index enhancements to improve size and performance
Authentication with sharded clusters
Replica Set Enhancements
Concurrency improvements
Aggregation Framework
Multi-Data Center Deployments
Improved Performance and Concurrency
Recommended