
MongoDB for Time Series Data: Sharding


DESCRIPTION

Learn how to shard time series data with this presentation.


Page 1: MongoDB for Time Series Data: Sharding

Sr. Solutions Architect, MongoDB

Jake Angerman

Sharding Time Series Data

Page 2

Let's Pretend We Are DevOps

What my friends think I do

What society thinks I do

What my Mom thinks I do

What my boss thinks I do

What I think I do

What I really do

DevOps

Page 3

Sharding Overview

[Diagram: Application → Driver → Query Routers → Shard 1 … Shard N, each shard a replica set with one primary and two secondaries]

Page 4

Why do we need to shard?

• Reaching a limit on some resource
  – RAM (working set)
  – Disk space
  – Disk IO
  – Client network latency on writes (tag aware sharding)
  – CPU

Page 5

Do we need to shard right now?

• Two schools of thought:
  1. Shard at the outset to avoid technical debt later
  2. Shard later to avoid complexity and overhead today

• Either way, shard before you need to!
  – 256GB data size threshold published in documentation
  – Chunk migrations can cause memory contention and disk IO

[Chart: working set growing over time until it exceeds free RAM — "Things seemed fine… then I waited too long to shard"]

Page 6

Develop a nationwide traffic monitoring system

Page 7

Traffic sensors to monitor interstate conditions

• 16,000 sensors

• Measure:
  – Speed
  – Travel time
  – Weather, pavement, and traffic conditions

• Support desktop, mobile, and car navigation systems

Page 8

Model After NY State Solution

http://511ny.org

Page 9

{
  _id: "900006:2014031206",
  data: [
    { speed: NaN, time: NaN },
    { speed: NaN, time: NaN },
    { speed: NaN, time: NaN },
    ...
  ],
  conditions: {
    status: "unknown",
    pavement: "unknown",
    weather: "unknown"
  }
}

Sample Document Structure

Pre-allocated, 60-element array of per-minute data
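The pre-allocation above can be sketched in plain JavaScript; the helper name and the inputs are illustrative, not from the deck:

```javascript
// Build one pre-allocated document for a sensor-hour, matching the
// structure above: a 60-element per-minute array filled with NaN
// placeholders that later in-place updates overwrite.
function makeSensorHourDoc(sensorId, yyyymmddhh) {
  const data = [];
  for (let minute = 0; minute < 60; minute++) {
    data.push({ speed: NaN, time: NaN });
  }
  return {
    _id: sensorId + ":" + yyyymmddhh, // e.g. "900006:2014031206"
    data: data,
    conditions: { status: "unknown", pavement: "unknown", weather: "unknown" }
  };
}

const doc = makeSensorHourDoc("900006", "2014031206");
```

In the real system one such document would be inserted per sensor per hour, so per-minute updates never grow the document on disk.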

Page 10

> db.mdbw.stats()
{
  "ns" : "test.mdbw",
  "count" : 16000,            // one hour's worth of documents
  "size" : 65280000,          // size of user data, padding included
  "avgObjSize" : 4080,
  "storageSize" : 93356032,   // size of data extents, unused space included
  "numExtents" : 11,
  "nindexes" : 1,
  "lastExtentSize" : 31354880,
  "paddingFactor" : 1,
  "systemFlags" : 1,
  "userFlags" : 1,
  "totalIndexSize" : 801248,
  "indexSizes" : { "_id_" : 801248 },
  "ok" : 1
}

collection stats

Page 11

Storage model spreadsheet

sensors: 16,000
years to keep data: 6
docs per day: 384,000
docs per year: 140,160,000
docs total across all years: 840,960,000
indexes per day: 801,248 bytes
storage per hour: 63 MB
storage per day: 1.5 GB
storage per year: 539 GB
storage across all years: 3,235 GB
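The spreadsheet's figures follow from the stats() numbers; a quick sanity check in plain JavaScript (inputs are the slide's own values):

```javascript
// Reproduce the storage model: 16,000 sensors, one 4,080-byte
// document per sensor per hour, kept for 6 years.
const sensors = 16000;
const bytesPerDoc = 4080;                     // avgObjSize from db.mdbw.stats()
const years = 6;

const docsPerDay = sensors * 24;              // 384,000
const docsPerYear = docsPerDay * 365;         // 140,160,000
const docsTotal = docsPerYear * years;        // 840,960,000

const bytesPerHour = sensors * bytesPerDoc;   // 65,280,000 bytes
const bytesPerYear = bytesPerHour * 24 * 365; // ~539 GB after the slide's rounding
```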

Page 12

Why we need to shard now

539 GB in year one alone

[Chart: total storage (GB) by year, growing from ~539 GB in year 1 to ~3,235 GB by year 6]

16,000 sensors today… 47,000 tomorrow?

Page 13

What will our sharded cluster look like?

• We need to model the application to answer this question

• Model should include:
  – application write patterns (sensors)
  – application read patterns (clients)
  – analytic read patterns
  – data storage requirements

• Two main collections:
  – summary data (fast query times)
  – historical data (analysis of environmental conditions)

Page 14

Option 1: Everything in one sharded cluster

[Diagram: a single sharded cluster, Shards 1…N, each a replica set with one primary and two secondaries; Shard 1 is the primary shard]

• Issue: prevent analytics jobs from affecting application performance

• Summary data is small (16,000 * N bytes) and accessed frequently

Page 15

Option 2: Distinct replica set for summaries

[Diagram: Shards 1…N for historical data, plus a distinct replica set for summary data]

• Pros: operational separation between business functions

• Cons: application must write to two different databases

Page 16

Application read patterns

• Web browsers, mobile phones, and in-car navigation devices

• Working set will be kept in RAM

• 5M subscribers * 1% active * 50 sensors/query * 1 device query/min = 41,667 reads/sec

• 41,667 reads/sec * 4080 bytes = 162 MB/sec

– and that's without any protocol overhead

• Gigabit Ethernet is ≈ 118 MB/sec
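The arithmetic behind those read numbers, as a quick check in plain JavaScript (the 1% active fraction is the slide's assumption):

```javascript
// Read-load model from the slide: active devices each query 50
// sensors once per minute; each sensor document is ~4,080 bytes.
const subscribers = 5e6;
const activeFraction = 0.01;        // 1% of subscribers active
const sensorsPerQuery = 50;
const queriesPerDevicePerMin = 1;
const bytesPerSensorDoc = 4080;     // avgObjSize from stats()

const readsPerSec =
  subscribers * activeFraction * sensorsPerQuery * queriesPerDevicePerMin / 60;
// ≈ 41,667 reads/sec

const mbPerSec = readsPerSec * bytesPerSensorDoc / (1024 * 1024);
// ≈ 162 MB/sec — above Gigabit Ethernet's ~118 MB/sec, before protocol overhead
```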

[Diagram: replica set (primary + two secondaries) serving clients over a 1 Gbps link]

Page 17

Application read patterns (continued)

• Options:
  – provision more bandwidth ($$$)
  – tune application read pattern
  – add a caching layer
  – secondary reads from the replica set

[Diagram: replica set with a 1 Gbps link to each of the three nodes]

Page 18

Secondary Reads from the Replica Set

• Stale data OK in this use case

• Caution: a read preference of secondary could be disastrous in a 3-member replica set if a secondary fails!

• app servers with mixed read preferences of primary and secondary are operationally cumbersome

• Use nearest read preference to access all nodes

[Diagram: replica set with a 1 Gbps link to each node, reads spread across all members]

db.collection.find().readPref("nearest")

Page 19

Replica Set Tags

• app servers in different data centers use replica set tags plus read preference nearest

• db.collection.find().readPref("nearest", [ { "datacenter": "east" } ])

[Diagram: "east" data center — primary and two secondaries]

> rs.conf()
{ "_id" : "rs0",
  "version" : 2,
  "members" : [
    { "_id" : 0,
      "host" : "node0.example.net:27017",
      "tags" : { "datacenter": "east" }
    },
    { "_id" : 1,
      "host" : "node1.example.net:27017",
      "tags" : { "datacenter": "east" }
    },
    { "_id" : 2,
      "host" : "node2.example.net:27017",
      "tags" : { "datacenter": "east" }
    }
  ]
}
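Tags like these are applied with rs.reconfig(); a mongosh fragment of the pattern (hosts follow the example config above, and the per-member datacenter values here are illustrative, anticipating the multi-data-center layout on the next slides):

```
// Run against the replica set primary: fetch the config, tag each
// member with its data center, and push the new config.
cfg = rs.conf()
cfg.members[0].tags = { "datacenter": "east" }
cfg.members[1].tags = { "datacenter": "central" }
cfg.members[2].tags = { "datacenter": "west" }
rs.reconfig(cfg)
```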

Page 20

Replica Set Tags

• Enables geographic distribution

[Diagram: primary and secondaries distributed across east, central, and west data centers]

Page 21

Replica Set Tags

• Enables geographic distribution

• Allows scaling within each data center

[Diagram: three nodes in each of the east, central, and west data centers]

Page 22

Analytic read patterns

• How does an analyst look at the data on the sharded cluster?

• 1 Year of data = 539 GB

[Chart: server RAM required per shard vs. number of shards — 256, 192, 128, 64, 32 GB as the shard count grows]

Page 23

Application write patterns

• 16,000 sensors every minute = 267 writes/sec

• Could we handle 16,000 writes in one second?

– 16,000 writes * 4080 bytes = 62 MB

• Load test the app!
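The write-side arithmetic, checked the same way (plain JavaScript, slide figures):

```javascript
// Write-load model: 16,000 sensors reporting once per minute, with a
// worst case of every sensor checking in within the same second.
const sensors = 16000;
const bytesPerDoc = 4080;

const steadyWritesPerSec = sensors / 60;   // ≈ 267 writes/sec
const burstBytes = sensors * bytesPerDoc;  // 65,280,000 bytes ≈ 62 MB in one second
```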

Page 24

Modeling the Application - summary

• We modeled:
  – application write patterns (sensors)
  – application read patterns (clients)
  – analytic read patterns
  – data storage requirements
  – the network, a little bit

Page 25

Shard Key

Page 26

Shard Key characteristics

• A good shard key has:
  – sufficient cardinality
  – distributed writes
  – targeted reads ("query isolation")

• Shard key should be in every query if possible
  – scatter gather otherwise

• Choosing a good shard key is important!
  – affects performance and scalability
  – changing it later is expensive

Page 27

Hashed shard key

• Pros:
  – Evenly distributed writes

• Cons:
  – Random data (and index) updates can be IO intensive
  – Range-based queries turn into scatter gather

[Diagram: mongos fanning writes out evenly to Shards 1…N]
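A toy simulation of why hashing spreads even monotonically increasing keys: the hash below is a simple stand-in for MongoDB's real hashed index (which is MD5-based), and the shard count is arbitrary:

```javascript
// Route keys to shards by hashing: ascending keys (timestamps,
// counters) land on every shard instead of piling onto the last one.
function toyHash(key) {
  let h = 0;
  const s = String(key);
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) >>> 0; // simple 32-bit rolling hash
  }
  return h;
}

const shards = 4;
const counts = new Array(shards).fill(0);
for (let t = 0; t < 10000; t++) {
  counts[toyHash(t) % shards]++;          // ascending "timestamps"
}
// counts spreads the 10,000 inserts across all four shards
```

The flip side shown in the cons above: a range query over consecutive timestamps must now visit every shard.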

Page 28

Low cardinality shard key

• Induces "jumbo chunks"

• Examples: sensor ID

• Makes sense for some use cases besides this one

[Diagram: mongos routing to Shards 1…N with chunk ranges [ a, b ), [ b, c ), [ c, d ), [ e, f )]

Page 29

Ascending shard key

• Monotonically increasing shard key values cause "hot spots" on inserts

• Examples: timestamps, _id

[Diagram: mongos directing all inserts to Shard N, owner of the [ ISODate(…), $maxKey ) chunk]
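The hot spot is easy to simulate: with range-based chunks, every new ascending key falls into the chunk ending at $maxKey. The chunk bounds below are illustrative:

```javascript
// Range routing: find the chunk whose [min, max) interval holds the key.
const chunks = [
  { min: 0,   max: 100,      shard: "shard1" },
  { min: 100, max: 200,      shard: "shard2" },
  { min: 200, max: 300,      shard: "shard3" },
  { min: 300, max: Infinity, shard: "shardN" } // the [ …, $maxKey ) chunk
];

function route(key) {
  return chunks.find(c => key >= c.min && key < c.max).shard;
}

// Ascending inserts past the last split point all hit the same shard:
const targets = [300, 301, 302, 1000].map(route);
// targets → ["shardN", "shardN", "shardN", "shardN"]
```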

Page 30

Choosing a shard key for time series data

• Consider a compound shard key: { arbitrary value, incrementing value }

• Best of both worlds: multi-hot spotting, targeted reads

[Diagram: mongos routing to Shards 1…N; each shard holds the chunk ranges for one leading value, e.g. Shard 1: [ {V1, ISODate(A)}, {V1, ISODate(B)} ), [ {V1, ISODate(B)}, {V1, ISODate(C)} ), [ {V1, ISODate(C)}, {V1, ISODate(D)} ), …]

Page 31

What is our shard key?

• Let's choose: linkID, date
  – example: { linkID: 900006, date: 2014031206 }
  – example: { _id: "900006:2014031206" }
  – this application's _id is in this form already, yay!
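In mongosh, sharding on that key would look like the fragment below; the database and collection names (traffic.linkData) are assumptions, not from the deck:

```
// Enable sharding for the database, then shard the collection on the
// compound {linkID, date} key. On an empty collection, shardCollection
// creates the required supporting index itself.
sh.enableSharding("traffic")
sh.shardCollection("traffic.linkData", { linkID: 1, date: 1 })
```

Queries that include linkID are targeted to the shards owning that link's chunks; queries without it scatter-gather across the cluster.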

Page 32

Summary

• Model the read/write patterns and storage

• Choose an appropriate shard key

• DevOps influenced the application:
  – write recent summary data to a separate database
  – replica set tags for the summary database
  – avoid synchronous sensor check-ins
  – consider changing client polling frequency
  – consider throttling REST API access to app servers

Page 33

Sign up for our “Path to Proof” Program and get free expert advice

on implementation, architecture, and configuration.

www.mongodb.com/lp/contact/path-proof-program

Page 34

Page 35

Sr. Solutions Architect, MongoDB

Jake Angerman

Thank You