Webinar: Best Practices for Getting Started with MongoDB

MongoDB Best Practices

Jay Runkel
Principal Solutions Architect

@jayrunkel

About Me

• Solution Architect

• Part of Sales Organization

• Work with many organizations new to MongoDB

Everyone Loves MongoDB’s Flexibility

• Document Model

• Dynamic Schema

• Powerful Query Language

• Secondary Indexes

Sometimes Organizations Struggle with Performance

Good News!

• Poor performance is usually due to common (and often simple) mistakes

Agenda

• Quick MongoDB Introduction

• Best Practices

1. Hardware/OS

2. Schema/Queries

3. Loading Data

MongoDB Introduction

Document Data Model

Relational vs. MongoDB

{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

Documents are Rich Data Structures

{
  first_name: 'Paul',
  surname: 'Miller',
  cell: 447557505611,
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

• Fields

• Typed fields: String, Number, Geo-Coordinates

• Fields can contain arrays

• Fields can contain an array of sub-documents

Do More With Your Data

{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley', year: 1973, value: 100000, … },
    { model: 'Rolls Royce', year: 1965, value: 330000, … }
  ]
}

• Rich Queries: Find everybody in London with a car built between 1970 and 1980

• Geospatial: Find all of the car owners within 5km of Trafalgar Sq.

• Text Search: Find all the cars described as having leather seats

• Aggregation: Calculate the average value of Paul’s car collection

• Map Reduce: What is the ownership pattern of colors by geography over time? (Is purple trending up in China?)
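As a rough illustration of the Rich Queries and Aggregation questions above, the sketch below answers them in plain JavaScript against an in-memory copy of the sample document. The collection name `owners` used in the comments is hypothetical, and the shell forms in the comments are only one possible way to phrase each question.

```javascript
// In-memory copy of the sample document (elided fields omitted).
const owners = [
  {
    first_name: "Paul",
    surname: "Miller",
    city: "London",
    location: [45.123, 47.232],
    cars: [
      { model: "Bentley", year: 1973, value: 100000 },
      { model: "Rolls Royce", year: 1965, value: 330000 },
    ],
  },
];

// Rich query: everybody in London with a car built between 1970 and 1980.
// Possible shell form (collection name `owners` is an assumption):
//   db.owners.find({ city: "London",
//     cars: { $elemMatch: { year: { $gte: 1970, $lte: 1980 } } } })
const londoners = owners.filter(
  (o) => o.city === "London" && o.cars.some((c) => c.year >= 1970 && c.year <= 1980)
);

// Aggregation: average value of Paul's car collection.
// Possible shell form:
//   db.owners.aggregate([
//     { $match: { first_name: "Paul" } },
//     { $unwind: "$cars" },
//     { $group: { _id: "$_id", avgValue: { $avg: "$cars.value" } } } ])
const paul = owners.find((o) => o.first_name === "Paul");
const avgValue = paul.cars.reduce((sum, c) => sum + c.value, 0) / paul.cars.length;

console.log(londoners.map((o) => o.surname), avgValue);
```

`$elemMatch` is used in the shell comment so that both year bounds apply to the same car rather than to any two cars in the array.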

Automatic Sharding

Three types: hash-based, range-based, location-aware

Increase or decrease capacity as you go

Automatic balancing

Query Routing

Multiple query optimization models

Each sharding option appropriate for different apps

mongos

Replica Sets

Replica Set – 2 to 50 copies

Self-healing shard

Data Center Aware

Addresses availability considerations:

High Availability

Disaster Recovery

Maintenance

Workload Isolation: operational & analytics

Assumptions

MongoDB 3.0 or 3.2

Storage Engine Architecture in 3.2

[Diagram. Applications: Content Repo, IoT Sensor Backend, Ad Service, Customer Analytics, Archive. Layers: MongoDB Query Language (MQL) + Native Drivers; MongoDB Document Data Model; Management; Security. Storage engines: WT (WiredTiger), MMAP, In-memory (beta), Encrypted, 3rd party. Label: "Supported in MongoDB 3.2".]

Best Practices

Hardware/Operating System

Servers

• Specifications Good Fit For MongoDB?

• Correct Number of Servers?

• Properly Configured?

What Type of Servers

• RAM
  – 64 to 256 GB+

• Fast IO Systems
  – RAID-10/SSDs

• Many cores
  – Compress/Uncompress
  – Encrypt/Decrypt
  – Aggregation queries

What about a SAN?

• Mostly Random Disk Access

• IOPS

• Need dedicated IOPS or performance will vary

• Configure your SAN properly

• Suitability of any IO system will depend upon IOPS

How Many Servers Do I Need?

• How Many Shards Do I Need?

MongoDB cluster sizing at 30,000 ft

• Disk Space

• RAM

• Query Throughput

Disk Space: How Many Shards Do I Need?

• Sum of disk space across shards must be greater than the required storage size

Example

Data Size = 9 TB
WiredTiger Compression Ratio: .33
Storage Size = 3 TB
Server disk capacity = 2 TB

2 Shards Required
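The disk-space rule above can be sketched as a small calculation. This is a minimal illustration, not a sizing tool: the function name and parameters are made up for the example, and it leaves out any headroom for growth.

```javascript
// Shards needed so that total disk capacity covers the compressed data size.
// compressionRatio is storage size divided by raw data size (.33 in the example).
function shardsForDisk(dataSizeTB, compressionRatio, diskPerServerTB) {
  const storageSizeTB = dataSizeTB * compressionRatio;
  return Math.ceil(storageSizeTB / diskPerServerTB);
}

// 9 TB of data at a .33 ratio is about 3 TB on disk; with 2 TB servers
// that needs two shards.
console.log(shardsForDisk(9, 0.33, 2));
```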

RAM: How Many Shards Do I Need?

• Working set should fit in RAM
  – Sum of RAM across shards > Working Set

• Working Set = indexes plus the set of documents accessed frequently

• Working Set in RAM
  – Shorter latency
  – Higher throughput

• Measuring Index Size
  – db.coll.stats() – index size of collection

• Estimate frequently accessed documents
  – Ex: total size of documents accessed per day

Example

Working Set = 428 GB
Server RAM = 128 GB

428/128 = 3.34

4 Shards Required
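A minimal sketch of the RAM calculation above, assuming the index size comes from the `totalIndexSize` field (in bytes) of `db.coll.stats()` and the size of frequently accessed documents is your own estimate. The input values are hypothetical and were chosen only to reproduce the 428 GB working set from the example.

```javascript
const GIB = 1024 ** 3;

// Working set = index size + size of frequently accessed documents.
// `collStats` mimics the object returned by db.coll.stats(); only its
// totalIndexSize field is used. hotDocBytes is an application-level estimate.
function workingSetGB(collStats, hotDocBytes) {
  return (collStats.totalIndexSize + hotDocBytes) / GIB;
}

// Shards needed so that the sum of RAM across shards exceeds the working set.
function shardsForRam(wsGB, ramPerServerGB) {
  return Math.ceil(wsGB / ramPerServerGB);
}

// Hypothetical split: 128 GB of indexes plus 300 GB of hot documents.
const sampleStats = { totalIndexSize: 128 * GIB };
const wsGB = workingSetGB(sampleStats, 300 * GIB);
console.log(wsGB, shardsForRam(wsGB, 128));
```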

Query Rate: How Many Shards Do I Need?

• Measure max sustained query rate of a single server (with replication)
  – Build a prototype and measure

• Assume sharding overhead of 20-30%

Example

Require: 50K ops/sec
Prototype performance: 20K ops/sec (1 replica set)

4 Shards Required: 80K ops/sec * .7 = 56K ops/sec
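The query-rate estimate can be sketched the same way. This assumes the example is in thousands of operations per second (50K required, 20K measured on the prototype) and treats sharding overhead as a flat fraction of throughput; the function name is illustrative only.

```javascript
// Smallest shard count whose combined throughput, after sharding overhead,
// meets the required rate. overhead = 0.3 means 30% is lost to sharding.
function shardsForThroughput(requiredOps, opsPerShard, overhead) {
  const effectivePerShard = opsPerShard * (1 - overhead);
  return Math.ceil(requiredOps / effectivePerShard);
}

const byQueryRate = shardsForThroughput(50000, 20000, 0.3);

// The cluster has to satisfy every constraint, so take the largest estimate.
// The 2 and 4 are the disk and RAM results from the earlier examples.
const shardsNeeded = Math.max(2, 4, byQueryRate);
console.log(byQueryRate, shardsNeeded);
```

With three shards the same arithmetic gives 60K * .7 = 42K ops/sec, which falls short of the 50K target, so four is the minimum under these assumptions.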

Configure Them Properly

• Default OS Settings Often Don’t Provide Optimal Performance

• See MongoDB Production Notes
  – https://docs.mongodb.org/manual/administration/production-notes

• Also Review:
  – Amazon EC2: https://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
  – Azure: https://docs.mongodb.org/ecosystem/platforms/windows-azure/

Server/OS Configuration

• Server configuration recommendations
  – XFS
  – Turn off atime and diratime
  – NOOP scheduler
  – File descriptor limits
  – Disable transparent huge pages and NUMA
  – Read ahead of 32
  – Separate data volumes for data files, the journal, and the log
  – Change the default TCP keepalive time to 300 seconds

These are important

• Ignore them and your performance may suffer

• The first 100 lines of the MongoDB logs identify suboptimal OS settings

Best Practices

Schema Design

Don’t Use a Relational Schema

Tailor MongoDB Schema to Application Workload

• Design schema to provide good query performance

• Schema design will impact required number of shards!

Application Query Workload

{ Name: "john", Height: 12, Address: {…} }

db.cust.find({…})

db.cust.aggregate({…})

Compare Alternative Schemas

• Build a spreadsheet

• Calculate # of shards for each schema

• Estimate query performance
  – # of documents
  – # of inserts
  – # of deletes
  – Required indexes
  – Number of documents inspected
  – Number of documents sent across network

Modeling Decisions

• Referencing vs. Embedding

• Aggregating data by device, customer, product, etc.

Referencing

Procedure

{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : 134
}

Results

{
  "_id" : 134,
  "type" : "txt",
  "size" : NumberInt(12),
  "content" : { value1: 343, value2: "abc", … }
}

Embedding

Procedure

{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : { value1: 343, value2: "abc", … }
  }
}

Embedding

• Advantages
  – Retrieve all relevant information in a single query/document
  – Avoid implementing joins in application code
  – Update related information as a single atomic operation
    (MongoDB 3.0/3.2 doesn’t offer multi-document transactions)

• Limitations
  – Large documents mean more overhead if most fields are not relevant
  – Might mean replicating data
  – 16 MB document size limit

Referencing

• Advantages
  – Smaller documents
  – Less likely to reach 16 MB document limit
  – Infrequently accessed information not accessed on every query
  – No duplication of data

• Limitations
  – Two queries required to retrieve information
  – Cannot update related information atomically
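The "two queries versus one" trade-off can be seen in a small in-memory sketch. The arrays below stand in for collections and `findById` is a hypothetical helper, not a driver call; the documents are trimmed versions of the Procedure and Results examples.

```javascript
// Tiny in-memory stand-ins for the two collections.
const procedureDocs = [
  { _id: 333, patient: "John Doe", type: "Chest X-ray", result: 134 },
];
const resultDocs = [
  { _id: 134, type: "txt", size: 12, content: { value1: 343, value2: "abc" } },
];

// Hypothetical lookup helper that counts how many lookups were issued.
let lookups = 0;
function findById(collection, id) {
  lookups += 1;
  return collection.find((doc) => doc._id === id);
}

// Referencing: two lookups, joined in application code.
const proc = findById(procedureDocs, 333);
const referenced = { ...proc, result: findById(resultDocs, proc.result) };
const referencingLookups = lookups;

// Embedding: the result lives inside the procedure, so one lookup suffices.
const embeddedProcedureDocs = [{ ...proc, result: resultDocs[0] }];
lookups = 0;
const embedded = findById(embeddedProcedureDocs, 333);
const embeddingLookups = lookups;

console.log(referencingLookups, embeddingLookups);
```

Both paths end with the same combined document; the difference is the number of round trips and where the join logic lives.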

One-to-Many & Many-to-Many Relationships

Embed

Patients:
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan", … },
    { id: 12346, date: "2015-02-15", type: "blood test", … }
  ]
}

Reference

Patients:
{ _id: 2, first: "Joe", last: "Patient", addr: { … }, procedures: [12345, 12346] }

Procedures:
{ _id: 12345, date: "2015-02-15", type: "Cat scan", … }
{ _id: 12346, date: "2015-02-15", type: "blood test", … }

Schema Alternatives – Do the math

• How complex are the queries?

• How much hardware/shards will I need?

Vital Sign Monitoring Device

Vital Signs Measured:
• Blood Pressure
• Pulse
• Blood Oxygen Levels

Produces data at regular intervals
• Once per minute

We have a hospital (or several hospitals) full of devices

Data From Vital Signs Monitoring Device

{
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: ISODate("2013-10-16T22:07:00.000-0500")
}

• One document per minute per device

• Relational approach

Document Per Hour (By minute)

{
  deviceId: 123456,
  spO2: { 0: 88, 1: 90, …, 59: 92 },
  pulse: { 0: 74, 1: 76, …, 59: 72 },
  bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78] },
  ts: ISODate("2013-10-16T22:00:00.000-0500")
}

• Store per-minute data at the hourly level

• Update-driven workload

• 1 document per device per hour
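One way the update-driven workload might look is sketched below: each per-minute reading becomes a `$set` on the minute's slot in the hourly document. The function name and the collection name `vitals` are assumptions, and the sketch assumes device timestamps use a whole-hour UTC offset so the UTC minute matches the local minute.

```javascript
// Build the filter and update for one per-minute reading in the
// document-per-hour model. Meant to be sent as an upsert, so the first
// reading of the hour creates the document and later ones update it.
function hourlyUpdate(reading) {
  const hour = new Date(reading.ts);
  hour.setUTCMinutes(0, 0, 0); // truncate to the start of the hour
  const minute = new Date(reading.ts).getUTCMinutes();
  return {
    filter: { deviceId: reading.deviceId, ts: hour },
    update: {
      $set: {
        [`spO2.${minute}`]: reading.spO2,
        [`pulse.${minute}`]: reading.pulse,
        [`bp.${minute}`]: reading.bp,
      },
    },
  };
}

const spec = hourlyUpdate({
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: "2013-10-16T22:07:00.000-05:00",
});
console.log(JSON.stringify(spec));
// Possible shell usage (collection name `vitals` is an assumption):
//   db.vitals.update(spec.filter, spec.update, { upsert: true })
```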

Characterizing Write Differences

• Example: data generated every minute

• Recording the data for 1 patient for 1 hour:

  Document Per Event: 60 inserts
  Document Per Hour: 1 insert, 59 updates

Characterizing Read Differences

• Want to graph 24 hours of vital signs for a patient:

  Document Per Event: 1440 reads
  Document Per Hour: 24 reads

• Read performance is greatly improved

Characterizing Memory and Storage Differences

                          Document Per Minute   Document Per Hour
Number Documents          52.6 B                876 M
Total Index Size          6364 GB               106 GB
  _id index               1468 GB               24.5 GB
  {ts: 1, deviceId: 1}    4895 GB               81.6 GB
Document Size             92 Bytes              758 Bytes
Database Size             4503 GB               618 GB

• 100K Devices

• 1 year's worth of data

How the numbers are derived:

                   Document Per Minute            Document Per Hour
Number Documents   100000 * 365 * 24 * 60         100000 * 365 * 24
Total Index Size   100000 * 365 * 24 * 60 * 130   100000 * 365 * 24 * 130
Database Size      100000 * 365 * 24 * 60 * 92    100000 * 365 * 24 * 758
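The figures in the comparison can be reproduced with a short calculation. This sketch assumes the table's "GB" means 2^30 bytes (which is what makes the numbers line up) and uses the 130 bytes of index per document shown in the slide formulas; the function and constant names are illustrative.

```javascript
const BYTES_PER_GB = 1024 ** 3; // the table's "GB" figures match 2^30 bytes
const INDEX_BYTES_PER_DOC = 130; // per-document index cost from the slides

// Estimate document count, index size, and data size for one schema.
function estimateSchema(devices, docsPerDevicePerYear, docBytes) {
  const docs = devices * docsPerDevicePerYear;
  return {
    docs,
    indexGB: Math.round((docs * INDEX_BYTES_PER_DOC) / BYTES_PER_GB),
    dataGB: Math.round((docs * docBytes) / BYTES_PER_GB),
  };
}

const perMinute = estimateSchema(100000, 365 * 24 * 60, 92);
const perHour = estimateSchema(100000, 365 * 24, 758);
console.log(perMinute, perHour);
```

The hourly document is about eight times larger, but there are sixty times fewer of them, so both the index and the data shrink.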

Best Practices

Loading Data

Rule of Thumb

• To saturate a MongoDB cluster
  – loader hardware ~= mongodb hardware

• Many threads

• Many mongos

Loader Architecture

[Diagram: a loader writes through mongos to a sharded cluster with three primaries and six secondaries. In the final build the labels "loader (8)" and "mongos (4)" each appear three times.]

Where are the bottlenecks?

• Use many threads

• Use multiple loader servers
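A minimal sketch of spreading loader threads across several mongos routers, so that no single router becomes the bottleneck. Reading "loader (8)" as eight threads per loader server is an assumption, the host names are hypothetical, and the function only produces an assignment plan; it does not open any connections.

```javascript
// Assign every loader thread to a mongos router in round-robin order so the
// write load is spread across routers. Host names are hypothetical.
function planLoaderThreads(loaderHosts, threadsPerLoader, mongosHosts) {
  const plan = [];
  let next = 0;
  for (const loader of loaderHosts) {
    for (let thread = 0; thread < threadsPerLoader; thread++) {
      plan.push({ loader, thread, mongos: mongosHosts[next % mongosHosts.length] });
      next += 1;
    }
  }
  return plan;
}

const loaderPlan = planLoaderThreads(
  ["loader-1", "loader-2", "loader-3"],
  8,
  ["mongos-1", "mongos-2", "mongos-3", "mongos-4"]
);
console.log(loaderPlan.length, loaderPlan[0]);
```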

When Sharding

• If you care about initial performance, you must pre-split

• Otherwise, initial performance will be slow

• (Hash sharding automatically pre-splits the collection)

Without presplitting

• sh.shardCollection("records.patients", {zipcode : 1})

[Diagram: four shards. Shard 1 initially holds a single chunk, -∞ … ∞; Shards 2, 3, and 4 are empty.]

[Diagram: after splitting, Shard 1 holds the chunks -∞ … 11305, 11306 … 44506, and 44507 … ∞. The loader writes through mongos to Shard 1 only.]

• 64 MB chunks

• Splitting will occur quickly

• Balancing occurs much more slowly

• The entire query workload goes to Shard 1

Split collection

• Split and distribute empty chunks before loading any data

• Evenly distribute query load across cluster

Shard 1: -∞ … 08333 | 08334 … 16667 | 16668 … 25000
Shard 2: 25001 … 33334 | 33335 … 41668 | 41669 … 50000
Shard 3: 50001 … 58334 | 58335 … 66668 | 66669 … 75000
Shard 4: 75001 … 83334 | 83335 … 91668 | 91669 … 99999

[Diagram: the loader writes through mongos to all four shards.]
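Pre-splitting along these lines could be scripted roughly as follows. The sketch computes evenly spaced split points for a numeric zip code key and formats them as `sh.splitAt` shell commands; the empty chunks would then still have to be distributed, for example with `sh.moveChunk`. The shard names, the numeric key range, and the function name are assumptions, and the computed boundaries differ slightly from the ones on the slide.

```javascript
// Compute evenly spaced split points for a numeric shard key and format the
// mongo shell commands that would create the empty chunks.
// Shard names are hypothetical; numeric zip codes in [0, 99999] are assumed.
function presplitPlan(ns, key, min, max, shards, chunksPerShard) {
  const totalChunks = shards.length * chunksPerShard;
  const span = max - min + 1;
  const splitPoints = [];
  for (let i = 1; i < totalChunks; i++) {
    splitPoints.push(min + Math.floor((i * span) / totalChunks));
  }
  const splitCommands = splitPoints.map(
    (p) => `sh.splitAt("${ns}", { ${key}: ${p} })`
  );
  // Intended placement: chunk indexes [first, last] for each shard.
  const placement = shards.map((shard, s) => ({
    shard,
    chunks: [s * chunksPerShard, (s + 1) * chunksPerShard - 1],
  }));
  return { splitPoints, splitCommands, placement };
}

const presplit = presplitPlan(
  "records.patients", "zipcode", 0, 99999,
  ["shard1", "shard2", "shard3", "shard4"], 3
);
console.log(presplit.splitCommands.join("\n"));
```

Eleven split points produce twelve chunks, three per shard, matching the layout in the diagram.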

Summary

Best Practices

1. Use servers with specifications that will provide good MongoDB performance
   – 64+ GB RAM, many cores, many IOPS (RAID-10/SSDs)

2. Calculate How Many Shards?
   1. Calculate required RAM and Disk Space
   2. Build a prototype to determine the ops/sec capacity of a server
   3. Do the math

3. Configure OS for Optimal MongoDB Performance
   – See MongoDB Production Notes
   – Review logs for warnings (Don’t ignore)

Best Practices (cont.)

4. Create a Document Schema
   – Denormalized

5. Tailor schema to application workload
   – Use application queries to guide schema design decisions
   – Consider alternative schemas
   – Compare cluster size (# of shards) and performance
   – Build a spreadsheet

Best Practices (cont.)

6. Loading Data
   – Loader Hardware ~= MongoDB hardware
   – Many threads
   – Many mongos

7. Pre-split
   – Ensure query workload is evenly distributed across the cluster from the start