Webinar: MongoDB Schema Design and Performance Implications

MongoDB Schema Design Patterns
Jumpstart Session

Sigfrido "Sig" Narváez
Sr. Solutions Architect, MongoDB
@SigNarvaez

Agenda

01 Medical Record Example
02 Schema Design: MongoDB vs. Relational
03 Modeling Relationships
04 Performance
05 What's new with 3.2
06 Summary, Q&A

Medical Record Example

Medical Records
• Collects all patient information in a central repository
• Provides a central point of access for
  • Patients
  • Care providers: physicians, nurses, etc.
  • Billing
  • Insurance reconciliation
• Hospitals, physicians, patients, procedures, records

[Diagram: Patient Records, Medications, Lab Results, Procedures, Hospital Records, Physicians, Patients, Nurses, Billing]

Medical Record Data

• Hospitals
  • Have physicians
• Physicians
  • Have patients
  • Perform procedures
  • Belong to hospitals
• Patients
  • Have physicians
  • Are the subject of procedures
• Procedures
  • Associated with a patient
  • Associated with a physician
  • Have a record
  • Variable metadata
• Records
  • Associated with a procedure
  • Binary data
  • Variable fields

Lots of variability

Schema Design: MongoDB vs. Relational

MongoDB                      Relational
Collections                  Tables
Documents                    Rows
Data Use                     Data Storage
What questions do I have?    What answers do I have?

MongoDB vs. Relational

Attribute       MongoDB                   Relational
Storage         N-dimensional             Two-dimensional
Field Values    0, 1, many, or embed      Single value
Query           Any field, at any level   Any field
Schema          Flexible                  Very structured

MongoDB vs. Relational

Complex Normalized Schemas


Documents are Rich Data Structures

{
  first_name: 'Paul',
  last_name: 'Miller',
  cell: 1234567890,
  city: 'London',
  location: [45.123, 47.232],
  professions: ['banking', 'finance', 'trader'],
  physicians: [
    { name: 'Canelo Álvarez, M.D.', last_visit: 'Del Carmen Hospital', last_visit_dt: '20160501', … },
    { name: 'Érik Morales, M.D.', last_visit: 'Del Prado Hospital', last_visit_dt: '20160302', … }
  ]
}

• Fields hold strongly typed values: String (first_name), Number (cell), Geo-Coordinates (location)
• Fields can contain arrays (professions)
• Fields can contain an array of sub-documents (physicians)
• Fields can be indexed and queried at any level
• ORM layer removed: data is already an object!

Modeling Relationships

1-1

Referencing & Embedding

https://docs.mongodb.com/manual/core/data-modeling-introduction/




Referencing: use two collections with a reference field ("relational")

Procedure
• patient
• date
• type
• physician
• type

Results
• dataType
• size
• content: {…}

Embedding: a single document schema

Procedure
• patient
• date
• type
• results
  • equipmentId
  • data1
  • data2
• physician
• Results
  • type
  • size
  • content: {…}

Referencing

Procedure
{
  "_id" : 333,
  "date" : "2003-02-09T05:00:00",
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result_id" : 134
}

Results
{
  "_id" : 134,
  "type" : "txt",
  "size" : NumberInt(12),
  "content" : { value1: 343, value2: "abc", … }
}

Embedding

Procedure
{
  "_id" : 333,
  "date" : "2003-02-09T05:00:00",
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : { value1: 343, value2: "abc", … }
  }
}

Embedding

• Advantages
  • Retrieve all relevant information in a single query/document
  • Avoid implementing joins in application code
  • Update related information as a single atomic operation
    • MongoDB doesn't offer multi-document transactions (as of 3.2)
• Limitations
  • Large documents mean more overhead if most fields are not relevant
  • 16 MB document size limit

Atomicity

• Document operations are atomic

db.patients.update(
  { _id: 12345 },
  {
    $inc : { numProcedures : 1 },
    $push : { procedures : "proc123" },
    $set : { "addr.state" : "TX" }
  }
)

• No multi-document transactions (as of 3.2); the following is not possible:

db.beginTransaction();
db.patients.update({_id: 12345}, …);
db.procedure.insert({_id: "proc123", …});
db.records.insert({_id: "rec123", …});
db.endTransaction();
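To make the single-document update concrete, here is a minimal in-memory sketch in plain JavaScript. It is a simplified model, not the server's implementation: `applyUpdate` and `setPath` are hypothetical helpers, only `$inc`, `$push`, and `$set` are handled, and the sample patient values are made up. It shows all three operators landing on one document in one step, including the dotted `"addr.state"` path.

```javascript
// Hypothetical helper: write a value at a dotted path such as "addr.state".
function setPath(doc, path, value) {
  const keys = path.split(".");
  let node = doc;
  for (const key of keys.slice(0, -1)) {
    if (typeof node[key] !== "object" || node[key] === null) node[key] = {};
    node = node[key];
  }
  node[keys[keys.length - 1]] = value;
}

// Simplified model of a single-document update ($inc and $push on
// top-level fields only, $set on dotted paths).
function applyUpdate(doc, update) {
  // Work on a copy so the caller sees either all changes or none.
  const next = JSON.parse(JSON.stringify(doc));
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    next[field] = (next[field] || 0) + amount;
  }
  for (const [field, item] of Object.entries(update.$push || {})) {
    next[field] = [...(next[field] || []), item];
  }
  for (const [path, value] of Object.entries(update.$set || {})) {
    setPath(next, path, value);
  }
  return next;
}

// Made-up starting document for illustration.
const patient = {
  _id: 12345,
  numProcedures: 2,
  procedures: ["proc1", "proc2"],
  addr: { state: "NH" },
};

const updated = applyUpdate(patient, {
  $inc: { numProcedures: 1 },
  $push: { procedures: "proc123" },
  $set: { "addr.state": "TX" },
});

console.log(updated);
```

The copy-then-return shape is only a way to mimic "all or nothing" in a sketch; the real guarantee comes from the database applying the update to one document.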


Referencing

• Advantages
  • Smaller documents
  • Less likely to reach 16 MB document limit
  • Infrequently accessed information not accessed on every query
  • No duplication of data
• Limitations
  • Two queries required to retrieve information
  • Cannot update related information atomically
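The two-query cost of referencing can be sketched with an in-memory stand-in. The snippet below assumes the Procedure and Results documents from the 1-1 example; the `procedures` and `results` Maps and the `findProcedureWithResult` helper are hypothetical names, and each `Map.get` stands in for one round trip to the database.

```javascript
// In-memory stand-ins for two collections, keyed by _id.
const procedures = new Map([
  [333, { _id: 333, patient: "John Doe", type: "Chest X-ray", result_id: 134 }],
]);
const results = new Map([
  [134, { _id: 134, type: "txt", size: 12, content: { value1: 343, value2: "abc" } }],
]);

// Hypothetical helper: the application resolves the reference itself.
function findProcedureWithResult(procedureId) {
  let queries = 0;

  const procedure = procedures.get(procedureId); // query 1: the procedure
  queries += 1;
  if (!procedure) return { procedure: null, result: null, queries };

  const result = results.get(procedure.result_id) || null; // query 2: its result
  queries += 1;

  return { procedure, result, queries };
}

const joined = findProcedureWithResult(333);
console.log(joined.queries, joined.result.type); // 2 txt
```

With the embedded schema the same information would come back from the first lookup alone.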

1-1: General Recommendations

• Embed
  • No additional data duplication
  • Can query or index on embedded field
    • e.g., "result.type"
• Exceptional cases…
  • Embedding results in large documents
  • Set of infrequently accessed fields

{
  "_id": 333,
  "date": "2003-02-09T05:00:00",
  "hospital": "County Hills",
  "patient": "John Doe",
  "physician": "Stephen Smith",
  "type": "Chest X-ray",
  "result": {
    "type": "txt",
    "size": 12,
    "content": {
      "value1": 343,
      "value2": "abc"
    }
  }
}

1-M: Embed

Patients
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan", … },
    { id: 12346, date: "2015-02-15", type: "blood test", … }
  ]
}

1-M: Modeled in 2 possible ways

Reference

Patients
{ _id: 2, first: "Joe", last: "Patient", addr: { … }, procedures: [12345, 12346] }

Procedures
{ _id: 12345, date: "2015-02-15", type: "Cat scan", … }
{ _id: 12346, date: "2015-02-15", type: "blood test", … }

1-M: General Recommendations

• Embed, when possible
  • Many are weak entities
  • Access all information in a single query
  • Take advantage of update atomicity
  • No additional data duplication
  • Can query or index on any field
    • e.g., { "phones.type": "mobile" }
• Exceptional cases:
  • 16 MB document size
  • Large number of infrequently accessed fields

{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan", … },
    { id: 12346, date: "2015-02-15", type: "blood test", … }
  ]
}

M-M

M-M: Traditional Relational Association

Join table
• Physicians: name, specialty, phone
• Hospitals: name
• HosPhysicanRel: hospitalId, physicianId

Use arrays instead of a join table.

M-M: Embedding Physicians in Hospitals collection

{
  _id: 1,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [
    { id: 12345, name: "Joe Doctor", address: {…}, … },
    { id: 12346, name: "Mary Well", address: {…}, … }
  ]
}

{
  _id: 2,
  name: "Plainmont Hospital",
  city: "Omaha",
  beds: 85,
  physicians: [
    { id: 63633, name: "Harold Green", address: {…}, … },
    { id: 12345, name: "Joe Doctor", address: {…}, … }
  ]
}

Data duplication… is ok!

M-M: Referencing

Hospitals
{ _id: 1, name: "Oak Valley Hospital", city: "New York", beds: 131, physicians: [12345, 12346] }
{ _id: 2, name: "Plainmont Hospital", city: "Omaha", beds: 85, physicians: [63633, 12345] }

Physicians
{ _id: 63633, name: "Harold Green", hospitals: [2], … }
{ _id: 12345, name: "Joe Doctor", hospitals: [1, 2], … }
{ _id: 12346, name: "Mary Well", hospitals: [1], … }

M-M: General Recommendation

• Use case determines whether to reference or embed:
  1. Data duplication
     • Embedding may result in data duplication
     • Duplication may be okay if reads dominate updates
     • Of the two, which one changes the least?
  2. Referencing may be required if there are many related items
  3. Hybrid approach
     • Potentially do both. It's ok!

Reference

Hospitals
{ _id: 1, name: "Oak Valley Hospital", city: "New York", beds: 131, physicians: [12345, 12346] }

Physicians
{ _id: 12345, name: "Joe Doctor", address: {…}, … }
{ _id: 12346, name: "Mary Well", address: {…}, … }
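The trade-off behind "duplication may be okay if reads dominate updates" can be sketched in plain JavaScript. This is a minimal in-memory model under stated assumptions: the hybrid shape (hospitals embed only a physician's id and name, full physician documents live separately) is one possible reading of the hybrid approach, the sample data is trimmed from the slides, and `physicianNames` and `renamePhysician` are hypothetical helpers.

```javascript
// Full physician documents live in their own collection.
const physicians = [
  { _id: 12345, name: "Joe Doctor", hospitals: [1, 2] },
  { _id: 12346, name: "Mary Well", hospitals: [1] },
];

// Hospitals embed a small, duplicated summary of each physician.
const hospitals = [
  {
    _id: 1,
    name: "Oak Valley Hospital",
    physicians: [
      { id: 12345, name: "Joe Doctor" },
      { id: 12346, name: "Mary Well" },
    ],
  },
  { _id: 2, name: "Plainmont Hospital", physicians: [{ id: 12345, name: "Joe Doctor" }] },
];

// Read path: one hospital document already carries the names to display.
function physicianNames(hospitalId) {
  const hospital = hospitals.find((h) => h._id === hospitalId);
  return hospital ? hospital.physicians.map((p) => p.name) : [];
}

// Write path: a rename must touch the physician and every duplicated copy.
function renamePhysician(physicianId, newName) {
  let documentsTouched = 0;
  const physician = physicians.find((p) => p._id === physicianId);
  if (physician) {
    physician.name = newName;
    documentsTouched += 1;
  }
  for (const hospital of hospitals) {
    const copy = hospital.physicians.find((p) => p.id === physicianId);
    if (copy) {
      copy.name = newName;
      documentsTouched += 1;
    }
  }
  return documentsTouched;
}

console.log(physicianNames(1)); // [ 'Joe Doctor', 'Mary Well' ]
console.log(renamePhysician(12345, "Joseph Doctor")); // 3
```

Reads stay at one document per hospital, while a rename fans out to every copy, which is why the slide asks which side changes the least.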

Performance

Example 1: Hybrid Approach
Embed and Reference

Healthcare Example
• patients
• procedures

Tailor Schema to Queries

Find all patients from NH that have had chest x-rays

Patient
{
  "_id" : 593340651,
  "first" : "Gregorio",
  "last" : "Lang",
  "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 },
  "physicians" : [10387, 33456],
  "procedures" : ["551ac", "343fs"]
}

Procedure
{
  "_id" : "551ac",
  "date" : "2000-04-26",
  "hospital" : 161,
  "patient" : 593340651,
  "physician" : 10387,
  "type" : "Chest X-ray",
  "records" : ["67bc6"]
}

Tailor Schema to Queries (cont.)

Patient
{
  "_id" : 593340651,
  "first" : "Gregorio",
  "last" : "Lang",
  "addr" : { "street" : "623 Flowers Rd", "city" : "Groton", "state" : "NH", "zip" : 3266 },
  "physicians" : [10387, 33456],
  "procedures" : [
    { id : "551ac", type : "Chest X-ray" },
    { id : "343fs", type : "Blood Test" }
  ]
}

Patient and Procedure: 3.2's $lookup!! (left-outer join)
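With the procedure type copied into the patient document, the question "find all patients from NH that have had chest x-rays" needs only the patients collection. The sketch below is an in-memory stand-in in plain JavaScript; the second and third sample patients and the `patientsByStateAndProcedure` helper are hypothetical, and the shell query in the comment assumes a collection named `patients`.

```javascript
// Sample data following the hybrid Patient shape; only the first
// document comes from the slides, the other two are made up.
const patients = [
  {
    _id: 593340651, first: "Gregorio", last: "Lang", addr: { state: "NH" },
    procedures: [{ id: "551ac", type: "Chest X-ray" }, { id: "343fs", type: "Blood Test" }],
  },
  {
    _id: 2, first: "Joe", last: "Patient", addr: { state: "TX" },
    procedures: [{ id: "77aa1", type: "Chest X-ray" }],
  },
  {
    _id: 3, first: "Ann", last: "Example", addr: { state: "NH" },
    procedures: [{ id: "88bb2", type: "Blood Test" }],
  },
];

// Equivalent in spirit to the shell query:
//   db.patients.find({ "addr.state": "NH", "procedures.type": "Chest X-ray" })
function patientsByStateAndProcedure(docs, state, procedureType) {
  return docs.filter(
    (p) => p.addr.state === state && p.procedures.some((proc) => proc.type === procedureType)
  );
}

const matches = patientsByStateAndProcedure(patients, "NH", "Chest X-ray");
console.log(matches.map((p) => p._id)); // [ 593340651 ]
```

With the purely referenced schema, the same question would need the procedures collection as well, since the patient document holds only procedure ids.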

Example 2: Time Series Data
Medical Devices

Vital Sign Monitoring Device

Vital signs measured:
• Blood Pressure
• Pulse
• Blood Oxygen Levels

Produces data at regular intervals
• Once per minute
• Many Devices, Many Hospitals

Data From Vital Signs Monitoring Device

{
  deviceId: 123456,
  ts: ISODate("2013-10-16T22:07:00.000-0500"),
  spO2: 88,
  pulse: 74,
  bp: [128, 80]
}

• One document per minute per device
• Relational approach

Document Per Hour (By minute)

{
  deviceId: 123456,
  ts: ISODate("2013-10-16T22:00:00.000-0500"),
  spO2: { 0: 88, 1: 90, …, 59: 92 },
  pulse: { 0: 74, 1: 76, …, 59: 72 },
  bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78] }
}

• 1 document per device per hour
• Store per-minute data at the hourly level
• Update-driven workload

Characterizing Write Differences

• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:

Document Per Event    60 inserts
Document Per Hour     1 insert, 59 updates
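The "1 insert, 59 updates" pattern can be sketched as a function that turns each per-minute reading into a write against the hourly document. This is a minimal sketch under assumptions: `toHourlyWrite` is a hypothetical helper, readings are assumed to arrive in order with the first of each hour at minute 0, and the write objects only describe what would be sent to the database rather than calling a driver.

```javascript
// Hypothetical helper: map one reading to a write on the hourly bucket.
function toHourlyWrite(reading) {
  const minute = reading.ts.getUTCMinutes();

  // The bucket is identified by device and the top of the hour.
  const hour = new Date(reading.ts.getTime());
  hour.setUTCMinutes(0, 0, 0);
  const filter = { deviceId: reading.deviceId, ts: hour };

  if (minute === 0) {
    // First reading of the hour: insert the bucket document.
    return {
      op: "insert",
      doc: {
        ...filter,
        spO2: { 0: reading.spO2 },
        pulse: { 0: reading.pulse },
        bp: { 0: reading.bp },
      },
    };
  }

  // Later readings: set the minute slot inside the existing document.
  return {
    op: "update",
    filter,
    update: {
      $set: {
        [`spO2.${minute}`]: reading.spO2,
        [`pulse.${minute}`]: reading.pulse,
        [`bp.${minute}`]: reading.bp,
      },
    },
  };
}

// One hour of made-up readings starting at 2013-10-16T22:00-0500 (03:00 UTC).
const start = Date.UTC(2013, 9, 17, 3, 0, 0);
const writes = [];
for (let m = 0; m < 60; m++) {
  writes.push(
    toHourlyWrite({ deviceId: 123456, ts: new Date(start + m * 60000), spO2: 88, pulse: 74, bp: [128, 80] })
  );
}

const inserts = writes.filter((w) => w.op === "insert").length;
const updates = writes.filter((w) => w.op === "update").length;
console.log(inserts, updates); // 1 59
```

A production version would likely use an upsert so that a missed minute-0 reading does not lose the hour; that choice is outside what the slides state.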

Characterizing Read Differences

• Want to graph 24 hours of vital signs for a patient:

Document Per Event    1440 reads
Document Per Hour     24 reads

• Read performance is greatly improved

Characterizing Memory and Storage Differences

                          Document Per Minute    Document Per Hour
Number Documents          52.6 Billion           876 Million
Total Index Size          6,364 GB               106 GB
  _id index               1,468 GB               24.5 GB
  {ts: 1, deviceId: 1}    4,895 GB               81.6 GB
Document Size             92 Bytes               758 Bytes
Database Size             4,503 GB               618 GB

• 100K Devices
• 1 year's worth of data, at minute resolution (365 x 24 x 60)
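The document counts and database sizes in the table follow from the stated inputs, which the arithmetic below reproduces. One assumption is labeled in the code: the sizes match the slide's "GB" figures when a GB is taken as 2^30 bytes and the database size is simply document count times document size.

```javascript
const devices = 100000;
const minutesPerYear = 365 * 24 * 60; // 525,600
const hoursPerYear = 365 * 24; // 8,760

// One document per device per minute vs. per hour.
const perMinuteDocs = devices * minutesPerYear; // 52,560,000,000
const perHourDocs = devices * hoursPerYear; // 876,000,000

// Assumption: "GB" on the slide means 2^30 bytes.
const bytesPerGB = 2 ** 30;
const perMinuteSizeGB = Math.round((perMinuteDocs * 92) / bytesPerGB);
const perHourSizeGB = Math.round((perHourDocs * 758) / bytesPerGB);

console.log(perMinuteDocs, perHourDocs); // 52560000000 876000000
console.log(perMinuteSizeGB, perHourSizeGB); // 4503 618
```

The hourly document is roughly eight times larger (758 vs. 92 bytes), but there are sixty times fewer of them, which is where the savings in data and index size come from.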

MongoDB 3.2

MongoDB 3.2 – a GIANT Release

2.2: Agg. Framework, Location-Aware Sharding
2.4: Hash-Based Sharding, Roles, Kerberos, On-Prem Monitoring
2.6: $out, Index Intersection, Text Search, Field-Level Redaction, LDAP & x509, Auditing
3.0: Doc-Level Concurrency, Compression, Storage Engine API, ≤50 replicas, Auditing ++, Ops Manager
3.2: Document Validation, Fast Failover, Simpler Scalability, Aggregation ++, Encryption At Rest, In-Memory Storage Engine, BI Connector, $lookup, MongoDB Compass, APM Integration, Profiler Visualization, Auto Index Builds, Backups to File System

Tools

• mgenerate
  • Part of mtools: https://github.com/rueckstiess/mtools/wiki/mgenerate
  • Model schema using a JSON definition
  • Generate millions of documents with random data
  • How well does the schema work?
    • Queries, Indexes, Data Size, Index Size, Replication
• Demo

Documents are Rich Data Structures

{
  first_name: 'Paul',
  last_name: 'Miller',
  cell: 1234567890,
  city: 'London',
  location: [45.123, 47.232],
  professions: ['banking', 'finance', 'trader'],
  physicians: [
    { name: 'Canelo Álvarez, M.D.', last_visit: 'Mission Hospital', last_visit_dt: '20160501', … },
    { name: 'Érik Morales, M.D.', last_visit: 'Del Prado Hospital', last_visit_dt: '20160302', … }
  ]
}

• Fields hold typed values: String (first_name), Number (cell), Geo-Coordinates (location)
• Fields can contain arrays (professions)
• Fields can contain an array of sub-documents (physicians)
• Fields can be indexed and queried at any level
• ORM layer removed: data is already an object!

Schema using mgenerate

{
  "first_name" : { "$string" : { "length" : 30 } },
  "last_name" : { "$string" : { "length" : 30 } },
  "cell" : "$number",
  "city" : { "$string" : { "length" : 30 } },
  "location" : [ "$number", "$number" ],
  "professions" : { "$array" : [
    { "$choose" : [ "banking", "finance", "trader" ] },
    { "$number" : [1, 3] }
  ] },
  "physicians" : { "$array" : [
    {
      "name" : { "$string" : { "length" : 30 } },
      "last_visit" : { "$string" : { "length" : 30 } },
      "last_visit_dt" : "$datetime"
    },
    { "$number" : [1, 5] }
  ] }
}

> mgenerate --host localhost --port 27017 -d webinar -c patients --drop -n 100 patients.json

Use Compass to visualize & query data!

Visual Query Profiler
Identify your slow-running queries with the click of a button

Index Suggestions
Index recommendations to improve your deployment


MongoDB 3.2 $lookup

Obtain Patient view with Procedure details, but without Physicians

Patient
{
  "_id": 593340651,
  "first": "Gregorio",
  "last": "Lang",
  "addr": {
    "street": "623 Flowers Rd",
    "city": "Groton",
    "state": "NH",
    "zip": 3266
  },
  "physicians": [10387, 33456],
  "procedures": ["551ac", "343fs"]
}

MongoDB 3.2 $lookup

db.PatientsColl.aggregate([
  { "$match" : { "_id" : 593340651 } },
  { "$unwind" : "$procedures" },
  { "$lookup" : {
      "from" : "ProceduresColl",
      "localField" : "procedures",
      "foreignField" : "_id",
      "as" : "procs"
  } },
  { "$unwind" : "$procs" },
  { "$group" : {
      "_id" : { "_id" : "$_id", "first" : "$first", "last" : "$last", "addr" : "$addr" },
      "procedures" : { "$push" : "$procs" }
  } },
  { "$project" : {
      "_id" : "$_id._id",
      "first" : "$_id.first",
      "last" : "$_id.last",
      "addr" : "$_id.addr",
      "procedures._id" : 1,
      "procedures.type" : 1,
      "procedures.date" : 1
  } }
]);

Result: Patient view with Procedure details, but without Physicians

{
  "_id": 593340651,
  "first": "Gregorio",
  "last": "Lang",
  "addr": {
    "street": "623 Flowers Rd",
    "city": "Groton",
    "state": "NH",
    "zip": 3266
  },
  "procedures": [
    { "_id": "551ac", "date": "2000-04-26", "type": "Chest X-ray" },
    { "_id": "343fs", "date": "2000-04-26", "type": "Blood Test" }
  ]
}

https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/
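What the pipeline computes can be traced with an in-memory sketch in plain JavaScript. This is an illustration of the join logic only, not how the server executes `$lookup`: `patientView` is a hypothetical helper, and the fields of the second procedure beyond `_id`, `date`, and `type` are assumed. One detail worth noting is that `$lookup` alone is a left outer join, but the second `$unwind` drops procedure ids with no match, which the `filter(Boolean)` step mirrors.

```javascript
// In-memory stand-ins for PatientsColl and ProceduresColl.
const patientsColl = [
  {
    _id: 593340651, first: "Gregorio", last: "Lang",
    addr: { street: "623 Flowers Rd", city: "Groton", state: "NH", zip: 3266 },
    physicians: [10387, 33456],
    procedures: ["551ac", "343fs"],
  },
];
const proceduresColl = [
  {
    _id: "551ac", date: "2000-04-26", hospital: 161, patient: 593340651,
    physician: 10387, type: "Chest X-ray", records: ["67bc6"],
  },
  // Fields other than _id, date and type are assumed for this sketch.
  { _id: "343fs", date: "2000-04-26", patient: 593340651, type: "Blood Test" },
];

// Hypothetical helper mirroring $match, $lookup, $group and $project.
function patientView(patientId) {
  const patient = patientsColl.find((p) => p._id === patientId); // $match
  if (!patient) return null;

  const byId = new Map(proceduresColl.map((proc) => [proc._id, proc]));
  const procedures = patient.procedures
    .map((id) => byId.get(id)) // $lookup on procedures -> _id
    .filter(Boolean) // unmatched ids vanish, as after the second $unwind
    .map(({ _id, date, type }) => ({ _id, date, type })); // $project

  // Physicians are deliberately left out of the view.
  const { _id, first, last, addr } = patient;
  return { _id, first, last, addr, procedures };
}

const view = patientView(593340651);
console.log(JSON.stringify(view, null, 2));
```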



MongoDB 3.2 Document Validation

db.runCommand({
  collMod: "Patients",
  validator: { $and: [
    { "first_name" : { "$type" : "string" } },
    { "last_name" : { "$type" : "string" } },
    { "physicians" : { "$type" : "array" } }
  ] },
  validationLevel: "strict"
});

https://docs.mongodb.com/manual/core/document-validation/

All Patient records must have string values for the first and last name, and a list of Physicians
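The intent of the validator can be sketched as a plain JavaScript predicate, useful for reasoning about which documents should be accepted. `passesValidator` is a hypothetical helper and the sample documents are made up; the sketch models the rule as stated on the slide and does not reproduce how the server's `$type` operator treats array-valued fields, which is version-specific.

```javascript
// Hypothetical predicate mirroring the intent of the $and validator:
// first_name and last_name are strings, physicians is an array.
function passesValidator(doc) {
  return (
    typeof doc.first_name === "string" &&
    typeof doc.last_name === "string" &&
    Array.isArray(doc.physicians)
  );
}

// Made-up candidate documents.
const candidates = [
  { first_name: "Paul", last_name: "Miller", physicians: [{ name: "Érik Morales, M.D." }] },
  { first_name: 42, last_name: "Miller", physicians: [] }, // first_name is not a string
  { first_name: "Paul", last_name: "Miller" }, // physicians is missing
];

const accepted = candidates.map(passesValidator);
console.log(accepted); // [ true, false, false ]
```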




Summary

01 Embedding and Referencing
02 1-1: Embed; 1-M: Embed when possible; M-M: Hybrid
03 Decisions: context of application data and query workload
04 Iterate: different schemas may result in dramatically different query performance, data/index size and hardware requirements!
05 Tools! Measure data/index size and query performance
   • mgenerate/mtools
   • Compass
   • Cloud Manager / Ops Manager
06 3.2: $lookup, Document Validation

Q&A

Sigfrido Narváez
Sr. Solutions Architect, MongoDB