[PASS Summit 2016] Azure DocumentDB: A Deep Dive into Advanced Features

Azure DocumentDB: Deep Dive into Advanced Features

Aravind Ramachandran, Program Manager, Azure DocumentDB, @arkramac

Andrew Liu, Program Manager, Azure DocumentDB, @aliuy8

A Quick Recap…

3 V’s of data: Endless possibilities

Learning

Gaming

Retail

Telematics

Mobile Apps

IoT

Velocity: High throughput with low latency

Volume: Massive amounts of data

Variety: Schema-freedom

The 2x2s of database tradeoffs

[Figure: six 2x2 tradeoff charts, one per pair of axes]

• Latency (Low to High) vs. Durability (Low to High)
• Schema/Index Management (Agnostic to Required) vs. Query (Poor to Rich)
• Availability (Low to High) vs. Programmability (Low to High)
• Scale (Static to Elastic) vs. Distribution (Single DC to World)
• Scale (Low to High) vs. Txn Scope (Single item to Multiple items)
• Performance Isolation (Noisy Neighbor to Airtight) vs. TCO (Low to High)

DocumentDB: Capabilities

Guaranteed low latency
• <10 ms reads / <15 ms writes @ P99
• Requests are served from the local region
• Write-optimized, latch-free database engine designed for SSDs and low-latency access
• Synchronous and automatic document indexing at sustained ingestion rates

Elastic and limitless global scale
• Independently scale throughput and storage, locally and globally
• Transparent partition management and routing

Multiple consistency levels
• Multiple well-defined consistency levels
• Intuitive programming model for relaxed consistency models
• Clear PACELC tradeoffs and 99.99% availability SLAs

SQL and JavaScript: schema-free
• Automatic tree-path-based indexing
• No schemas or secondary indices required upfront
• SQL and JavaScript language-integrated queries
• Hash, range, and spatial indexes
• Multi-document, JavaScript language-integrated transactions

DocumentDB resource model

1. Resources
• Identified by their logical and stable URI
• Represented as JSON documents
• Partitioned, spanning machines, clusters and regions

2. Resource model
• Stateless interaction (HTTP and TCP)
• Hierarchical overlay atop the partitioning model

3. Partitioning model
• Grid partitioning: horizontal based on hash/range and vertical across regions
• Each partition is made highly available via a replica set

[Diagram: partition sets spanning US-East, US-West, and N Europe. Partitions within a region show local distribution; each partition is backed by a replica set. Partition sets across regions show global distribution.]

Accessing DocumentDB

[Diagram: two client paths into the DocumentDB service]

• DocumentDB client SDKs and tools (Java, .NET, …) and the DocumentDB Hadoop and Spark connectors: JSON, SQL, and JavaScript over TCP/SSL or HTTPS
• MongoDB toolchain, MongoDB client drivers (Java, .NET, Ruby, …), and the Parse SDK: BSON over the MongoDB wire protocol (drivers for MongoDB)

Let’s talk about…

• Modeling JSON Documents

• Collections and Scaling

• Query and Indexing

• Global Distribution

• Tips and Best Practices

Everything you need to know to build blazing fast, planet-scale applications!

Let’s talk about JSON documents

"With great power comes great responsibility"
- Uncle Ben

DocumentDB gives you the power of true schema-freedom. Generally de-normalize… but don't just do it blindly.

How do approaches differ?

Data normalization, ORM


Come as you are



Modeling Data: The Relational Way

[Diagram: relational schema with tables Person, Address, ContactDetail, ContactDetailType, and the link table PersonContactDetailLnk (PersonId, ContactDetailId), related through their Id columns.]

Modeling Data: The Document Way

[Diagram: a single Person document (Id) that embeds an Addresses array (Address, …) and a ContactDetails array (ContactDetail, …).]

{
  "id": "0ec1ab0c-de08-4e42-a429-...",
  "addresses": [
    { "street": "1 Redmond Way", "city": "Redmond", "state": "WA", "zip": 98052 }
  ],
  "contactDetails": [
    { "type": "home", "detail": "555-1212" },
    { "type": "email", "detail": "[email protected]" }
  ],
  ...
}

To embed, or to reference, that is the question

Data modeling with denormalization

{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    { "line1": "100 Some Street", "line2": "Unit 1", "city": "Seattle", "state": "WA", "zip": 98012 }
  ],
  "contactDetails": [
    { "email": "[email protected]" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}

Try to model your entity as a self-contained document.

Generally, use embedded data models when:
• There are "contains" relationships between entities
• There are one-to-few relationships between entities
• Embedded data changes infrequently
• Embedded data won’t grow without bounds
• Embedded data is integral to data in a document

Denormalizing typically provides better read performance

Data modeling with referencing

In general, use normalized data models when:
• Write performance is more important than read performance
• Representing one-to-many relationships
• Representing many-to-many relationships
• Related data changes frequently

Referencing provides more flexibility than embedding, at the cost of more round trips to read data.

User document:
{ "id": "xyz", "username": "user xyz" }

Address document:
{ "id": "address_xyz", "userid": "xyz", "address": { … } }

Contact details document:
{ "id": "contact_xyz", "userid": "xyz", "email": "[email protected]", "phone": "555 5555" }

Normalizing typically provides better write performance
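The read-side cost of referencing can be sketched with a small in-memory model. The `FakeCollection` class and its `get` method are hypothetical stand-ins rather than DocumentDB SDK calls, and the sample field values are made up; each `get` simply counts as one round trip.

```javascript
// In-memory stand-in for a collection; every get() counts as one round trip.
class FakeCollection {
  constructor(docs) {
    this.docs = new Map(docs.map((d) => [d.id, d]));
    this.roundTrips = 0;
  }
  get(id) {
    this.roundTrips += 1;
    return this.docs.get(id);
  }
}

// Embedded model: one self-contained document per user.
const embedded = new FakeCollection([
  { id: "xyz", username: "user xyz", address: { city: "Seattle" }, phone: "555 5555" },
]);

// Referenced model: three documents linked by the user id.
const referenced = new FakeCollection([
  { id: "xyz", username: "user xyz" },
  { id: "address_xyz", userid: "xyz", address: { city: "Seattle" } },
  { id: "contact_xyz", userid: "xyz", phone: "555 5555" },
]);

function readProfileEmbedded(store, userId) {
  return store.get(userId);
}

function readProfileReferenced(store, userId) {
  const user = store.get(userId);
  const address = store.get("address_" + userId);
  const contact = store.get("contact_" + userId);
  return { ...user, address: address.address, phone: contact.phone };
}

const profileA = readProfileEmbedded(embedded, "xyz");
const profileB = readProfileReferenced(referenced, "xyz");
console.log(embedded.roundTrips, referenced.roundTrips); // 1 3
```

The write side is the mirror image: changing only the phone number touches one small contact document in the referenced model, but rewrites the whole user document in the embedded one.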

No magic bullet

Hybrid approach: model at the property level (as opposed to the record level)

Optimize your data model for your workload… (as opposed to blindly following types)

Modeling impacts RU due to document size

Hybrid models

Author document:
{
  "id": "1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "countOfBooks": 3,
  "books": [1, 2, 3],
  "images": [
    {"thumbnail": "http://....png"},
    {"profile": "http://....png"}
  ]
}

Book document:
{
  "id": 1,
  "name": "DocumentDB 101",
  "authors": [
    {"id": 1, "name": "Thomas Andersen", "thumbnail": "http://....png"},
    {"id": 2, "name": "William Wakefield", "thumbnail": "http://....png"}
  ]
}
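One way to read the hybrid model: the author keeps book ids (a reference) plus a precomputed count, while the book embeds a small copy of each author. The sketch below shows the bookkeeping this implies on write. `linkAuthorAndBook` is a hypothetical helper written for illustration; it works on plain objects and does not call any DocumentDB API.

```javascript
// Hypothetical helper: records that `author` wrote `book` and returns updated copies of both documents.
function linkAuthorAndBook(author, book) {
  const updatedAuthor = {
    ...author,
    books: [...author.books, book.id],      // reference: only the book id is stored on the author
    countOfBooks: author.countOfBooks + 1,  // denormalized counter, so reads need no counting
  };
  const updatedBook = {
    ...book,
    authors: [
      ...book.authors,
      // embed only the small fields needed to display a book
      { id: author.id, name: author.firstName + " " + author.lastName },
    ],
  };
  return { author: updatedAuthor, book: updatedBook };
}

const author = { id: "1", firstName: "Thomas", lastName: "Andersen", countOfBooks: 0, books: [] };
const book = { id: 1, name: "DocumentDB 101", authors: [] };
const linked = linkAuthorAndBook(author, book);
console.log(linked.author.countOfBooks, linked.book.authors[0].name); // 1 Thomas Andersen
```

The price of the embedded author copy is that a name change has to be propagated to every book document that carries it, which is why it suits data that changes rarely.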

Collections + Elastic Scale

Elastic scale

Measuring Throughput (Request Units)

• Replica gets a fixed budget of request units
• Request Unit/sec (RU) is the normalized currency (% IOPS, % CPU, % memory)
• Operations consume request units (RUs):
  READ: GET Document
  INSERT: POST Documents
  REPLACE: PUT Document
  Query: POST Documents
  …

[Chart: incoming requests plotted against the Min RU/sec and Max RU/sec bounds, with regions labeled "Replica quiescent", "No throttling", and "Rate limit".]

• Requests get rate limited if they exceed the SLA
• Customers pay for reserved request units by the hour
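The budget idea can be sketched as a per-second counter. This is a toy model only: the RU charges in `RU_COST` are made-up numbers (real charges depend on document size, indexing, and query shape), and `RequestUnitBudget` is a hypothetical class, not how the service is implemented.

```javascript
// Illustrative RU charges per operation; the numbers are assumptions for this sketch.
const RU_COST = { read: 1, insert: 5, replace: 10, query: 3 };

class RequestUnitBudget {
  constructor(provisionedRuPerSec) {
    this.provisioned = provisionedRuPerSec;
    this.window = null; // index of the current one-second window
    this.consumed = 0;  // RUs consumed in the current window
  }

  // Returns true if the operation is admitted, false if it is rate limited.
  tryConsume(operation, nowMs) {
    const window = Math.floor(nowMs / 1000);
    if (window !== this.window) {
      this.window = window; // a new second starts with a fresh budget
      this.consumed = 0;
    }
    const cost = RU_COST[operation];
    if (this.consumed + cost > this.provisioned) {
      return false;
    }
    this.consumed += cost;
    return true;
  }
}

const budget = new RequestUnitBudget(20);
const outcomes = [
  budget.tryConsume("replace", 0),   // 10 of 20 RUs used
  budget.tryConsume("replace", 100), // 20 of 20 RUs used
  budget.tryConsume("read", 200),    // would exceed the budget: rate limited
  budget.tryConsume("read", 1000),   // next one-second window: admitted
];
console.log(outcomes); // [ true, true, false, true ]
```

A client that sees a rate-limited request would normally back off and retry once the window has passed.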

What are partitions?

[Diagram: a collection split into Partition 1, Partition 2, …, Partition i, …, Partition n.]

What are partitions?

Partition Key = city

[Diagram: documents grouped by city value (London, Paris, New York, Houston, Chicago, New Delhi, Mumbai, Boston, Berlin, …) and spread across Partition 1, Partition 2, ….]
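As a rough illustration of how a partition key value decides placement, the sketch below hashes the `city` value and takes it modulo the partition count. The hash function (a djb2-style string hash) and the helper names are assumptions chosen for this example; the service's actual hashing and routing are not described in the slides.

```javascript
// Simple deterministic string hash (djb2 variant). Illustrative only.
function hashString(s) {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = ((h * 33) ^ s.charCodeAt(i)) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return h;
}

// Maps a partition key value to a partition number in [0, partitionCount).
function partitionFor(partitionKeyValue, partitionCount) {
  return hashString(partitionKeyValue) % partitionCount;
}

// Groups documents into partitions by the value of their partition key property.
function groupByPartition(docs, partitionKeyName, partitionCount) {
  const partitions = Array.from({ length: partitionCount }, () => []);
  for (const doc of docs) {
    partitions[partitionFor(doc[partitionKeyName], partitionCount)].push(doc);
  }
  return partitions;
}

const cityDocs = [
  { id: "1", city: "London" },
  { id: "2", city: "Paris" },
  { id: "3", city: "London" },
  { id: "4", city: "Mumbai" },
];
const partitions = groupByPartition(cityDocs, "city", 2);
console.log(partitions.map((p) => p.map((d) => d.id)));
```

Whatever the hash, all documents with the same key value land in the same partition, which is why a query that filters on `city` can be answered by a single partition.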

Partitioning patterns: Writes should scale across partition keys

[Diagram: writes spread across Partition 1, Partition 2, ….]


Partitioning patterns: Reads should minimize cross-partition lookups

[Diagram: a read served from Partition 1, Partition 2, ….]

Recipe for Choosing Partition Key

• Start with the workload: is it read-heavy or write-heavy?

• Top Queries – Look for commonly filtered properties

• Transaction Boundary

• Avoid Storage + Performance Bottlenecks

• Aim for high cardinality… More partition key values = happiness

• Examples: Partition by TenantId or DeviceId… composite w/ Timestamp
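The last two bullets can be made concrete with a small sketch: a composite key built from a device id and a day bucket has more distinct values than the device id alone. The key format (`deviceId_YYYY-MM-DD`), the helper names, and the sample events are assumptions made for illustration.

```javascript
// Builds a composite partition key from a device id and the UTC day of the event.
function compositeKey(deviceId, timestamp) {
  const day = new Date(timestamp).toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return deviceId + "_" + day;
}

// Counts distinct partition key values (cardinality) for a candidate key function.
function countDistinctKeys(events, keyFn) {
  return new Set(events.map(keyFn)).size;
}

const events = [
  { deviceId: "device-1", timestamp: "2016-10-26T08:00:00Z" },
  { deviceId: "device-1", timestamp: "2016-10-27T09:30:00Z" },
  { deviceId: "device-2", timestamp: "2016-10-27T09:31:00Z" },
];

const byDevice = countDistinctKeys(events, (e) => e.deviceId);
const byDeviceAndDay = countDistinctKeys(events, (e) => compositeKey(e.deviceId, e.timestamp));
console.log(byDevice, byDeviceAndDay); // 2 3
```

The trade-off is on the read side: a query for one device across many days now spans several key values, so the composite form fits best when the top queries are already scoped to a time window.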

Let's talk about Query and Indexing

Query and Indexing Demo

DocumentDB: SQL and JavaScript queries

Document 1:
{
  "locations": [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}

Document 2:
{
  "locations": [{ "country": "Germany", "city": "Bonn", "revenue": 200 }],
  "headquarter": "Italy",
  "exports": [
    { "city": "Berlin", "dealers": [{ "name": "Hans" }] },
    { "city": "Athens" }
  ]
}

[Diagram: each document drawn as a tree. Property names (locations, headquarter, exports, country, city, revenue, dealers, name) and array indexes (0, 1) are interior nodes; values (Germany, Berlin, France, Paris, Belgium, Moscow, Athens, Bonn, 200, Italy, Hans) are leaves.]

Results:
{
  "results": [
    {
      "locations": [
        {"country": "Germany", "city": "Berlin"},
        {"country": "France", "city": "Paris"}
      ]
    }
  ]
}

SQL:
SELECT C.locations FROM company C WHERE C.headquarter = "Belgium"

JavaScript:
function businessLogic() {
  var country = "Belgium";
  __.filter(function(x) { return x.headquarter === country; });
}
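To see what the SQL query returns, the same filter and projection can be run over the two sample documents in plain JavaScript. This sketch runs in memory outside the service; `selectLocations` is a hypothetical helper and does not use the server-side `__` object from the slide.

```javascript
// The two sample company documents from the slides.
const companies = [
  {
    locations: [
      { country: "Germany", city: "Berlin" },
      { country: "France", city: "Paris" },
    ],
    headquarter: "Belgium",
    exports: [{ city: "Moscow" }, { city: "Athens" }],
  },
  {
    locations: [{ country: "Germany", city: "Bonn", revenue: 200 }],
    headquarter: "Italy",
    exports: [{ city: "Berlin", dealers: [{ name: "Hans" }] }, { city: "Athens" }],
  },
];

// In-memory equivalent of: SELECT C.locations FROM company C WHERE C.headquarter = <value>
function selectLocations(docs, headquarter) {
  return docs
    .filter((c) => c.headquarter === headquarter) // WHERE clause
    .map((c) => ({ locations: c.locations }));    // SELECT projection
}

const results = selectLocations(companies, "Belgium");
console.log(JSON.stringify(results));
// [{"locations":[{"country":"Germany","city":"Berlin"},{"country":"France","city":"Paris"}]}]
```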

Indexing under the hood

• Logically the index is a union of all the document trees
• Structure is contributed by the interior nodes; instance values are the leaves
• Structural information and instance values are normalized into a unifying concept of JSON-Path

[Diagram: common structure shared by the document trees]

Terms                    Postings List
$/location/0/            1, 2
location/0/country/      1, 2
location/0/city/         1, 2
0/country/Germany        1, 2
1/country/France         2
…                        …
0/city/Moscow            2
0/dealers/0              2
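The terms-to-postings idea can be sketched as a tiny inverted index over JSON paths. This is a simplification under stated assumptions: it indexes only full root-to-leaf paths (the slide's terms are shorter path segments), it uses plain arrays for postings rather than any compressed encoding, and `collectTerms` and `buildIndex` are hypothetical names.

```javascript
// Walks a JSON value and collects one term per leaf, of the form "$/path/to/leaf/value".
function collectTerms(value, path, terms) {
  if (Array.isArray(value)) {
    value.forEach((item, i) => collectTerms(item, path + "/" + i, terms));
  } else if (value !== null && typeof value === "object") {
    for (const key of Object.keys(value)) {
      collectTerms(value[key], path + "/" + key, terms);
    }
  } else {
    terms.push(path + "/" + String(value));
  }
  return terms;
}

// Builds a map from term to postings list (the ids of documents containing that term).
function buildIndex(docs) {
  const index = new Map();
  for (const { id, body } of docs) {
    for (const term of new Set(collectTerms(body, "$", []))) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(id);
    }
  }
  return index;
}

const index = buildIndex([
  {
    id: 1,
    body: {
      locations: [
        { country: "Germany", city: "Berlin" },
        { country: "France", city: "Paris" },
      ],
      headquarter: "Belgium",
    },
  },
  {
    id: 2,
    body: { locations: [{ country: "Germany", city: "Bonn" }], headquarter: "Italy" },
  },
]);

console.log(index.get("$/locations/0/country/Germany")); // [ 1, 2 ]
console.log(index.get("$/headquarter/Belgium"));         // [ 1 ]
```

An equality filter such as `headquarter = "Belgium"` then becomes a single lookup of one term, returning the ids of matching documents without scanning them.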

[Diagram: index tree traversals (location / 0 / country / Germany) for range and ORDER BY queries, wildcard queries, and spatial queries (coordinates).]

Dynamic encoding of postings list (E-WAH/differential)

Check out our VLDB paper here: http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf

Queries that use the index

• Equality: =
• Range: <, >, <=, >=
• ORDER BY
• String operators: STARTSWITH
• Spatial operators: ST_WITHIN and ST_DISTANCE
• Array operators: ARRAY_CONTAINS
• Schema operators: IS_DEFINED, IS_NUMBER, IS_STRING, …

Indexing Policies

Configuration                 Level            Options
Automatic                     Per collection   True (default) or False; override with each document write
Indexing Mode                 Per collection   Consistent, Lazy, and None (None for KV workloads)
Included and excluded paths   Per path         Individual path or recursive includes (? and *)
Indexing Type                 Per path         Supports Hash, Range, and Spatial
Indexing Precision            Per path         Supports 1 to 100 per path (and max); trades off storage, query RUs, and write RUs
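As an illustration of how the options in the table combine, here is a sketch of an indexing policy written as a JavaScript object. The property names follow the indexing policy JSON as I understand it for DocumentDB at the time; the specific paths (`/headquarter/?`, `/exports/*`) and precision values are assumptions picked for the sample company documents, so check them against the current service documentation before use.

```javascript
// Sketch of a collection indexing policy (values are illustrative assumptions).
const indexingPolicy = {
  indexingMode: "consistent", // Consistent, Lazy, or None
  automatic: true,            // index every document unless a write overrides it
  includedPaths: [
    {
      // "/*" recursively includes everything under the root.
      path: "/*",
      indexes: [
        { kind: "Range", dataType: "Number", precision: -1 }, // -1 requests maximum precision
        { kind: "Hash", dataType: "String", precision: 3 },   // lower precision trades query RUs for storage
      ],
    },
    {
      // "/?" targets the scalar value at exactly this path.
      path: "/headquarter/?",
      indexes: [{ kind: "Range", dataType: "String", precision: -1 }], // allows ORDER BY and range on strings
    },
  ],
  // Paths that are never queried can be excluded to save write RUs and storage.
  excludedPaths: [{ path: "/exports/*" }],
};

console.log(JSON.stringify(indexingPolicy, null, 2));
```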

Let’s talk about Planet-Scale

Guaranteed low latency

“I want my data wherever my users are.”

Guaranteed high availability

Globally. With policy based failover.

99.99%

Multi-region DocumentDB databases

[Diagram: a DocumentDB collection distributed across US-East, US-West, and India. Partition sets span regions (global distribution); within a region, each partition is backed by a replica set (local distribution).]

[Diagram: a DocumentDB collection provisioned with 2M RUs, served by one set of primary replica sets and two sets of secondary replica sets, each with 2M RUs.]

Total RUs = Provisioned RUs x Number of regions

In this example: 2M RUs x 3 regions = 6M RUs
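The arithmetic above as a one-line function, for plugging in other values. The function name is a hypothetical one chosen for this sketch.

```javascript
// Total RUs = provisioned RUs x number of regions.
function totalRequestUnits(provisionedRus, regionCount) {
  return provisionedRus * regionCount;
}

console.log(totalRequestUnits(2000000, 3)); // 6000000
```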

Programmable data consistency

“It’s hard to write distributed apps.”

Strong consistency, High latency

Eventual consistency, Low latency

Consistency Levels

• PACELC theorem and the associated tradeoffs
• Strong, Eventual, Bounded Staleness, and Session

Strong, Bounded Staleness, Session, Eventual
Left to right: weaker consistency, better read scalability, lower write latency

[Diagram: for each level, a client reading from and writing to a primary (P) and two secondaries (S).]

Bounded Staleness:
• Consistent prefix reads
• Reads lag behind writes by K prefixes or T interval

Session:
• Monotonic reads, monotonic writes, and read-your-writes guarantees
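A toy model can make the difference between the levels tangible: given how far a replica has caught up, decide whether it may serve a read at each level. Everything here is an assumption made for illustration (version counters instead of real replication state, a single `maxLagK` bound, the level names as strings); it is not how the service implements consistency.

```javascript
// Decides whether a replica may serve a read under a given consistency level.
//   replicaVersion: latest write version applied on the replica
//   primaryVersion: latest write version committed on the primary
//   sessionVersion: latest write version this client session has produced or observed
//   maxLagK:        staleness bound, in number of versions, for bounded staleness
function canServeRead(level, replicaVersion, primaryVersion, sessionVersion, maxLagK) {
  switch (level) {
    case "strong":
      return replicaVersion === primaryVersion;           // must reflect every committed write
    case "boundedStaleness":
      return primaryVersion - replicaVersion <= maxLagK;  // may lag by at most K versions
    case "session":
      return replicaVersion >= sessionVersion;            // read-your-writes within the session
    case "eventual":
      return true;                                        // any replica, no ordering bound
    default:
      throw new Error("unknown consistency level: " + level);
  }
}

// A replica that has applied 8 of 10 writes; the client itself wrote version 9.
const levels = ["strong", "boundedStaleness", "session", "eventual"];
const decisions = levels.map((level) => canServeRead(level, 8, 10, 9, 5));
console.log(decisions); // [ false, true, false, true ]
```

The same replica is acceptable for more readers as the level weakens, which is the read-scalability gain the slide refers to.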

Global Distribution Demo

DocumentDB Recent Updates

• Automatic Expiration via Time-To-Live (TTL)

• Expanded Geo-Spatial support for Polygons and Lines

• Preview support for:
  • Local Emulator
  • IP Filtering
  • Self-Service Backup + Restore
  • Protocol Support for MongoDB
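For the Time-To-Live item in the list above, the expiry rule can be sketched as a small predicate. This assumes the behavior as I understand it: a collection-level default TTL in seconds, an optional per-document `ttl` that overrides it, `-1` meaning "never expire", and `_ts` holding the last-modified time in epoch seconds. `isExpired` is a hypothetical helper; the service performs expiry itself.

```javascript
// Returns true when a document should be treated as expired.
//   defaultTtl: collection default in seconds; null or undefined means TTL is off for the collection
//   nowSeconds: current time in epoch seconds
function isExpired(doc, defaultTtl, nowSeconds) {
  if (defaultTtl === null || defaultTtl === undefined) {
    return false; // TTL disabled on the collection: per-document ttl values are ignored
  }
  const ttl = doc.ttl !== undefined ? doc.ttl : defaultTtl; // document value overrides the default
  if (ttl === -1) {
    return false; // -1 means never expire
  }
  return nowSeconds - doc._ts >= ttl; // expired once ttl seconds have passed since last modification
}

console.log(isExpired({ _ts: 1000 }, 60, 1059));          // false
console.log(isExpired({ _ts: 1000 }, 60, 1060));          // true
console.log(isExpired({ _ts: 1000, ttl: -1 }, 60, 5000)); // false
```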

Q&A and more resources…

AskDocDB@microsoft

Follow @DocumentDB

Use #DocumentDB

documentdb.com

#azure-documentDB

Session Evaluations

Your feedback is important and valuable. 3 ways to access:

• Go to passSummit.com
• Download the GuideBook App and search: PASS Summit 2016
• Follow the QR code link displayed on session signage throughout the conference venue and in the program guide

Submit by 5pm Friday November 6th to WIN prizes

Thank You

Learn more from Azure [email protected] or follow @DocumentDB

Data & Analytics
