Spark to DocumentDB connector

Denny Lee, Principal Program Manager, Azure DocumentDB

Denny Lee

• Principal Program Manager for Azure DocumentDB

• 20+ years of experience in databases, distributed systems, data sciences, and software development at Microsoft, Concur, and Databricks

• Notable Projects:

  • Project Isotope: Incubation team for HDInsight

  • Yahoo! 24TB cube: Largest SSAS cube in production

@dennylee

A Brief Overview...

Elastically Scalable Throughput + Storage

Guaranteed low latency

Reads <10ms @ P99

Writes <15ms @ P99

Globally Distributed

Speaks your language

DocumentDB

REST over HTTPS/TCP

MongoDB wire protocol

drivers for MongoDB

Java .NET

Java .NET Ruby

…

Aggregations

Demo

Running Aggregations from Portal

Supports SUM, COUNT, MIN, MAX, AVG

Working on DISTINCT and GROUP BY
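As a rough illustration of the aggregates listed above, the sketch below builds DocumentDB SQL query strings in Python. The helper name `aggregate_query`, the collection alias `c`, and the `delay` and `origin` fields are assumptions made for this example; only the list of supported aggregate functions comes from the slide.

```python
# Aggregates the slide lists as supported; DISTINCT and GROUP BY are excluded
# because the slide says they are still being worked on.
SUPPORTED = ("SUM", "COUNT", "MIN", "MAX", "AVG")


def aggregate_query(func, field=None, where=None):
    """Build a DocumentDB SQL aggregate query string (hypothetical helper)."""
    func = func.upper()
    if func not in SUPPORTED:
        raise ValueError("unsupported aggregate: " + func)
    if field is None:
        if func != "COUNT":
            raise ValueError(func + " needs a field")
        target = "1"  # COUNT(1) counts documents
    else:
        target = "c." + field
    query = "SELECT VALUE {}({}) FROM c".format(func, target)
    if where:
        query += " WHERE " + where
    return query


print(aggregate_query("count"))
print(aggregate_query("avg", "delay", 'c.origin = "SEA"'))
```

The resulting strings could be pasted into the portal's query explorer or passed to a client library; the sketch itself does not contact any service.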

Data Sciences: Apache Spark + DocumentDB

Demo

Notebook View: https://aka.ms/docdb-spark-graph

py View: https://aka.ms/pydocdb-spark-graph

Code: https://aka.ms/docdb-spark-graph-code

Advantages: Data Science Scenarios

• Distributed Aggregations and Analytics

• Blazing Fast IoT Scenarios

• Updateable columns

• Push-down predicate filtering

Advantages: Distributed Aggregations and Analytics

Advantages: Blazing Fast IoT Scenarios

Flight information

global safety alerts

weather

Data Science Scenarios

Device Notifications

Web / REST API

Advantages: Updateable Columns

Flight information

Data Science Scenarios

Device Notifications

Web / REST API

{ tripid: “100100”, delay: -5, time: “01:00:01”}

{ tripid: “100100”, delay: -30, time: “01:00:01”}

{delay:-30}

{delay:-30}

{delay:-30}
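The update shown above (the delay for trip 100100 changing from -5 to -30, and every consumer then reading -30) can be sketched with a plain Python dictionary standing in for the document store. The `store`, `update_delay`, and `read_delay` names are hypothetical and exist only for this illustration; no DocumentDB API is used.

```python
# In-memory stand-in for a DocumentDB collection, keyed by tripid.
store = {
    "100100": {"tripid": "100100", "delay": -5, "time": "01:00:01"},
}


def update_delay(store, tripid, delay):
    """Update a single field in place; the rest of the document is untouched."""
    store[tripid]["delay"] = delay
    return store[tripid]


def read_delay(store, tripid):
    """What a downstream consumer sees when it reads the delay field."""
    return {"delay": store[tripid]["delay"]}


update_delay(store, "100100", -30)

# The three consumers named on the slide all read the same updated value.
consumers = ["Data Science Scenarios", "Device Notifications", "Web / REST API"]
views = {name: read_delay(store, "100100") for name in consumers}
for name, view in views.items():
    print(name, view)
```

The point of the sketch is that one write to one field is immediately visible to every reader, without rewriting or re-ingesting the whole record.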

Advantages: Pushdown Predicate Filtering

Data Science Scenarios

{city:SEA}

[Figure: DocumentDB index tree with paths such as locations, headquarter, and exports, and leaf values for country (Germany, France, Belgium) and city (Seattle, Paris, Moscow, Athens).]

Documents matching the filter:

{city:SEA, dst: POR, ...}, {city:SEA, dst: JFK, ...}, {city:SEA, dst: SFO, ...}, {city:SEA, dst: YVR, ...}, {city:SEA, dst: YUL, ...}, ...
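To make the benefit of pushing the {city:SEA} filter down to the data store concrete, here is a minimal simulation in plain Python. The partition contents and city codes other than SEA are made-up sample data, and both function names are hypothetical; the sketch only counts how many documents would have to leave the store in each case.

```python
# Made-up sample data: three partitions of flight documents.
PARTITIONS = [
    [{"city": "SEA", "dst": "POR"}, {"city": "PAR", "dst": "JFK"}],
    [{"city": "SEA", "dst": "JFK"}, {"city": "ATH", "dst": "SFO"}],
    [{"city": "SEA", "dst": "SFO"}, {"city": "MOW", "dst": "YVR"}],
]


def scan_without_pushdown(partitions, city):
    """Ship every document to the client, then filter on the client."""
    transferred = [doc for part in partitions for doc in part]
    matches = [doc for doc in transferred if doc["city"] == city]
    return matches, len(transferred)


def scan_with_pushdown(partitions, city):
    """Filter inside each partition, so only matching documents are shipped."""
    transferred = [
        doc for part in partitions for doc in part if doc["city"] == city
    ]
    return transferred, len(transferred)


plain, plain_cost = scan_without_pushdown(PARTITIONS, "SEA")
pushed, pushed_cost = scan_with_pushdown(PARTITIONS, "SEA")
print("same result:", plain == pushed)
print("documents transferred:", plain_cost, "vs", pushed_cost)
```

Both scans return the same documents; the difference is only in how much data crosses the network, which is where an index-backed filter on the store side pays off.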

pyDocumentDB

[Figure: DocumentDB gateway node and data nodes connected to the Spark master node and worker nodes, with steps 1 to 3 marked.]

1. Connection is between the Spark master node and the DocumentDB gateway node.

2. Query is submitted from the DocumentDB gateway node to the data nodes. Results are sent back to the gateway node and then transmitted back to the Spark master node.

3. The Spark master node converts the dictionary to a DataFrame, which is then distributed out to the worker nodes.
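The pyDocumentDB flow described above can be mimicked with a small in-memory model. Everything here is a stand-in: `DATA_NODES`, `gateway_query`, and `master_collect_and_distribute` are hypothetical names, and no pyDocumentDB or Spark calls are made. The sketch only shows that every row passes through the single gateway-to-master connection before any worker sees it.

```python
# Hypothetical stand-in for DocumentDB data nodes and their documents.
DATA_NODES = {
    "p0": [{"city": "SEA", "dst": "POR"}, {"city": "SEA", "dst": "JFK"}],
    "p1": [{"city": "SEA", "dst": "SFO"}],
    "p2": [{"city": "SEA", "dst": "YVR"}, {"city": "SEA", "dst": "YUL"}],
}


def gateway_query(data_nodes):
    """Step 2: the gateway queries each data node and gathers all results."""
    results = []
    for docs in data_nodes.values():
        results.extend(docs)
    return results


def master_collect_and_distribute(data_nodes, num_workers):
    # Step 1: the master holds the only connection, and it is to the gateway.
    rows = gateway_query(data_nodes)
    # Step 3: the master splits the collected rows across the workers
    # (round-robin here, standing in for DataFrame partitioning).
    workers = [rows[i::num_workers] for i in range(num_workers)]
    return rows, workers


rows, workers = master_collect_and_distribute(DATA_NODES, num_workers=2)
print("rows through the master:", len(rows))
print("rows per worker:", [len(w) for w in workers])
```

Because the master has to hold the full result before distributing it, this pattern suits small result sets and becomes a bottleneck as the row count grows.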

Spark-DocumentDB Connector (Java)

[Figure: DocumentDB gateway node and data nodes connected to the Spark master node and worker nodes, with steps 1 to 4 marked.]

1. Connection is between the Spark master node and the DocumentDB gateway node.

2. Map data is transmitted back from the DocumentDB gateway node to the Spark master node.

3. Query is submitted from the Spark worker nodes to the DocumentDB data nodes.

4. The data is transmitted back to the Spark worker nodes for further processing.
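The connector flow described above can be sketched the same way, as an in-memory model rather than real connector code. `DATA_NODES`, `gateway_partition_map`, `worker_read`, and `connector_read` are hypothetical names chosen for this illustration. The difference from a gateway-only read is that the master receives only the partition map, and each worker pulls its own partitions directly.

```python
# Hypothetical stand-in for DocumentDB data nodes and their documents.
DATA_NODES = {
    "p0": [{"city": "SEA", "dst": "POR"}, {"city": "SEA", "dst": "JFK"}],
    "p1": [{"city": "SEA", "dst": "SFO"}],
    "p2": [{"city": "SEA", "dst": "YVR"}, {"city": "SEA", "dst": "YUL"}],
}


def gateway_partition_map(data_nodes):
    """Steps 1-2: the master asks the gateway which partitions exist."""
    return sorted(data_nodes)


def worker_read(data_nodes, partition_ids):
    """Steps 3-4: a worker queries its assigned data nodes directly."""
    rows = []
    for pid in partition_ids:
        rows.extend(data_nodes[pid])
    return rows


def connector_read(data_nodes, num_workers):
    partition_map = gateway_partition_map(data_nodes)
    # The master only hands out partition ids; no documents pass through it.
    assignments = [partition_map[i::num_workers] for i in range(num_workers)]
    return assignments, [worker_read(data_nodes, a) for a in assignments]


assignments, per_worker = connector_read(DATA_NODES, num_workers=2)
print("partition assignments:", assignments)
print("rows per worker:", [len(rows) for rows in per_worker])
```

In a real cluster the worker reads would run in parallel, so read throughput can grow with the number of workers and partitions instead of being capped by one connection.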

Query Test Results

Query                            pyDocumentDB     Azure-DocumentDB-Spark
LIMIT 100                        0:00:00.774820   00:00:01.286
All Seattle flights (23K rows)   0:00:05.146107   00:00:01.582
All flights (~1.39M rows)        0:02:36.335267   00:00:08.899

More info at: https://github.com/Azure/azure-documentdb-spark/wiki/Query-Test-Runs
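To read the timings above as relative speedups, the short script below converts each h:mm:ss value to seconds and divides the pyDocumentDB time by the connector time. The timings are copied from the results above; the `to_seconds` helper and the dictionary layout are just one way to organize the arithmetic.

```python
def to_seconds(stamp):
    """Convert an 'h:mm:ss.fff' string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (pyDocumentDB, Azure-DocumentDB-Spark) timings from the test results.
RESULTS = {
    "LIMIT 100": ("0:00:00.774820", "00:00:01.286"),
    "All Seattle flights (23K rows)": ("0:00:05.146107", "00:00:01.582"),
    "All flights (~1.39M rows)": ("0:02:36.335267", "00:00:08.899"),
}

# Ratio > 1 means the connector was faster than pyDocumentDB.
speedups = {
    query: to_seconds(py) / to_seconds(connector)
    for query, (py, connector) in RESULTS.items()
}

for query, ratio in speedups.items():
    print("{}: {:.1f}x".format(query, ratio))
```

The ratios come out at roughly 0.6x, 3.3x, and 17.6x: pyDocumentDB is quicker for the tiny LIMIT 100 query, while the connector's advantage widens as the result set grows.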

Query Test Results

Issue #   Issue Description
7         Improve push down predicates (e.g. take advantage of TOP/LIMIT, aggregations, etc.)
6         Schema-less query bug
5         Optimize computation push to partitions
3         Add Python wrapper / examples
2         Add Azure-DocumentDB-Spark connector as Spark package

More info at: https://github.com/Azure/azure-documentdb-spark/issues

Asks

Go to https://github.com/Azure/azure-documentdb-spark/ and try it out!

References:

• Real-time machine learning on globally-distributed data with Apache Spark and DocumentDB: https://azure.microsoft.com/en-us/blog/real-time-machine-learning-on-globally-distributed-data-with-apache-spark-and-documentdb/

• Accelerate real-time big-data analytics with the Spark to DocumentDB connector: https://aka.ms/spark-documentdb

Any questions?

• We’re on StackOverflow #azure-documentdb

• Email askdocdb@ or denny.lee@

Data Sciences: Apache Spark + DocumentDB

Example: Graph Structures

Graph Calculations: Degrees, PageRank

What is the most important airport (most flights in / out)?

from pyspark.sql.functions import desc

# Top 10 airports by number of incoming flights
tripGraph.inDegrees\
    .sort(desc("inDegree"))\
    .limit(10)

Classic Graph Scenario: Flights

vertices = airports

edges = flights
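For readers without a Spark cluster at hand, the degree calculation on a flights graph can be approximated in plain Python. This is not the GraphFrames code from the notebook: the `flights` edge list is made-up sample data, and `collections.Counter` stands in for the distributed inDegrees computation.

```python
from collections import Counter

# Made-up sample edges: (origin airport, destination airport).
flights = [
    ("SEA", "SFO"),
    ("SEA", "JFK"),
    ("SFO", "SEA"),
    ("JFK", "SEA"),
    ("POR", "SEA"),
    ("SEA", "POR"),
]

# In-degree: flights arriving at each airport (vertex).
in_degrees = Counter(dst for _, dst in flights)
# Out-degree: flights departing from each airport.
out_degrees = Counter(src for src, _ in flights)

# Airport with the most incoming flights, analogous to sorting by inDegree.
top_airport, top_in = in_degrees.most_common(1)[0]
print(top_airport, top_in, out_degrees[top_airport])
```

The same counting idea is what the graph library distributes across the cluster; PageRank goes further by weighting each incoming edge by the importance of the airport it comes from.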
