Spark to DocumentDB Connector Denny Lee, Principal Program Manager, Azure DocumentDB

Page 1: Spark to DocumentDB connector

Spark to DocumentDB Connector

Denny Lee, Principal Program Manager, Azure DocumentDB

Page 2: Spark to DocumentDB connector

Denny Lee

• Principal Program Manager for Azure DocumentDB

• 20+ years of experience in databases, distributed systems, data sciences, and software development at Microsoft, Concur, and Databricks

• Notable Projects:

  • Project Isotope: Incubation team for HDInsight

  • Yahoo! 24TB cube: Largest SSAS cube in production

@dennylee

Page 3: Spark to DocumentDB connector

A Brief Overview...

Page 4: Spark to DocumentDB connector

Elastically Scalable Throughput + Storage

Page 5: Spark to DocumentDB connector

Guaranteed low latency

Reads <10ms @ P99

Writes <15ms @ P99

Page 6: Spark to DocumentDB connector

Globally Distributed

Page 7: Spark to DocumentDB connector

Speaks your language

Page 8: Spark to DocumentDB connector

DocumentDB

REST over HTTPS/TCP

MongoDB wire protocol

drivers for MongoDB

Java, .NET

Java, .NET, Ruby

Page 9: Spark to DocumentDB connector

Aggregations

Demo

Page 10: Spark to DocumentDB connector

Running Aggregations from Portal

Supports SUM, COUNT, MIN, MAX, AVG

Working on DISTINCT and GROUP BY
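
The aggregates listed above are issued as ordinary DocumentDB SQL queries. As a rough sketch of what such queries might look like, the snippet below builds them as strings. The helper `aggregate_query`, the collection alias `c`, and the field names `origin` and `delay` are hypothetical and do not come from the talk.

```python
# Hypothetical helper: builds a DocumentDB SQL aggregate query string.
# Only the five aggregates named on the slide are accepted.
SUPPORTED = ("SUM", "COUNT", "MIN", "MAX", "AVG")


def aggregate_query(func, expr, where=None):
    """Return a query of the form SELECT VALUE FUNC(expr) FROM c [WHERE ...]."""
    func = func.upper()
    if func not in SUPPORTED:
        raise ValueError("unsupported aggregate: " + func)
    query = "SELECT VALUE {0}({1}) FROM c".format(func, expr)
    if where:
        query += " WHERE " + where
    return query


# Assumed field names, for illustration only.
print(aggregate_query("count", "1", "c.origin = 'SEA'"))
print(aggregate_query("avg", "c.delay"))
```

DISTINCT and GROUP BY are rejected by the helper on purpose, since the slide lists them as still in progress.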

Page 11: Spark to DocumentDB connector

Data Sciences: Apache Spark + DocumentDB

Demo

Notebook View: https://aka.ms/docdb-spark-graph

pyView: https://aka.ms/pydocdb-spark-graph

Code: https://aka.ms/docdb-spark-graph-code

Page 12: Spark to DocumentDB connector

Advantages

Data Science Scenarios

• Distributed Aggregations and Analytics

• Blazing Fast IoT Scenarios

• Updateable columns

• Push-down predicate filtering

Page 13: Spark to DocumentDB connector

Advantages: Distributed Aggregations and Analytics

Page 14: Spark to DocumentDB connector

Advantages: Blazing Fast IoT Scenarios

Flight information

global safety alerts

weather

Data Science Scenarios

Device Notifications

Web / REST API

Page 15: Spark to DocumentDB connector

Advantages: Updateable Columns

Flight information

Data Science Scenarios

Device Notifications

Web / REST API

{ tripid: "100100", delay: -5, time: "01:00:01" }

{ tripid: "100100", delay: -30, time: "01:00:01" }

{delay:-30}

{delay:-30}

{delay:-30}
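
The update shown above (the same trip, with `delay` changing from -5 to -30) can be sketched with a plain dictionary standing in for the DocumentDB collection. This is only an in-memory illustration: the `flights` store and the `update_delay` helper are hypothetical, and no DocumentDB call is made.

```python
# In-memory stand-in for a DocumentDB collection, keyed by tripid.
flights = {
    "100100": {"tripid": "100100", "delay": -5, "time": "01:00:01"},
}


def update_delay(store, tripid, delay):
    """Change one field of an existing document in place."""
    store[tripid]["delay"] = delay
    return store[tripid]


# The three consumers named on the slide all read the same document.
consumers = ["Data Science Scenarios", "Device Notifications", "Web / REST API"]

update_delay(flights, "100100", -30)
seen = {name: flights["100100"]["delay"] for name in consumers}
print(seen)
```

Because the document is updated rather than appended, every reader sees the new delay without reloading the whole data set.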

Page 16: Spark to DocumentDB connector

Advantages: Push-down Predicate Filtering

Data Science Scenarios

{city:SEA}

[Figure: DocumentDB index tree with nodes locations, headquarter, exports; country (Germany, France, Belgium); city (Seattle, Paris, Moscow, Athens)]

{city:SEA, dst: POR, ...}, {city:SEA, dst: JFK, ...}, {city:SEA, dst: SFO, ...}, {city:SEA, dst: YVR, ...}, {city:SEA, dst: YUL, ...}, ...
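
The effect of pushing the `{city:SEA}` predicate down to the source can be sketched by counting how many documents cross the wire in each case. The sample documents and the `fetch` helper below are hypothetical; a real collection would answer the filter from its index rather than by scanning a list.

```python
# Toy flight documents standing in for a DocumentDB collection.
FLIGHTS = [
    {"city": "SEA", "dst": "POR"},
    {"city": "SEA", "dst": "JFK"},
    {"city": "SFO", "dst": "SEA"},
    {"city": "JFK", "dst": "YUL"},
]


def is_sea(doc):
    """The predicate from the slide: city equals SEA."""
    return doc["city"] == "SEA"


def fetch(docs, predicate=None):
    """Return (documents sent to the client, number of documents transferred)."""
    sent = [d for d in docs if predicate is None or predicate(d)]
    return sent, len(sent)


# Without push-down: transfer everything, then filter on the client.
all_docs, moved_without = fetch(FLIGHTS)
client_filtered = [d for d in all_docs if is_sea(d)]

# With push-down: the source applies the predicate before sending.
pushed, moved_with = fetch(FLIGHTS, is_sea)

print(moved_without, moved_with)
```

Both paths yield the same rows; only the amount of data transferred differs.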

Page 17: Spark to DocumentDB connector

[Figure: DocumentDB gateway node and data nodes connected through pyDocumentDB to the Spark master node and worker nodes, with steps numbered 1 to 3]

pyDocumentDB

1. Connection is between the Spark master node and the DocumentDB gateway node.

2. The query is submitted from the DocumentDB gateway node to the data nodes. Results are sent back to the gateway node and then transmitted back to the Spark master node.

3. The Spark master node converts the dictionary to a DataFrame and distributes it out to the worker nodes.
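
The three steps above can be mimicked with plain Python lists to show where the bottleneck sits: every row passes through a single node before the workers see it. This is a toy simulation, not pyDocumentDB code; `DATA_NODES`, `gateway_query`, and `distribute` are hypothetical names.

```python
# Toy data nodes: each holds one partition of the collection.
DATA_NODES = [
    [{"id": 1}, {"id": 2}],
    [{"id": 3}],
    [{"id": 4}, {"id": 5}],
]


def gateway_query(data_nodes):
    """Step 2: the gateway fans the query out and merges all results."""
    return [doc for node in data_nodes for doc in node]


def distribute(rows, num_workers):
    """Step 3: the master splits the collected rows across the workers."""
    return [rows[i::num_workers] for i in range(num_workers)]


# Everything is collected on one node first (steps 1 and 2) ...
rows_on_master = gateway_query(DATA_NODES)
# ... and only then handed out to the workers (step 3).
worker_slices = distribute(rows_on_master, num_workers=2)

print(len(rows_on_master), [len(s) for s in worker_slices])
```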

Page 18: Spark to DocumentDB connector

[Figure: DocumentDB gateway node and data nodes connected through the Spark-DocumentDB Connector (Java) to the Spark master node and worker nodes, with steps numbered 1 to 4]

Spark to DocumentDB Connector

1. Connection is between the Spark master node and the DocumentDB gateway node.

2. Map data is transmitted back to the Spark master node.

3. The query is submitted from the Spark worker nodes to the DocumentDB data nodes.

4. The data is transmitted back to the Spark worker nodes for further processing.
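
The connector's flow differs in that the workers read from the data nodes directly once the map is known. A minimal sketch of that idea follows, using threads as stand-ins for Spark workers. It is not the connector's actual code; `PARTITIONS`, `get_partition_map`, and `read_partition` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy partitions keyed by a hypothetical partition id.
PARTITIONS = {
    "p0": [{"id": 1}, {"id": 2}],
    "p1": [{"id": 3}],
    "p2": [{"id": 4}, {"id": 5}],
}


def get_partition_map():
    """Steps 1-2: the master asks the gateway which partitions exist."""
    return sorted(PARTITIONS)


def read_partition(partition_id):
    """Steps 3-4: a worker reads its partition directly from a data node."""
    return list(PARTITIONS[partition_id])


partition_map = get_partition_map()

# One thread per partition stands in for one Spark worker task per partition.
with ThreadPoolExecutor(max_workers=len(partition_map)) as workers:
    per_worker = list(workers.map(read_partition, partition_map))

total_rows = sum(len(rows) for rows in per_worker)
print(partition_map, total_rows)
```

No single node ever holds the full result here, which is the point of the design: the rows stay distributed from the moment they are read.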

Page 19: Spark to DocumentDB connector

Query Test Results

Query                            pyDocumentDB      Azure-DocumentDB-Spark
LIMIT 100                        0:00:00.774820    00:00:01.286
All Seattle flights (23K rows)   0:00:05.146107    00:00:01.582
All flights (~1.39M rows)        0:02:36.335267    00:00:08.899

More info at: https://github.com/Azure/azure-documentdb-spark/wiki/Query-Test-Runs
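
To make the timings above easier to compare, the sketch below converts each one to seconds and divides the pyDocumentDB time by the Azure-DocumentDB-Spark time. The timings are copied from the results above; the `to_seconds` helper is only an illustration.

```python
def to_seconds(stamp):
    """Convert an H:MM:SS.fraction string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (query, pyDocumentDB time, Azure-DocumentDB-Spark time)
RESULTS = [
    ("LIMIT 100", "0:00:00.774820", "00:00:01.286"),
    ("All Seattle flights (23K rows)", "0:00:05.146107", "00:00:01.582"),
    ("All flights (~1.39M rows)", "0:02:36.335267", "00:00:08.899"),
]

ratios = {}
for query, py_time, connector_time in RESULTS:
    # A ratio above 1 means the connector was faster.
    ratios[query] = to_seconds(py_time) / to_seconds(connector_time)
    print("{0}: {1:.1f}x".format(query, ratios[query]))
```

On these numbers the connector is slower for the small LIMIT 100 query (ratio below 1) and roughly 3x and 17x faster for the two larger queries.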

Page 20: Spark to DocumentDB connector

Query Test Results

Issue #   Issue Description
7         Improve push down predicates (e.g. take advantage of TOP/LIMIT, aggregations, etc.)
6         Schema-less query bug
5         Optimize computation push to partitions
3         Add Python wrapper / examples
2         Add Azure-DocumentDB-Spark connector as Spark package

More info at: https://github.com/Azure/azure-documentdb-spark/issues

Page 21: Spark to DocumentDB connector

Asks

Go to https://github.com/Azure/azure-documentdb-spark/ and try it out!

References:

• Real-time machine learning on globally-distributed data with Apache Spark and DocumentDB

• Accelerate real-time big-data analytics with the Spark to DocumentDB connector

Any questions?

• We’re on StackOverflow #azure-documentdb

• Email askdocdb@ or denny.lee@

Page 22: Spark to DocumentDB connector

Data Sciences: Apache Spark + DocumentDB

Page 23: Spark to DocumentDB connector

Example: Graph Structures

Page 24: Spark to DocumentDB connector

Example: Graph Structures

Page 25: Spark to DocumentDB connector

Graph Calculations: Degrees, PageRank

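What is the most important airport (most flights in / out)?

from pyspark.sql.functions import desc

# Top 10 airports by number of inbound flights.
# display() is the Databricks notebook helper; it is assumed here
# because the original snippet ends with an unmatched closing parenthesis.
display(tripGraph.inDegrees\
    .sort(desc("inDegree"))\
    .limit(10))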

Page 26: Spark to DocumentDB connector

Classic Graph Scenario: Flights

vertex = airports

edges = flights
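
The mapping above (airports as vertices, flights as edges) and the in-degree ranking used earlier can be illustrated without Spark. The sketch below uses a small, made-up edge list and the standard library only. It is not GraphFrames code, and the sample airports and flights are hypothetical.

```python
from collections import Counter

# Hypothetical sample data: vertices are airports, edges are flights.
airports = ["SEA", "SFO", "JFK", "POR"]
flights = [  # (src, dst)
    ("SEA", "SFO"),
    ("POR", "SFO"),
    ("JFK", "SEA"),
    ("SEA", "JFK"),
    ("JFK", "SFO"),
]

# In-degree of an airport = number of flights arriving there.
in_degrees = Counter(dst for _, dst in flights)

# Busiest airport by inbound flights, analogous to sorting inDegrees descending.
top = in_degrees.most_common(1)
print(top)
```

Airports with no inbound flights (POR in this sample) simply do not appear in the counter.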