Solving low latency query over big data with Spark SQL

Julien Pierre

• Intro
• Project’s mission
• Use case
• Learnings
• Implementation details
• The future of Spark in ASG

Agenda

• Senior Product Manager
• ASG Shared Data team
• Beijing
• Working on:
  • Spark
  • Driving the Microsoft internal Spark community
• Worked on:
  • Office 365 Admin
  • Office 365 Customer Facing Reporting
  • Open sourced the Office 365 Splunk App

Who am I?

Julien Pierre 朱力安

What we do

Client Data Fluency

Office

Skype

Bing

Modern Data Capability

Instrumentation & Ingestion

Processing & Storage

Reporting & Analytics

Information Management

Mobile-First Analytics Experience

Experimentation

Deliver interactive analytics capabilities for ASG teams to build great products

• Increase analyst and engineer productivity
• Capitalize on our data
• Leverage open technology to run a stable service

Project mission

Use Case

Where does Spark SQL help us?

Data Size

Query Latency

Kuntao Yu, engineer

Customer found an issue with the data: “Why is the Created Date field = 0001/01/01?”

Is it by design or a bug?

71.2 TB of Office Data in Spark

Data issue troubleshooting use case

What did Kuntao have to do?

Writing & compiling query

Submitting & waiting in queue

Job runtime

Get results inline in Zeppelin

Check data in Excel

Need to open the results in Excel

1. Check affected records
2. Find root cause
3. Assert findings
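
Step 1 above (check affected records) comes down to counting rows that carry the suspicious value. A minimal sketch, using Python's built-in sqlite3 as a local stand-in for Spark SQL. The table name office_events, the column names, and the sample rows are hypothetical and only illustrate the shape of such a query:

```python
import sqlite3

# Hypothetical sample rows; the real data set is 71.2 TB and lives in Spark.
ROWS = [
    ("doc-1", "2015/03/02"),
    ("doc-2", "0001/01/01"),  # the value the customer asked about
    ("doc-3", "2015/04/17"),
    ("doc-4", "0001/01/01"),
]

SUSPECT_DATE = "0001/01/01"


def count_affected(rows, suspect=SUSPECT_DATE):
    """Return (affected, total) row counts for the suspect created date."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE office_events (id TEXT, created_date TEXT)")
        conn.executemany("INSERT INTO office_events VALUES (?, ?)", rows)
        affected, total = conn.execute(
            "SELECT SUM(CASE WHEN created_date = ? THEN 1 ELSE 0 END), COUNT(*) "
            "FROM office_events",
            (suspect,),
        ).fetchone()
        return affected, total
    finally:
        conn.close()


if __name__ == "__main__":
    affected, total = count_affected(ROWS)
    print(f"{affected} of {total} records have created_date = {SUSPECT_DATE}")
```

The same SELECT could be typed into a notebook paragraph against the real table; only the connection differs.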

Elapsed time

[Chart: elapsed time for Cosmos, SparkSQL, and SparkSQL with Cache, each broken down into Write and Compile Query, Submit and Wait in Job Queue, and Job Run Time; axis range 0 to 200]

What did we learn in the process?

• Enable self-service
• Minimize time to insight
• Notebooks at scale

Learnings

How did we do it?

Architecture

Mesos Cluster/HDFS

Job Manager

ZooKeeper

Job Frontend Web API

Spark Driver Host Pool

Spark Hive Thrift Server

Zeppelin Server

Avocado (Hive Query + Schedule Task)

Rover (Drag & Drop BI tool with Hive Code Gen)

Zeppelin Web UI

MetastoreDB

Hive Loader

Cosmos Storage

Open Source

Microsoft

Cosmos -> HDFS

[Diagram: three stages, each shown over Partition 1 ... Partition n. Stage 1: Export Cosmos Partition. Stage 2: parallel tasks (Task 2 ... Task n) call HDFS.copyFromLocalFile. Stage 3: parallel tasks (Task 2 ... Task n) call saveAsParquetFile.]
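
The copy stage of the diagram runs one task per exported partition. A minimal sketch of that fan-out, with two stated assumptions: it uses the `hdfs dfs -copyFromLocal` command line as a stand-in for the HDFS.copyFromLocalFile call named on the slide, and the paths and the helper names copy_command and copy_partitions are hypothetical. The Parquet conversion stage is not covered here.

```python
from concurrent.futures import ThreadPoolExecutor
import posixpath


def copy_command(local_file, hdfs_dir):
    """Build the `hdfs dfs -copyFromLocal` command for one exported partition."""
    name = posixpath.basename(local_file)
    return ["hdfs", "dfs", "-copyFromLocal", local_file,
            posixpath.join(hdfs_dir, name)]


def copy_partitions(local_files, hdfs_dir, run, max_workers=4):
    """Run one copy task per partition in parallel; results keep input order."""
    commands = [copy_command(f, hdfs_dir) for f in local_files]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Dry run: join each command into a string instead of executing it.
    files = [f"/tmp/export/part-{i:05d}.tsv" for i in range(1, 4)]
    for line in copy_partitions(files, "/data/office/raw", run=" ".join):
        print(line)
```

Passing `subprocess.check_call` as `run` on a machine with a Hadoop client would execute the copies instead of printing them.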

Hive Loader

[Diagram: Hive Loader and MetastoreDB, with entries <Database1>, <Database2>, <Table1>, <Table2>, <Partition1>, <Partition2>; the User sends queries to the Hive Thrift Server and to the Zeppelin Server.]
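
The slides do not show how the Hive Loader registers databases, tables, and partitions in MetastoreDB. One common way to make a new partition directory visible to Hive and Spark SQL is an ALTER TABLE ... ADD PARTITION statement, so the sketch below generates those statements. It is an assumption, not the team's implementation; the function name, the partition key `ds`, and the paths are hypothetical.

```python
def add_partition_statements(database, table, partition, location):
    """Build HiveQL statements that register one partition directory.

    `partition` maps partition column names to string values.
    """
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return [
        f"USE {database}",
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location}'",
    ]


if __name__ == "__main__":
    # Hypothetical names mirroring the <Database1>/<Table1>/<Partition1> boxes.
    for statement in add_partition_statements(
        "database1", "table1", {"ds": "2015-06-01"},
        "/data/database1/table1/ds=2015-06-01",
    ):
        print(statement + ";")
```

The generated statements could be sent through any Hive-compatible client, such as a connection to the Thrift server.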

User Experience in Avocado Task

User Experience in Avocado Task (Cont.)



What next?

Analytics as a Service in ASG

Data Ingest

Services

Clients

Transform Compute

Transform

Compute

Data Streams

Data Sets

Store

Event Processing

HDFS

Data Transportation

Spark Streaming Receiver

Analyst

Zeppelin Notebooks

Avocado

Simple query

Query language

“Analyze”

“Debug”

“Mine”

“Glance”

Data

Unified platform

Intelligence

Interactive analytics

Data Products

Better Digital Experiences

Dual users

“Bing”

“Office”

Synchronous experience

Asynchronous experience

The #DataAnalystHub

Questions?

@JulienJTPierre
https://linkedin.com/in/julienjtpierre
