24
Solving low latency query over big data with Spark SQL Julien Pierre

Solving low latency query over big data with Spark SQL

Embed Size (px)

Citation preview

Page 1: Solving low latency query over big data with Spark SQL

Solving low latency query over big data with Spark SQLJulien Pierre

Page 2: Solving low latency query over big data with Spark SQL

• Intro• Project’s mission• Use case• Learnings• Implementation details• The future of Spark in ASG

Agenda

Page 3: Solving low latency query over big data with Spark SQL

• Senior Product Manager• ASG Shared Data team• Beijing• Working on:• Spark• Driving Microsoft internal Spark

community

• Worked on:• Office 365 Admin• Office 365 Customer Facing Reporting• Open sourced the Office 365 Splunk

App

Who am I?

Julien Pierre朱力安

Page 4: Solving low latency query over big data with Spark SQL

What we do

Client DataFluency

Office

Skype

Bing

Modern Data Capability

Instrumentation & Ingestion

Processing & Storage

Reporting & Analytics

Information Management

Mobile-First Analytics Experience

Experimentation

Page 5: Solving low latency query over big data with Spark SQL

Deliver interactive analytics capabilities for ASG teams to build great products

Increase analyst and engineer productivityCapitalize on our dataLeverage open technology to run a stable service

Project mission

Page 6: Solving low latency query over big data with Spark SQL

Use Case

Page 7: Solving low latency query over big data with Spark SQL

Where does Spark SQL help us?

Data Size

Query Latency

Page 8: Solving low latency query over big data with Spark SQL

Kuntao Yu, engineerCustomer found an issue with the data“Why is the Created Date field = 0001/01/01?”

Is it by design or a bug?71.2 TB of Office Data in Spark

Data issue troubleshooting use case

Page 9: Solving low latency query over big data with Spark SQL

What did Kuntao have to do? Writing & compiling query

Submitting & waiting in queueJob runtime

Get results inline in Zeppelin

Check data in Excel

Need to open the results in

Excel

1. Check affected records 2. Find root cause 3. Assert findings

Page 10: Solving low latency query over big data with Spark SQL

Elapsed time

Cosmos

SparkSQL

SparkSQL with Cache

0 20 40 60 80 100 120 140 160 180 200

Write and Compile Query Submit and Wait in Job Queue Job Run Time

Page 11: Solving low latency query over big data with Spark SQL

What did we learn in the process?

Page 12: Solving low latency query over big data with Spark SQL

Enable self-serviceMinimize time to insightNotebooks at scale

Learnings

Page 13: Solving low latency query over big data with Spark SQL

How did we do it?

Page 14: Solving low latency query over big data with Spark SQL

Architecture

Mesos Cluster/HDFS

Job ManagerZookeepe

r

Job Frontend Web API

Spark Driver Host Pool

Spark Hive Thrift Server

Zeppelin Server

Avocado(Hive Query + Schedule Task)

Rover(Drag & Drop BI tool with Hive Code

Gen)

Zeppelin Web UI

MetastoreDBHive

Loader

Cosmos Storage

Open Source

Microsoft

Page 15: Solving low latency query over big data with Spark SQL

Cosmos -> HDFS

Partition 1

Partition 2

...

Partition n

Export Cosmos Partition

Partition 1

Partition 2

...

Partition n

Task 2

HDFS.copyFromLocalFile

...

Task n

Partition 1

Partition 2

...

Partition n

saveAsParquetFile

Task 2

...

Task n

Page 16: Solving low latency query over big data with Spark SQL

Hive Loader

<Database2>

<Table1><Database1>

<Partition1>

<Table2><Partition2>

MetastoreDB

Hive Thrift Server

Hive Loader

Zeppelin Server

UserQueryQuery

Page 17: Solving low latency query over big data with Spark SQL

User Experience in Avocado Task

Page 18: Solving low latency query over big data with Spark SQL

User Experience in Avocado Task (Cont.)

Page 19: Solving low latency query over big data with Spark SQL

User Experience in Avocado Task (Cont.)

Page 20: Solving low latency query over big data with Spark SQL

User Experience in Avocado Task (Cont.)

Page 21: Solving low latency query over big data with Spark SQL

What next?

Page 22: Solving low latency query over big data with Spark SQL

Analytics as a Service in ASG

Data Inges

tServices

Clients

Transform Compute

Transform

Compute

Data Streams

Data Sets

Store

EventProcessing

HDFSDataTransportatio

n

Spark Streaming Receiver

Analyst

ZeppelinNotebooks

Avocado

Simple query

Querylanguage

“Analyze”

“Debug”

“Mine”

“Glance”

Data

Unified platform

Intelligence

Interactive analytics

Data Products

Better Digital

Experiences

Dual users

“Bing”

“Office”Synchronou

s

experience

Asynchronous

experience

Page 23: Solving low latency query over big data with Spark SQL

The #DataAnalystHub

Page 24: Solving low latency query over big data with Spark SQL

Questions?

@JulienJTPierrehttps://linkedin.com/in/julienjtpierre