Upload
julien-pierre
View
215
Download
1
Tags:
Embed Size (px)
Citation preview
Solving low latency query over big data with Spark SQLJulien Pierre
• Intro• Project’s mission• Use case• Learnings• Implementation details• The future of Spark in ASG
Agenda
• Senior Product Manager• ASG Shared Data team• Beijing• Working on:• Spark• Driving Microsoft internal Spark
community
• Worked on:• Office 365 Admin• Office 365 Customer Facing Reporting• Open sourced the Office 365 Splunk
App
Who am I?
Julien Pierre朱力安
What we do
Client DataFluency
Office
Skype
Bing
Modern Data Capability
Instrumentation & Ingestion
Processing & Storage
Reporting & Analytics
Information Management
Mobile-First Analytics Experience
Experimentation
Deliver interactive analytics capabilities for ASG teams to build great products
Increase analyst and engineer productivityCapitalize on our dataLeverage open technology to run a stable service
Project mission
Use Case
Where does Spark SQL help us?
Data Size
Query Latency
Kuntao Yu, engineerCustomer found an issue with the data“Why is the Created Date field = 0001/01/01?”
Is it by design or a bug?71.2 TB of Office Data in Spark
Data issue troubleshooting use case
What did Kuntao have to do? Writing & compiling query
Submitting & waiting in queueJob runtime
Get results inline in Zeppelin
Check data in Excel
Need to open the results in
Excel
1. Check affected records 2. Find root cause 3. Assert findings
Elapsed time
Cosmos
SparkSQL
SparkSQL with Cache
0 20 40 60 80 100 120 140 160 180 200
Write and Compile Query Submit and Wait in Job Queue Job Run Time
What did we learn in the process?
Enable self-serviceMinimize time to insightNotebooks at scale
Learnings
How did we do it?
Architecture
Mesos Cluster/HDFS
Job ManagerZookeepe
r
Job Frontend Web API
Spark Driver Host Pool
Spark Hive Thrift Server
Zeppelin Server
Avocado(Hive Query + Schedule Task)
Rover(Drag & Drop BI tool with Hive Code
Gen)
Zeppelin Web UI
MetastoreDBHive
Loader
Cosmos Storage
Open Source
Microsoft
Cosmos -> HDFS
Partition 1
Partition 2
...
Partition n
Export Cosmos Partition
Partition 1
Partition 2
...
Partition n
Task 2
HDFS.copyFromLocalFile
...
Task n
Partition 1
Partition 2
...
Partition n
saveAsParquetFile
Task 2
...
Task n
Hive Loader
<Database2>
<Table1><Database1>
<Partition1>
<Table2><Partition2>
MetastoreDB
Hive Thrift Server
Hive Loader
Zeppelin Server
UserQueryQuery
User Experience in Avocado Task
User Experience in Avocado Task (Cont.)
User Experience in Avocado Task (Cont.)
User Experience in Avocado Task (Cont.)
What next?
Analytics as a Service in ASG
Data Inges
tServices
Clients
Transform Compute
Transform
Compute
Data Streams
Data Sets
Store
EventProcessing
HDFSDataTransportatio
n
Spark Streaming Receiver
Analyst
ZeppelinNotebooks
Avocado
Simple query
Querylanguage
“Analyze”
“Debug”
“Mine”
“Glance”
Data
Unified platform
Intelligence
Interactive analytics
Data Products
Better Digital
Experiences
Dual users
“Bing”
“Office”Synchronou
s
experience
Asynchronous
experience
The #DataAnalystHub
Questions?
@JulienJTPierrehttps://linkedin.com/in/julienjtpierre