Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
S. Melnik, A. Gubarev, J. Long, G. Romer,
S. Shivakumar, M. Tolton
Google Inc.
VLDB 2010
Presented by Ke Hong
(slide adapted from Melnik’s) 1
Dremel: Interactive Analysis of Web-
Scale Datasets
Outline
2
Problem
Motivation
Nested Columnar Storage
Query Execution
Evaluation
Summary
Interactive Query
3
Run a MapReduce to extract billions of signals from web
pages
Launch Dremel to execute several interactive commands
and extract an irregularity in signal1
Feed the incoming input data into another MapReduce
pipeline
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal1, 100), COUNT(*) FROM t
. . .
Problem
4
How to do SQL-like aggregation queries
On trillion-row tables
On petabytes of data
Using thousands of CPUs
Executed within seconds
Serving thousands of concurrent users
Motivation
5
Analysis of crawled web documents
Tracking install data for applications on Android Market
Crash reporting for Google products
Nested Columnar Storage
OCR results from Google Books
Spam analysis
Debugging of map tiles on Google Maps
Tablet migrations in managed Bigtable instances
Results of tests run on Google's distributed build system
Disk I/O statistics for hundreds of thousands of disks
Resource monitoring for jobs run in Google's data centers
Symbols and dependencies in Google's codebase
Motivation
6
Parallel DBMS
Limited scalability
Limited interactive speed
Not fault tolerant
Translating SQL into MapReduce jobs
HadoopDB, Pig Latin, Hive
One job takes thousands of seconds
Hard to support multi-user
Dremel Data Layout
7
Nested structure
τ = dom | <A1:τ[*|?], ..., An:τ[*|?]>
τ as an atomic or record type
* as repeated fields
? as optional fields
A
B
C D
E
*
*
*
. . .
. . .
r1
r2
r1
r2
r1
r2
r1
r2
column-oriented row-oriented
Read less,
cheaper
decompression
Row vs. Column
8
Execution time on 3000 nodes to process an 85 billion record table
87 TB 0.5 TB 0.5 TB
sec
Nested Data Model
9
message Document {
required int64 DocId; [1,1] optional group Links {
repeated int64 Backward; [0,*]
repeated int64 Forward;
}
repeated group Name {
repeated group Language {
required string Code;
optional string Country; [0,1]
}
optional string Url;
}
}
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
multiplicity:
Column-Striped Representation
10
value r d
10 0 0
20 0 0
DocId
value r d
http://A 0 2
http://B 1 2
NULL 1 1
http://C 0 2
Name.Url
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code Name.Language.Country
Links.Backward Links.Forward
value r d
us 0 3
NULL 2 2
NULL 1 1
gb 1 3
NULL 0 1
value r d
20 0 2
40 1 2
60 1 2
80 0 2
value r d
NULL 0 1
10 0 2
30 1 2
Repetition & Definition
Level
11
value r d
en-us 0 2
en
NULL
en-gb
NULL
Name.Language.Code
r1.Name1.Language1.Code: 'en-us'
r1.Name1.Language2.Code: 'en'
r1.Name2
r1.Name3.Language1.Code: 'en-gb'
r2.Name1
: common prefix
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
Repetition and definition levels encode the
structural delta between the current value
and the previous value.
r: length of common path prefix
d: length of the current path
Repetition & Definition
Level
12
value r d
en-us 0 2
en 2 2
NULL
en-gb
NULL
Name.Language.Code
r1.Name1.Language1.Code: 'en-us'
r1.Name1.Language2.Code: 'en'
r1.Name2
r1.Name3.Language1.Code: 'en-gb'
r2.Name1
: common prefix
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
Repetition and definition levels encode the
structural delta between the current value
and the previous value.
r: length of common path prefix
d: length of the current path
Repetition & Definition
Level
13
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb
NULL
Name.Language.Code
r1.Name1.Language1.Code: 'en-us'
r1.Name1.Language2.Code: 'en'
r1.Name2
r1.Name3.Language1.Code: 'en-gb'
r2.Name1
: common prefix
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
Repetition and definition levels encode the
structural delta between the current value
and the previous value.
r: length of common path prefix
d: length of the current path
Repetition & Definition
Level
14
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL
Name.Language.Code
r1.Name1.Language1.Code: 'en-us'
r1.Name1.Language2.Code: 'en'
r1.Name2
r1.Name3.Language1.Code: 'en-gb'
r2.Name1
: common prefix
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
Repetition and definition levels encode the
structural delta between the current value
and the previous value.
r: length of common path prefix
d: length of the current path
Repetition & Definition
Level
15
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Code
r1.Name1.Language1.Code: 'en-us'
r1.Name1.Language2.Code: 'en'
r1.Name2
r1.Name3.Language1.Code: 'en-gb'
r2.Name1
: common prefix
DocId: 10
Links
Forward: 20
Forward: 40
Forward: 60
Name
Language
Code: 'en-us'
Country: 'us'
Language
Code: 'en'
Url: 'http://A'
Name
Url: 'http://B'
Name
Language
Code: 'en-gb'
Country: 'gb'
DocId: 20
Links
Backward: 10
Backward: 30
Forward: 80
Name
Url: 'http://C'
Repetition and definition levels encode the
structural delta between the current value
and the previous value.
r: length of common path prefix
d: length of the current path
Record Assembly
16
Name.Language.Country Name.Language.Code
Links.Backward Links.Forward
Name.Url
DocId
1
0
1
0
0,1, 2
2
0,1 1
0
0
Transitions labeled with
repetition levels
Record Assembly
17
Simpler FSM to retrieve a subset of fields
Cheaper to execute
Useful for queries like
/Name[2]/Language[1]/Country
DocId
Name.Language.Country 1,2
0
0
DocId: 10
Name
Language
Country: 'us'
Language
Name
Name
Language
Country: 'gb'
DocId: 20
Name
s1
s2
Query Execution
18
Id: 10
Name
Cnt: 2
Language
Str: 'http://A,en-us'
Str: 'http://A,en'
Name
Cnt: 0
t1
SELECT DocId AS Id,
COUNT(Name.Language.Code) WITHIN Name AS Cnt,
Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
message QueryResult {
required int64 Id;
repeated group Name {
optional uint64 Cnt;
repeated group Language {
optional string Str;
}
}
}
Output table Output schema
Multi-Level Serving Tree
19 storage layer (e.g., GFS)
. . .
. . .
. . . leaf servers
(with local storage)
intermediate servers
root server
client
• Parallelize scheduling and
aggregation
• Mostly one-pass aggregation
• Well suited for aggregation
queries returning small and
medium-sized results (<1M
records)
• Common class of interactive
queries
Multi-Level Serving Tree
20
root server
client
• Receive a query from client,
determine all tablets
Multi-Level Serving Tree
21
. . . intermediate servers
root server
client
• Rewrite the query and route it
to intermediate servers
Multi-Level Serving Tree
22
. . .
. . .
. . . leaf servers
(with local storage)
intermediate servers
root server
client
• Rewrite the query into a set of
partial queries and assign them
to leaf servers
Multi-Level Serving Tree
23 storage layer (e.g., GFS)
. . .
. . .
. . . leaf servers
(with local storage)
intermediate servers
root server
client
• Access assigned tablets from the
storage layer
Example: count()
24
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
SELECT A, SUM(c)
FROM (𝐑𝟏𝟏 UNION ALL … 𝐑𝐧
𝟏)
GROUP BY A
0
Example: count()
25
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
SELECT A, SUM(c)
FROM (𝐑𝟏𝟏 UNION ALL … 𝐑𝐧
𝟏)
GROUP BY A
SELECT A, COUNT(B) AS c
FROM 𝐓𝟏𝟏 GROUP BY A
𝐓𝟏𝟏 = {/gfs/1, …, /gfs/10000}
SELECT A, COUNT(B) AS c
FROM 𝐓𝟐𝟏 GROUP BY A
𝐓𝟐𝟏 = {/gfs/10001, …, /gfs/20000}
. . .
0
1
𝐑𝟏𝟏 𝐑𝟐
𝟏
Example: count()
26
SELECT A, COUNT(B) FROM T
GROUP BY A
T = {/gfs/1, /gfs/2, …, /gfs/100000}
SELECT A, SUM(c)
FROM (𝐑𝟏𝟏 UNION ALL … 𝐑𝐧
𝟏)
GROUP BY A
SELECT A, COUNT(B) AS c
FROM 𝐓𝟏𝟏 GROUP BY A
𝐓𝟏𝟏 = {/gfs/1, …, /gfs/10000}
SELECT A, COUNT(B) AS c
FROM 𝐓𝟐𝟏 GROUP BY A
𝐓𝟐𝟏 = {/gfs/10001, …, /gfs/20000}
SELECT A, COUNT(B) AS c
FROM 𝐓𝟑𝟏 GROUP BY A
𝐓𝟑𝟏 = {/gfs/1}
. . .
0
1
3
𝐑𝟏𝟏 𝐑𝟐
𝟏
Data access ops
. . .
Multi-Level Serving Tree
27 storage layer (e.g., GFS)
. . . leaf servers
(with local storage)
• Scan assigned tablets to return
partial results to intermediate
servers
Multi-Level Serving Tree
28 storage layer (e.g., GFS)
. . .
. . .
. . . leaf servers
(with local storage)
intermediate servers
• Parallel aggregation of partial
results
Multi-Level Serving Tree
29 storage layer (e.g., GFS)
. . .
. . .
. . . leaf servers
(with local storage)
intermediate servers
root server
client
• Return the aggregate query
result to client
Query Dispatcher
30
Dremel as a multi-user system
Schedule queries based on priorities and load balancing
Fault tolerance
When the number of tablets processed in one query is
larger than the number of available execution threads on
leaf servers
Compute a histogram of tablet processing times
Reschedule tablet processing on another server if a tablet takes
disproportionally long time
Internal execution tree as a physical query execution plan
Evaluation
31
Scalability of Dremel execution time (sec)
number of leaf servers
SELECT TOP(aid, 20), COUNT(*) FROM T4
WHERE bid = {value1} AND cid = {value2}
On a trillion-row table T4 (105TB, 50 fields):
Evaluation
32
Interactive speed of Dremel
execution time (sec)
percentage
of queries
Most queries complete within 10 sec
Monthly query workload
of one 3000-node Dremel
instance
Summary
33
Propose a scenario for interactive queries on web-scale
datasets
Design a columnar storage for Dremel’s nested data
model
Develop SQL-like Dremel’s query language
Design a multi-level serving tree for efficient query
execution
Evaluate Dremel’s scalability and interactive speed using
web-scale workloads