View
6
Download
0
Category
Preview:
Citation preview
Query Execution in Column-Oriented Database Systems
Daniel AbadiSIGMOD Jim Gray Doctoral Dissertation Award
Research Advisor: Sam Madden
July 2 nd, 2009
SIGMOD 2009 Daniel Abadi - Yale University 2
Acknowledgments
� Mike Stonebraker (C-Store paper)� Miguel Ferreira (Compression Paper)� Stavros Harizopoulos (Performance
Paper)� Daniel Myers, David DeWitt (Tuple
reconstruction paper)� Adam Marcus (RDF Paper)� Nabil Hachem (Column-stores vs row-
stores paper)
SIGMOD 2009 Daniel Abadi - Yale University 3
Acknowledgments
� C-Store Team• Brandeis University
• Mitch Cherniack• Nga Tran • Adam Batkin• Tien Hoang
• Brown University• Stan Zdonik• Alexander Rasin• Tingjian Ge
• UMass Boston• Pat O’Neil• Betty O’Neil• Xuedong Chen
• MIT (people on previous page, plus Amerson Lin, Edmond Lau, and Velen Liang)
SIGMOD 2009 Daniel Abadi - Yale University 4
Acknowledgments
Sam Madden before serving as my PhD Advisor
Sam Madden after getting me through the PhD program
SIGMOD 2009 Daniel Abadi - Yale University 5
Related Dissertations
� P.A. Boncz. Monet: A Next Generation DBMS Kernel For Query-Intensive Applications (2002). PhD Advisor: Martin Kersten.
� Coming soon:• Marcin Zukowski (CWI)• Stratos Idreos (CWI)
SIGMOD 2009 Daniel Abadi - Yale University 6
Row vs. Column-Stores
StreetAddressPhone #E-mail
FirstName
LastName
Last Name
FirstName E-mail Phone #
StreetAddress
Row Store Column Store
− Might read in unnecessary data
+ Only need to read in relevant attributes
+ Easy to add a new record
− Tuple writes might require multiple seeks
SIGMOD 2009 Daniel Abadi - Yale University 7
Column-Stores
� Really good for read-mostly data warehouses• Lot’s of column scans and aggregations• Writes tend to be in batch
SIGMOD 2009 Daniel Abadi - Yale University 8
Column-Store History
StreetAddressPhone #E-mail
FirstName
LastName
Last Name
FirstName E-mail
1
2
3
1
2
3
1
2
3
1970s/80s: Decomposed Storage Model (DSM)/Vertical
Partitioning
…
SIGMOD 2009 Daniel Abadi - Yale University 9
SSBM Averages
79.9
11.7
25.7
0
10
20
30
40
50
60
70
80
90
Row-Store Row-Store (Minimum I/O) Row-Store (Vertical lyPartitioned)
SIGMOD 2009 Daniel Abadi - Yale University 10
Tuple Size
1
2
3
ColumnData
TID
1
2
3
TID ColumnData
1
2
3
TID ColumnData
TupleHeader
•Queries touch 3 - 4 foreign keys in fact table, 1 - 2 numeric columns
•Complete fact table takes up ~4 GB (compressed)
•Vertically partitioned tables take up 0.7 - 1.1 GB (compressed)
SIGMOD 2009 Daniel Abadi - Yale University 11
Query Optimizer
� Each column is a separate table
� Each attribute accessed in a query becomes a join
� Optimizer becomes hopelessly overwhelmed
SIGMOD 2009 Daniel Abadi - Yale University 12
Tuple ID problem easy to solve
Last Name
FirstName E-mail
1
2
3
1
2
3
1
2
3
…
� Store columns in original tuple order, remove Tuple IDs
� Store tuple header in separate column
SIGMOD 2009 Daniel Abadi - Yale University 13
Heuristics To Solve Optimizer Problem
� For each query, figure out what attributes will be accessed
� For each table, first join all accessed attributes into normal row-store tuples for that table
� Then proceed with regular row-store optimization and execution
SIGMOD 2009 Daniel Abadi - Yale University 14
Result After These Simple Fixes
25.7
11.7
79.9
40.7
0
10
20
30
40
50
60
70
80
90
Row-Store Row-Store(Minimum I/O)
Row-Store(Vertically
Partitioned)
Column-Store(Storage Layer)
SIGMOD 2009 Daniel Abadi - Yale University 15
Final Result After Dissertation
25.7
11.7
79.9
40.7
4.40
10
20
30
40
50
60
70
80
90
Row-Store Row-Store(Minimum I/O)
Row-Store(Vertically
Partitioned)
Column-Store(Storage Layer)
Column-Store(Complete)
Central Focus ofDissertation
SIGMOD 2009 Daniel Abadi - Yale University 16
Overview Of Dissertation
ID Age Sal Dept.ID Loc. Size
DepartmentEmployee
Construct
User SELECT AVG(salary)FROM EmployeeWHERE Age > 50Aggregate
Select
Architecture: Storage Layer, Data Flow, High Performance Data Access and Iteration
1
Build Compression into Executor2
Carefully Placed Construction Operators
3
Benchmark Performance on SSBM
4
Benchmark Performance on RDF
5
SIGMOD 2009 Daniel Abadi - Yale University 17
Compression
� Trades I/O for CPU� Increased column-store opportunities:
• Higher data value locality in column stores• More Techniques Useful• Can use extra space to store multiple
copies of data in different sort orders
SIGMOD 2009 Daniel Abadi - Yale University 18
Column-Oriented Compression
Q1Q1Q1Q1Q1Q1Q1
Q2Q2Q2Q2
…
…
1111122
1112
…
…
DayQuarter (Q1, 1, 300)
(1, 1, 5)
…
…
Day
Quarter
(Q2, 301, 350)
(Q3, 651, 500)
(Q4, 1151, 600)
(2, 6, 2)
(1, 301, 3)(2, 304, 1)
ElectronicsCategory
5729685
3814
…
…
Price
5729685
3814
…
…
Price
ClothingElectronics
FoodFoodFood
Clothing
ElectronicsElectronics
ClothingElectronics
…
1Category
2 13 3 32 11 2 1
…
AND/OR
1010000 … 1101Electronics
0100001 … 0010Clothing
0001110 … 0000Food
Run-Length Encoding
Run-Length Encoding
Dictionary Encoding
Bit-Vector Encoding
SIGMOD 2009 Daniel Abadi - Yale University 19
Other Compression Techniques
� Null Suppression� Differential or Frame of Reference� Huffman� Arithmetic� Lempel-Ziv
SIGMOD 2009 Daniel Abadi - Yale University 20
� I/O - CPU tradeoff is no longer a tradeoff
� Reduces memory–CPU bandwidth requirements
� Opens up possibility of operating on multiple records at once
Operating Directly onCompressed Data
SIGMOD 2009 Daniel Abadi - Yale University 21
Operating Directly onCompressed Data
10011001000
…
…
Product IDQuarter
(1, 3)
(2, 1)
(3, 2)
1
00100010010
…
…
2
01000100101
…
…
3ProductID, COUNT(*))
(Q1, 1, 300)
(Q2, 301, 6)
(Q3, 307, 500)
(Q4, 807, 600)
Index Lookup + Offset jump
SELECT ProductID, Count(*)FROM tableWHERE (Quarter = Q2)GROUP BY ProductID
301-306
SIGMOD 2009 Daniel Abadi - Yale University 22
Operating Directly onCompressed Data
(Q1, 1, 300)
(Q2, 301, 6)
(Q3, 307, 500)
(Q4, 807, 600)
Data
isOneValue()isValueSorted()isPosContiguous()isSparse()
getNext()decompressIntoArray()getValueAtPosition(pos)
applyPredicate(pred)getSize()
Block API
Data SourceOperator
SelectionOperator
AggregationOperator
SELECT ProductID, Count(*)FROM tableWHERE (Quarter = Q2)GROUP BY ProductID
SIGMOD 2009 Daniel Abadi - Yale University 23
When should tuples be constructed?
� At some point in query plan, tuples must be constructed, “materialization”
• Early materialization = create rows at beginning of query plan
• Late materialization = operate on columns as long as possible
SIGMOD 2009 Daniel Abadi - Yale University 24
When should tuples be constructed?
� Solution 1: Create rows first (EM). But:• Need to construct ALL tuples• Need to decompress data• Poor memory bandwidth
utilization
2131
2333
7134280
(4,1,4)
Construct
2
3
3
3
7
13
42
80
Select + Aggregate
2
1
3
1
4
4
4
4
prodID storeID custID price
QUERY:
SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND
(storeID = 1) ANDGROUP BY custID
SIGMOD 2009 Daniel Abadi - Yale University 25
DataSourceprodID
DataSourcestoreID
AND
AGG
DataSourcecustID
DataSourceprice
1111
4444
0101
2131
prodID storeID
DataSource
DataSource
QUERY:
SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND
(storeID = 1) ANDGROUP BY custID
2131
2333
7134280
4444
prodID storeID custID price
Solution 2: Operate on columns
SIGMOD 2009 Daniel Abadi - Yale University 26
Solution 2: Operate on columns
1111
0101
AND
0101
QUERY:
SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND
(storeID = 1) ANDGROUP BY custID
AND
AGG
DataSourcecustID
DataSourceprice
2131
2333
7134280
4444
prodID storeID custID price
DataSourceprodID
DataSourcestoreID
SIGMOD 2009 Daniel Abadi - Yale University 27
33
Solution 2: Operate on columns
0101
0101
DataSource
DataSource
0101
7134280
0101
2333
111380
custID price
QUERY:
SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND
(storeID = 1) ANDGROUP BY custID
AND
AGG
DataSourcecustID
DataSourceprice
2131
2333
7134280
4444
prodID storeID custID price
DataSourceprodID
DataSourcestoreID
SIGMOD 2009 Daniel Abadi - Yale University 28
Solution 2: Operate on columns
1133
111380
AGG
13 193
QUERY:
SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND
(storeID = 1) ANDGROUP BY custID
AND
AGG
DataSourcecustID
DataSourceprice
2131
2333
7134280
4444
prodID storeID custID price
DataSourceprodID
DataSourcestoreID
SIGMOD 2009 Daniel Abadi - Yale University 29
Materialization Chapter
� Describes how to implement LM and EM plans in a column-store (C-Store)
� Explores the tradeoffs between the two types of plans using:• An analytical model• Experimental results
� Presents heuristics to help a plan generator choose between EM and LM plans
SIGMOD 2009 Daniel Abadi - Yale University 30
Summary of Contributions
� Introduced a query executer that:• Is optimized for column-oriented data layout• Designed for compressed data• Constructs tuples at the right time
� Bunch of other innovations• Invisible Join• Optional block processing• Pseudo-iterative execution pipelining
� Orders of magnitude performance gains• Factor of 9-10 faster than naïve column-store• 1-2 orders of magnitude faster than row-store
SIGMOD 2009 Daniel Abadi - Yale University 31
For More, See:
� C-Store: A Column-Oriented DBMS (VLDB 2005)� Integrating Compression and Execution in Column-
Oriented Database Systems (SIGMOD 2006)� Performance Tradeoffs in Read-Optimized Databases
(VLDB 2006)� Materialization Strategies in a Column-Oriented
DBMS (ICDE 2007)� Column Stores for Wide and Sparse Data (CIDR 2007)� Scalable Semantic Web Data Management Using
Vertical Partitioning (VLDB 2007)� Column-Stores vs. Row-Stores: How Different Are
They Really? (SIGMOD 2008)� SW-Store: A Vertically Partitioned DBMS for Semanti c
Web Data Management (VLDBJ 2009)
SIGMOD 2009 Daniel Abadi - Yale University 32
Come Join the Yale DB Group!
SIGMOD 2009 Daniel Abadi - Yale University 33
More Plugs
� Blog:• http://dbmsmusings.blogspot.com/
� Twitter:• @abadid
� New Research Project:• HadoopDB (see VLDB paper)
Recommended