33
Query Execution in Column- Oriented Database Systems Daniel Abadi SIGMOD Jim Gray Doctoral Dissertation Award Research Advisor: Sam Madden July 2 nd , 2009

Query Execution in Column- Oriented Database Systemsabadi/talks/abadi-sigmod-award.pdf · Query Execution in Column-Oriented Database Systems Daniel Abadi SIGMOD Jim Gray Doctoral

  • Upload
    others

  • View
    26

  • Download
    0

Embed Size (px)

Citation preview

Query Execution in Column-Oriented Database Systems

Daniel AbadiSIGMOD Jim Gray Doctoral Dissertation Award

Research Advisor: Sam Madden

July 2 nd, 2009

SIGMOD 2009 Daniel Abadi - Yale University 2

Acknowledgments

� Mike Stonebraker (C-Store paper)� Miguel Ferreira (Compression Paper)� Stavros Harizopoulos (Performance

Paper)� Daniel Myers, David DeWitt (Tuple

reconstruction paper)� Adam Marcus (RDF Paper)� Nabil Hachem (Column-stores vs row-

stores paper)

SIGMOD 2009 Daniel Abadi - Yale University 3

Acknowledgments

� C-Store Team• Brandeis University

• Mitch Cherniack• Nga Tran • Adam Batkin• Tien Hoang

• Brown University• Stan Zdonik• Alexander Rasin• Tingjian Ge

• UMass Boston• Pat O’Neil• Betty O’Neil• Xuedong Chen

• MIT (people on previous page, plus Amerson Lin, Edmond Lau, and Velen Liang)

SIGMOD 2009 Daniel Abadi - Yale University 4

Acknowledgments

Sam Madden before serving as my PhD Advisor

Sam Madden after getting me through the PhD program

SIGMOD 2009 Daniel Abadi - Yale University 5

Related Dissertations

� P.A. Boncz. Monet: A Next Generation DBMS Kernel For Query-Intensive Applications (2002). PhD Advisor: Martin Kersten.

� Coming soon:• Marcin Zukowski (CWI)• Stratos Idreos (CWI)

SIGMOD 2009 Daniel Abadi - Yale University 6

Row vs. Column-Stores

StreetAddressPhone #E-mail

FirstName

LastName

Last Name

FirstName E-mail Phone #

StreetAddress

Row Store Column Store

− Might read in unnecessary data

+ Only need to read in relevant attributes

+ Easy to add a new record

− Tuple writes might require multiple seeks

SIGMOD 2009 Daniel Abadi - Yale University 7

Column-Stores

� Really good for read-mostly data warehouses• Lot’s of column scans and aggregations• Writes tend to be in batch

SIGMOD 2009 Daniel Abadi - Yale University 8

Column-Store History

StreetAddressPhone #E-mail

FirstName

LastName

Last Name

FirstName E-mail

1

2

3

1

2

3

1

2

3

1970s/80s: Decomposed Storage Model (DSM)/Vertical

Partitioning

SIGMOD 2009 Daniel Abadi - Yale University 9

SSBM Averages

79.9

11.7

25.7

0

10

20

30

40

50

60

70

80

90

Row-Store Row-Store (Minimum I/O) Row-Store (Vertical lyPartitioned)

SIGMOD 2009 Daniel Abadi - Yale University 10

Tuple Size

1

2

3

ColumnData

TID

1

2

3

TID ColumnData

1

2

3

TID ColumnData

TupleHeader

•Queries touch 3 - 4 foreign keys in fact table, 1 - 2 numeric columns

•Complete fact table takes up ~4 GB (compressed)

•Vertically partitioned tables take up 0.7 - 1.1 GB (compressed)

SIGMOD 2009 Daniel Abadi - Yale University 11

Query Optimizer

� Each column is a separate table

� Each attribute accessed in a query becomes a join

� Optimizer becomes hopelessly overwhelmed

SIGMOD 2009 Daniel Abadi - Yale University 12

Tuple ID problem easy to solve

Last Name

FirstName E-mail

1

2

3

1

2

3

1

2

3

� Store columns in original tuple order, remove Tuple IDs

� Store tuple header in separate column

SIGMOD 2009 Daniel Abadi - Yale University 13

Heuristics To Solve Optimizer Problem

� For each query, figure out what attributes will be accessed

� For each table, first join all accessed attributes into normal row-store tuples for that table

� Then proceed with regular row-store optimization and execution

SIGMOD 2009 Daniel Abadi - Yale University 14

Result After These Simple Fixes

25.7

11.7

79.9

40.7

0

10

20

30

40

50

60

70

80

90

Row-Store Row-Store(Minimum I/O)

Row-Store(Vertically

Partitioned)

Column-Store(Storage Layer)

SIGMOD 2009 Daniel Abadi - Yale University 15

Final Result After Dissertation

25.7

11.7

79.9

40.7

4.40

10

20

30

40

50

60

70

80

90

Row-Store Row-Store(Minimum I/O)

Row-Store(Vertically

Partitioned)

Column-Store(Storage Layer)

Column-Store(Complete)

Central Focus ofDissertation

SIGMOD 2009 Daniel Abadi - Yale University 16

Overview Of Dissertation

ID Age Sal Dept.ID Loc. Size

DepartmentEmployee

Construct

User SELECT AVG(salary)FROM EmployeeWHERE Age > 50Aggregate

Select

Architecture: Storage Layer, Data Flow, High Performance Data Access and Iteration

1

Build Compression into Executor2

Carefully Placed Construction Operators

3

Benchmark Performance on SSBM

4

Benchmark Performance on RDF

5

SIGMOD 2009 Daniel Abadi - Yale University 17

Compression

� Trades I/O for CPU� Increased column-store opportunities:

• Higher data value locality in column stores• More Techniques Useful• Can use extra space to store multiple

copies of data in different sort orders

SIGMOD 2009 Daniel Abadi - Yale University 18

Column-Oriented Compression

Q1Q1Q1Q1Q1Q1Q1

Q2Q2Q2Q2

1111122

1112

DayQuarter (Q1, 1, 300)

(1, 1, 5)

Day

Quarter

(Q2, 301, 350)

(Q3, 651, 500)

(Q4, 1151, 600)

(2, 6, 2)

(1, 301, 3)(2, 304, 1)

ElectronicsCategory

5729685

3814

Price

5729685

3814

Price

ClothingElectronics

FoodFoodFood

Clothing

ElectronicsElectronics

ClothingElectronics

1Category

2 13 3 32 11 2 1

AND/OR

1010000 … 1101Electronics

0100001 … 0010Clothing

0001110 … 0000Food

Run-Length Encoding

Run-Length Encoding

Dictionary Encoding

Bit-Vector Encoding

SIGMOD 2009 Daniel Abadi - Yale University 19

Other Compression Techniques

� Null Suppression� Differential or Frame of Reference� Huffman� Arithmetic� Lempel-Ziv

SIGMOD 2009 Daniel Abadi - Yale University 20

� I/O - CPU tradeoff is no longer a tradeoff

� Reduces memory–CPU bandwidth requirements

� Opens up possibility of operating on multiple records at once

Operating Directly onCompressed Data

SIGMOD 2009 Daniel Abadi - Yale University 21

Operating Directly onCompressed Data

10011001000

Product IDQuarter

(1, 3)

(2, 1)

(3, 2)

1

00100010010

2

01000100101

3ProductID, COUNT(*))

(Q1, 1, 300)

(Q2, 301, 6)

(Q3, 307, 500)

(Q4, 807, 600)

Index Lookup + Offset jump

SELECT ProductID, Count(*)FROM tableWHERE (Quarter = Q2)GROUP BY ProductID

301-306

SIGMOD 2009 Daniel Abadi - Yale University 22

Operating Directly onCompressed Data

(Q1, 1, 300)

(Q2, 301, 6)

(Q3, 307, 500)

(Q4, 807, 600)

Data

isOneValue()isValueSorted()isPosContiguous()isSparse()

getNext()decompressIntoArray()getValueAtPosition(pos)

applyPredicate(pred)getSize()

Block API

Data SourceOperator

SelectionOperator

AggregationOperator

SELECT ProductID, Count(*)FROM tableWHERE (Quarter = Q2)GROUP BY ProductID

SIGMOD 2009 Daniel Abadi - Yale University 23

When should tuples be constructed?

� At some point in query plan, tuples must be constructed, “materialization”

• Early materialization = create rows at beginning of query plan

• Late materialization = operate on columns as long as possible

SIGMOD 2009 Daniel Abadi - Yale University 24

When should tuples be constructed?

� Solution 1: Create rows first (EM). But:• Need to construct ALL tuples• Need to decompress data• Poor memory bandwidth

utilization

2131

2333

7134280

(4,1,4)

Construct

2

3

3

3

7

13

42

80

Select + Aggregate

2

1

3

1

4

4

4

4

prodID storeID custID price

QUERY:

SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND

(storeID = 1) ANDGROUP BY custID

SIGMOD 2009 Daniel Abadi - Yale University 25

DataSourceprodID

DataSourcestoreID

AND

AGG

DataSourcecustID

DataSourceprice

1111

4444

0101

2131

prodID storeID

DataSource

DataSource

QUERY:

SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND

(storeID = 1) ANDGROUP BY custID

2131

2333

7134280

4444

prodID storeID custID price

Solution 2: Operate on columns

SIGMOD 2009 Daniel Abadi - Yale University 26

Solution 2: Operate on columns

1111

0101

AND

0101

QUERY:

SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND

(storeID = 1) ANDGROUP BY custID

AND

AGG

DataSourcecustID

DataSourceprice

2131

2333

7134280

4444

prodID storeID custID price

DataSourceprodID

DataSourcestoreID

SIGMOD 2009 Daniel Abadi - Yale University 27

33

Solution 2: Operate on columns

0101

0101

DataSource

DataSource

0101

7134280

0101

2333

111380

custID price

QUERY:

SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND

(storeID = 1) ANDGROUP BY custID

AND

AGG

DataSourcecustID

DataSourceprice

2131

2333

7134280

4444

prodID storeID custID price

DataSourceprodID

DataSourcestoreID

SIGMOD 2009 Daniel Abadi - Yale University 28

Solution 2: Operate on columns

1133

111380

AGG

13 193

QUERY:

SELECT custID,SUM(price)FROM tableWHERE (prodID = 4) AND

(storeID = 1) ANDGROUP BY custID

AND

AGG

DataSourcecustID

DataSourceprice

2131

2333

7134280

4444

prodID storeID custID price

DataSourceprodID

DataSourcestoreID

SIGMOD 2009 Daniel Abadi - Yale University 29

Materialization Chapter

� Describes how to implement LM and EM plans in a column-store (C-Store)

� Explores the tradeoffs between the two types of plans using:• An analytical model• Experimental results

� Presents heuristics to help a plan generator choose between EM and LM plans

SIGMOD 2009 Daniel Abadi - Yale University 30

Summary of Contributions

� Introduced a query executer that:• Is optimized for column-oriented data layout• Designed for compressed data• Constructs tuples at the right time

� Bunch of other innovations• Invisible Join• Optional block processing• Pseudo-iterative execution pipelining

� Orders of magnitude performance gains• Factor of 9-10 faster than naïve column-store• 1-2 orders of magnitude faster than row-store

SIGMOD 2009 Daniel Abadi - Yale University 31

For More, See:

� C-Store: A Column-Oriented DBMS (VLDB 2005)� Integrating Compression and Execution in Column-

Oriented Database Systems (SIGMOD 2006)� Performance Tradeoffs in Read-Optimized Databases

(VLDB 2006)� Materialization Strategies in a Column-Oriented

DBMS (ICDE 2007)� Column Stores for Wide and Sparse Data (CIDR 2007)� Scalable Semantic Web Data Management Using

Vertical Partitioning (VLDB 2007)� Column-Stores vs. Row-Stores: How Different Are

They Really? (SIGMOD 2008)� SW-Store: A Vertically Partitioned DBMS for Semanti c

Web Data Management (VLDBJ 2009)

SIGMOD 2009 Daniel Abadi - Yale University 32

Come Join the Yale DB Group!

SIGMOD 2009 Daniel Abadi - Yale University 33

More Plugs

� Blog:• http://dbmsmusings.blogspot.com/

� Twitter:• @abadid

� New Research Project:• HadoopDB (see VLDB paper)