
Query Execution in Column-Oriented Database Systems

Daniel Abadi

SIGMOD Jim Gray Doctoral Dissertation Award

Research Advisor: Sam Madden

July 2nd, 2009

SIGMOD 2009, Daniel Abadi - Yale University

Acknowledgments

• Mike Stonebraker (C-Store paper)
• Miguel Ferreira (Compression paper)
• Stavros Harizopoulos (Performance paper)
• Daniel Myers, David DeWitt (Tuple reconstruction paper)
• Adam Marcus (RDF paper)
• Nabil Hachem (Column-stores vs. row-stores paper)

Acknowledgments

• C-Store Team
  • Brandeis University
    • Mitch Cherniack
    • Nga Tran
    • Adam Batkin
    • Tien Hoang
  • Brown University
    • Stan Zdonik
    • Alexander Rasin
    • Tingjian Ge
  • UMass Boston
    • Pat O’Neil
    • Betty O’Neil
    • Xuedong Chen
  • MIT (people on previous page, plus Amerson Lin, Edmond Lau, and Velen Liang)

Acknowledgments

Sam Madden before serving as my PhD Advisor

Sam Madden after getting me through the PhD program

Related Dissertations

• P.A. Boncz. Monet: A Next Generation DBMS Kernel For Query-Intensive Applications (2002). PhD Advisor: Martin Kersten.
• Coming soon:
  • Marcin Zukowski (CWI)
  • Stratos Idreos (CWI)

Row vs. Column-Stores

[Figure: a table with attributes Last Name, First Name, E-mail, Phone #, Street Address, laid out record by record (row store) and attribute by attribute (column store)]

Row Store
+ Easy to add a new record
− Might read in unnecessary data

Column Store
+ Only need to read in relevant attributes
− Tuple writes might require multiple seeks
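The tradeoff can be illustrated with a minimal in-memory sketch. The three sample records and the helper names (`emails_from_rows`, `emails_from_columns`, `append_record`) are hypothetical and only mimic the two layouts; they say nothing about how a real system lays data out on disk.

```python
# Hypothetical three-record table; attribute names follow the slide.
records = [
    {"last": "Smith", "first": "Ann", "email": "ann@example.com"},
    {"last": "Jones", "first": "Bob", "email": "bob@example.com"},
    {"last": "Brown", "first": "Cy", "email": "cy@example.com"},
]

# Row store: one entry per record, all attributes stored together.
row_store = [(r["last"], r["first"], r["email"]) for r in records]

# Column store: one list per attribute, values kept in record order.
column_store = {
    "last": [r["last"] for r in records],
    "first": [r["first"] for r in records],
    "email": [r["email"] for r in records],
}

def emails_from_rows(rows):
    # Every full tuple is touched even though only one attribute is needed.
    return [row[2] for row in rows]

def emails_from_columns(columns):
    # Only the requested column is touched.
    return columns["email"]

def append_record(rows, columns, record):
    # Row store: a single append.
    rows.append((record["last"], record["first"], record["email"]))
    # Column store: one append per column, i.e. one write location per column.
    for name in columns:
        columns[name].append(record[name])
```

Both layouts return the same e-mail list; the difference lies in how much data each one has to touch for a read and how many places it has to touch for a write.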

Column-Stores

• Really good for read-mostly data warehouses
  • Lots of column scans and aggregations
  • Writes tend to be in batch

Column-Store History

[Figure: a table with attributes Last Name, First Name, E-mail, Phone #, Street Address split into separate per-attribute tables]

1970s/80s: Decomposed Storage Model (DSM) / Vertical Partitioning

SSBM Averages

[Chart: SSBM averages for Row-Store, Row-Store (Minimum I/O), and Row-Store (Vertically Partitioned)]

[Diagram labels: Tuple Size; Tuple Header, TID, Column Data]

• Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
• Complete fact table takes up ~4 GB (compressed)
• Vertically partitioned tables take up 0.7-1.1 GB (compressed)

Query Optimizer

• Each column is a separate table
• Each attribute accessed in a query becomes a join
• Optimizer becomes hopelessly overwhelmed

Tuple ID problem easy to solve

[Figure: Last Name, First Name, and E-mail columns stored without tuple IDs]

• Store columns in original tuple order, remove Tuple IDs
• Store tuple header in separate column
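Why tuple IDs become unnecessary can be sketched as follows. The sample values and the `reconstruct` helper are hypothetical. The point is only that when every column keeps the original tuple order, the position in the column plays the role of the tuple ID.

```python
# Vertically partitioned layout: every single-column table repeats the tuple ID.
last_with_tid = [(1, "Smith"), (2, "Jones"), (3, "Brown")]
email_with_tid = [(1, "ann@example.com"), (2, "bob@example.com"), (3, "cy@example.com")]

# Column layout: same tuple order in every column, tuple IDs dropped.
last = [value for _, value in sorted(last_with_tid)]
email = [value for _, value in sorted(email_with_tid)]

def reconstruct(position, *columns):
    """Stitch one tuple back together; `position` is a 0-based list offset."""
    return tuple(column[position] for column in columns)
```

For example, `reconstruct(1, last, email)` yields the second tuple's values without any join on an explicit ID.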

Heuristics To Solve Optimizer Problem

• For each query, figure out what attributes will be accessed
• For each table, first join all accessed attributes into normal row-store tuples for that table
• Then proceed with regular row-store optimization and execution
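The three steps above can be sketched in a few lines. This is an illustration under assumptions, not the implementation used in the experiments: the `employee` values and the `build_rows` helper are hypothetical, and the query is the deck's own `SELECT AVG(salary) FROM Employee WHERE Age > 50`.

```python
# Hypothetical Employee table stored as one list per column.
employee = {
    "id": [1, 2, 3],
    "age": [55, 40, 61],
    "salary": [100, 80, 120],
    "dept": [10, 20, 10],
}

def build_rows(table_columns, accessed):
    """Step 2: stitch only the accessed columns of one table into row tuples."""
    return list(zip(*(table_columns[name] for name in accessed)))

# Step 1: the attributes the query touches.
accessed = ["age", "salary"]

# Step 2: build ordinary row tuples up front.
rows = build_rows(employee, accessed)

# Step 3: regular row-at-a-time execution over those tuples.
qualifying = [salary for age, salary in rows if age > 50]
avg_salary = sum(qualifying) / len(qualifying)
```

The optimizer never sees one join per attribute, because each table is presented to it as a single narrow row-store table.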

Result After These Simple Fixes

[Chart: results for Row-Store, Row-Store (Minimum I/O), Row-Store (Vertically Partitioned), and Column-Store (Storage Layer)]

Final Result After Dissertation

[Chart: results for Row-Store, Row-Store (Minimum I/O), Row-Store (Vertically Partitioned), Column-Store (Storage Layer), and Column-Store (Complete); annotation: "Central Focus of Dissertation"]

Overview Of Dissertation

[Diagram: Employee and Department tables with column labels ID, Age, Sal, Dept.ID, Loc., Size; a user query processed by Select, Aggregate, and Construct operators]

User query:

SELECT AVG(salary)
FROM Employee
WHERE Age > 50

• Architecture: Storage Layer, Data Flow, High Performance Data Access and Iteration
• Build Compression into Executor
• Carefully Placed Construction Operators
• Benchmark Performance on SSBM
• Benchmark Performance on RDF

Compression

• Trades I/O for CPU
• Increased column-store opportunities:
  • Higher data value locality in column stores
  • More techniques useful
  • Can use extra space to store multiple copies of data in different sort orders

Column-Oriented Compression

Run-Length Encoding

Quarter: Q1 Q1 Q1 Q1 Q1 Q1 Q1 ... Q2 Q2 Q2 Q2 ... becomes
(Q1, 1, 300)
(Q2, 301, 350)
(Q3, 651, 500)
(Q4, 1151, 600)

Day: 1 1 1 1 1 2 2 ... becomes
(1, 1, 5)
(2, 6, 2)
...
(1, 301, 3)
(2, 304, 1)

Dictionary Encoding

Category: Electronics, Clothing, Electronics, Food, Food, Food, Clothing, Electronics, Electronics, Clothing, Electronics becomes
1 2 1 3 3 3 2 1 1 2 1

Bit-Vector Encoding

Category, one bit-string per value:
Electronics: 1010000 … 1101
Clothing: 0100001 … 0010
Food: 0001110 … 0000

[Figure label: AND/OR]
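The three encodings can be sketched as below, using values from the slide's examples. The function names are hypothetical. The reading of each run-length triple as (value, 1-based start position, run length) is an assumption, although it is consistent with the slide's numbers (301 + 350 = 651, and so on).

```python
def rle_encode(values):
    """Return (value, start_position, run_length) triples; positions are 1-based."""
    runs = []
    for pos, v in enumerate(values, start=1):
        if runs and runs[-1][0] == v:
            value, start, length = runs[-1]
            runs[-1] = (value, start, length + 1)  # extend the current run
        else:
            runs.append((v, pos, 1))               # start a new run
    return runs

def dictionary_encode(values):
    """Assign codes 1, 2, 3, ... in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
        encoded.append(codes[v])
    return codes, encoded

def bitvector_encode(values):
    """Return one bit-string per distinct value; bit i is '1' where values[i] matches."""
    return {
        v: "".join("1" if x == v else "0" for x in values)
        for v in dict.fromkeys(values)  # distinct values in first-seen order
    }

quarter = ["Q1"] * 300 + ["Q2"] * 350 + ["Q3"] * 500 + ["Q4"] * 600
category = ["Electronics", "Clothing", "Electronics", "Food", "Food", "Food", "Clothing"]

quarter_runs = rle_encode(quarter)
category_codes, category_encoded = dictionary_encode(category)
category_bits = bitvector_encode(category)
```

Run-length encoding pays off on sorted or low-cardinality columns with long runs, while dictionary and bit-vector encoding suit columns with few distinct values regardless of order.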

Other Compression Techniques

• Null Suppression
• Differential or Frame of Reference
• Huffman
• Arithmetic
• Lempel-Ziv

Operating Directly on Compressed Data

• I/O-CPU tradeoff is no longer a tradeoff
• Reduces memory–CPU bandwidth requirements
• Opens up possibility of operating on multiple records at once

Example query:

SELECT ProductID, COUNT(*)
FROM table
WHERE (Quarter = Q2)
GROUP BY ProductID

Quarter column (run-length encoded):
(Q1, 1, 300)
(Q2, 301, 6)
(Q3, 307, 500)
(Q4, 807, 600)

Index Lookup + Offset jump: positions 301-306

Product ID column (bit-vector encoded):
10011001000
00100010010
01000100101

Result (ProductID, COUNT(*)):
(1, 3)
(2, 1)
(3, 2)
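A minimal sketch of how this query could be answered without decompressing either column, using the slide's run-length triples and bit-strings. Two things are assumptions here: that the three bit-strings belong to product IDs 1, 2, and 3 in that order, and that their first bit corresponds to position 301. Under those assumptions the counts match the slide's result. The helper names are hypothetical.

```python
# Quarter column as (value, 1-based start position, run length) triples.
quarter_rle = [("Q1", 1, 300), ("Q2", 301, 6), ("Q3", 307, 500), ("Q4", 807, 600)]

# Product ID bit-strings from the slide.
# Assumption: bit 0 of each string corresponds to position BITS_START.
product_bits = {1: "10011001000", 2: "00100010010", 3: "01000100101"}
BITS_START = 301

def positions_for(runs, wanted):
    """Find the run for `wanted` and return its (first, last) position range."""
    for value, start, length in runs:
        if value == wanted:
            return start, start + length - 1
    return None

def count_by_product(bitvectors, first, last, bits_start):
    """Count set bits per product inside the position range [first, last]."""
    lo = first - bits_start
    hi = last - bits_start + 1
    return {pid: bits[lo:hi].count("1") for pid, bits in bitvectors.items()}

# WHERE Quarter = Q2: one lookup over four triples instead of a scan over rows.
first, last = positions_for(quarter_rle, "Q2")

# GROUP BY ProductID, COUNT(*): count bits in the selected slice of each vector.
counts = count_by_product(product_bits, first, last, BITS_START)
```

The predicate is evaluated once per run rather than once per row, and the aggregate is a bit count over a slice, which is what operating on multiple records at once amounts to in this example.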

Block API

isOneValue()
isValueSorted()
isPosContiguous()
isSparse()

getNext()
decompressIntoArray()
getValueAtPosition(pos)

applyPredicate(pred)
getSize()

Operators: Data Source Operator, Selection Operator, Aggregation Operator
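One way to picture the Block API is a class that wraps a single run of a run-length-encoded column. The method names come from the slide. Their semantics, return types, and the `RLEBlock` class itself are assumptions made for illustration, not C-Store's actual interface.

```python
class RLEBlock:
    """One run: `value` repeated `length` times starting at 1-based position `start`."""

    def __init__(self, value, start, length):
        self.value = value
        self.start = start
        self.length = length
        self._cursor = 0

    # Properties that an operator could inspect to pick a fast path (assumed meanings).
    def isOneValue(self):
        return True      # a run holds a single distinct value

    def isValueSorted(self):
        return True      # trivially sorted, since all values are equal

    def isPosContiguous(self):
        return True      # a run covers a contiguous position range

    def isSparse(self):
        return False     # assumption: every position in the run is present

    def getSize(self):
        return self.length

    # Iterator-style and random access.
    def getNext(self):
        """Return the next (position, value) pair, or None when exhausted."""
        if self._cursor >= self.length:
            return None
        pair = (self.start + self._cursor, self.value)
        self._cursor += 1
        return pair

    def decompressIntoArray(self):
        return [self.value] * self.length

    def getValueAtPosition(self, pos):
        if not self.start <= pos < self.start + self.length:
            raise IndexError(pos)
        return self.value

    def applyPredicate(self, pred):
        """Evaluate `pred` once for the whole run; return the matching range or None."""
        if pred(self.value):
            return (self.start, self.start + self.length - 1)
        return None
```

With such an interface an operator can ask `isOneValue()` and then call `applyPredicate` once per block, falling back to `getNext()` or `decompressIntoArray()` only when no shortcut applies.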

When should tuples be constructed?

• At some point in the query plan, tuples must be constructed ("materialization")
  • Early materialization = create rows at beginning of query plan
  • Late materialization = operate on columns as long as possible

When should tuples be constructed?

• Solution 1: Create rows first (EM). But:
  • Need to construct ALL tuples
  • Need to decompress data
  • Poor memory bandwidth utilization

QUERY:

SELECT custID, SUM(price)
FROM table
WHERE (prodID = 4) AND (storeID = 1)
GROUP BY custID

[Figure: EM plan over columns prodID, storeID, custID, price, with one DataSource operator per column, a Construct operator, and Select + Aggregate; the sample data includes the run-length triple (4,1,4)]

Solution 2: Operate on columns

[Figure sequence: the same query evaluated step by step on columns, with DataSource operators for prodID, storeID, custID, and price]
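The two strategies can be contrasted on the slide's query (`SELECT custID, SUM(price) ... WHERE prodID = 4 AND storeID = 1 GROUP BY custID`). The column values below are hypothetical, and both functions are simplified sketches that ignore compression and block-at-a-time processing.

```python
from collections import defaultdict

# Hypothetical column values; list index = tuple position.
prodID = [4, 4, 4, 4, 2]
storeID = [2, 1, 3, 1, 1]
custID = [3, 3, 3, 9, 9]
price = [13, 42, 42, 38, 80]

def early_materialization():
    """Construct ALL tuples first, then select and aggregate row by row."""
    rows = zip(prodID, storeID, custID, price)
    totals = defaultdict(int)
    for p, s, c, amount in rows:
        if p == 4 and s == 1:
            totals[c] += amount
    return dict(totals)

def late_materialization():
    """Filter each predicate column on its own, then fetch only matching positions."""
    pos_prod = {i for i, p in enumerate(prodID) if p == 4}
    pos_store = {i for i, s in enumerate(storeID) if s == 1}
    matching = sorted(pos_prod & pos_store)  # intersect the position lists
    totals = defaultdict(int)
    for i in matching:
        # custID and price are read only at the positions that survived both filters.
        totals[custID[i]] += price[i]
    return dict(totals)

em_result = early_materialization()
lm_result = late_materialization()
```

Both plans return the same answer. The late-materialization plan touches `custID` and `price` at two positions instead of five, and the gap grows as the predicates become more selective.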

Materialization Chapter

• Describes how to implement LM and EM plans in a column-store (C-Store)
• Explores the tradeoffs between the two types of plans using:
  • An analytical model
  • Experimental results
• Presents heuristics to help a plan generator choose between EM and LM plans

Summary of Contributions

• Introduced a query executor that:
  • Is optimized for column-oriented data layout
  • Is designed for compressed data
  • Constructs tuples at the right time
• Bunch of other innovations
  • Invisible Join
  • Optional block processing
  • Pseudo-iterative execution pipelining
• Orders of magnitude performance gains
  • Factor of 9-10 faster than naïve column-store
  • 1-2 orders of magnitude faster than row-store

For More, See:

• C-Store: A Column-Oriented DBMS (VLDB 2005)
• Integrating Compression and Execution in Column-Oriented Database Systems (SIGMOD 2006)
• Performance Tradeoffs in Read-Optimized Databases (VLDB 2006)
• Materialization Strategies in a Column-Oriented DBMS (ICDE 2007)
• Column Stores for Wide and Sparse Data (CIDR 2007)
• Scalable Semantic Web Data Management Using Vertical Partitioning (VLDB 2007)
• Column-Stores vs. Row-Stores: How Different Are They Really? (SIGMOD 2008)
• SW-Store: A Vertically Partitioned DBMS for Semantic Web Data Management (VLDBJ 2009)

Come Join the Yale DB Group!

More Plugs

• Blog:
  • http://dbmsmusings.blogspot.com/
• Twitter:
  • @abadid
• New Research Project:
  • HadoopDB (see VLDB paper)
