Cloud Computing Lecture Column Store – alternative organization for big relational data


C-store
- C-store is read-optimized, for OLAP-type applications
- A traditional DBMS is write-optimized (optimized for online transactions) and based on records (rows)

C-Store
- What are the major cost-sensitive factors in query processing?
  - Size of the database
  - Index or no index
  - Joins
- Current hardware configuration and what a DBMS can do with it
  - Cheap storage: allows a distributed, redundant data store
  - Fast CPUs: make compression/decompression affordable
  - Limited disk bandwidth: reduce I/O

C-store
- Supports OLAP (online analytical processing) operations
  - Optimized read operations
  - Balanced write performance
- Addresses the conflict between writes and reads
  - Fast write: append records
  - Fast read: indexed, compressed
- Think: if data is organized in columns, what are the unique challenges (compared with row organization)?

C-store’s features
- Column-based store saves space
  - Compression is possible
  - Index size is smaller
- Multiple projections
  - Allow multiple indices
  - Parallel processing on the same attributes
  - Materialized join results
- Separation of writeable store and read-optimized store
  - Both writes and reads are optimized
  - Transactions are not blocked by write locks

Data model
- Same as the relational data model
  - Tables, rows, columns
  - Primary keys and foreign keys
- Projections
  - From a single table
  - From multiple joined tables

Example
- Normal relational model:
    EMP(name, age, dept, salary)
    DEPT(dname, floor)
- Possible C-store model:
    EMP1(name, age)
    EMP2(dept, age, DEPT.floor)
    EMP3(name, salary)
    DEPT1(dname, floor)
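
The example projections can be sketched in a few lines of Python. This is only an illustration: the sample rows, the helper name `project`, and the choice of sort keys are assumptions, not part of the lecture. It shows EMP1 as a single-table projection and EMP2 as a projection that materializes the EMP-DEPT join.

```python
# Logical tables from the example; the sample rows are invented for illustration.
EMP = [
    {"name": "Ann", "age": 31, "dept": "Sales", "salary": 5200},
    {"name": "Bob", "age": 25, "dept": "R&D", "salary": 6100},
    {"name": "Cid", "age": 47, "dept": "Sales", "salary": 7000},
]
DEPT = [
    {"dname": "Sales", "floor": 1},
    {"dname": "R&D", "floor": 2},
]


def project(rows, columns, sort_key):
    """Return a column-organized projection: one list per column, ordered by sort_key."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    return {c: [r[c] for r in ordered] for c in columns}


# EMP1(name, age), sorted on age (the sort key is an assumption).
emp1 = project(EMP, ["name", "age"], sort_key="age")

# EMP2(dept, age, DEPT.floor): a projection over the join of EMP and DEPT.
floor_of = {d["dname"]: d["floor"] for d in DEPT}
joined = [
    {"dept": e["dept"], "age": e["age"], "DEPT.floor": floor_of[e["dept"]]}
    for e in EMP
]
emp2 = project(joined, ["dept", "age", "DEPT.floor"], sort_key="DEPT.floor")
```

Each projection stores its columns as separate lists, so a query that needs only `age` touches one list instead of whole rows.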

Physical projection organization
- Sort key
  - Each projection has one
  - Rows are ordered by the sort key
  - Partitioned into segments by key range
- Linking columns in the same projection
  - Storage key: (segment id, key), where the key is the offset in the segment
- Linking projections
  - To reconstruct a table
  - Join index
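
A minimal sketch of how a join index might link two projections that are sorted differently. The sample data, the single-segment simplification, and the function names are assumptions. Matching rows by `name` is used only to build the toy index and presumes names are unique.

```python
# Two projections of the same hypothetical EMP table, each in its own sort order.
# Within a projection, the i-th entry of every column belongs to the same row,
# so the offset acts as the storage key inside a segment.
emp1 = {"name": ["Bob", "Ann", "Cid"], "age": [25, 31, 47]}           # sorted on age
emp3 = {"name": ["Ann", "Bob", "Cid"], "salary": [5200, 6100, 7000]}  # sorted on name

SEGMENT_ID = 0  # assumption: one segment per projection, for brevity


def build_join_index(src, dst, key):
    """For each row of src, record (segment id, offset) of the matching row in dst."""
    offset_in_dst = {value: i for i, value in enumerate(dst[key])}
    return [(SEGMENT_ID, offset_in_dst[value]) for value in src[key]]


def fetch_via_join_index(dst, join_index, column):
    """Read a column of dst in src order by following the join index."""
    return [dst[column][offset] for _segment, offset in join_index]


join_index = build_join_index(emp1, emp3, key="name")
salary_in_emp1_order = fetch_via_join_index(emp3, join_index, "salary")

# Reconstructed rows (name, age, salary) in EMP1's sort order.
rows = list(zip(emp1["name"], emp1["age"], salary_in_emp1_order))
```

The join index has one entry per row of the source projection, which is why reconstructing a table across many projections can be costly.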

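Conceptual organization
[Figure: two projections (Projection 1 and Projection 2), each divided into segments by sort-key range, with the sort key column marked; a join index of (segment id, offset) entries links the two projections.]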

Architectural consideration between writes and reads
- Reads often need indices to speed up
- Writes are often index-unfriendly: indices must be updated frequently
- Use a “read store” and a “write store”

Read store: column encoding
- Use compression schemes and indices
- Self-order (key), few distinct values
  - Encoded as (value, position, # items)
  - Indexed by clustered B-tree
- Foreign-order (non-key), few distinct values
  - Encoded as (value, bitmap index)
  - B-tree index: position values
- Self-order, many distinct values
  - Delta from the previous value
  - B-tree index
- Foreign-order, many distinct values
  - Unencoded
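
The three encodings above can be sketched as follows. This is an illustrative sketch, not C-store's actual storage format: positions are 0-based, bitmaps are plain Python lists, and the function names are assumptions.

```python
def rle_encode(column):
    """Self-order, few distinct values: (value, first position, # items) triples."""
    runs = []
    for pos, value in enumerate(column):
        if runs and runs[-1][0] == value:
            v, start, count = runs[-1]
            runs[-1] = (v, start, count + 1)   # extend the current run
        else:
            runs.append((value, pos, 1))       # start a new run
    return runs


def bitmap_encode(column):
    """Foreign-order, few distinct values: one bitmap per distinct value."""
    bitmaps = {}
    for pos, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[pos] = 1
    return bitmaps


def delta_encode(column):
    """Self-order, many distinct values: first value, then deltas from the previous value."""
    if not column:
        return []
    return [column[0]] + [b - a for a, b in zip(column, column[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

For example, `rle_encode([1, 1, 1, 2, 2, 5])` yields three triples instead of six values, and the saving grows with run length, which is why this encoding suits a sorted column with few distinct values.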

Write store
- Same structure, but explicitly uses (segment, key) to identify records
  - Easier to maintain the mapping
  - Only concerns the inserted records
- Tuple mover
  - Copies batches of records to the read store (RS)
- Deleting a record
  - Mark it on RS
  - Purged by the tuple mover

Tuple mover
- Moves records from the write store (WS) to the read store (RS)
- Happens between read-only transactions
- Uses a merge-out process
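
A rough sketch of what a merge-out step could look like: a sorted read-store segment is merged with a batch from the write store into a new sorted segment, and records marked as deleted are dropped along the way. The function name, the row representation, and identifying deleted records by sort-key value are all simplifying assumptions.

```python
import heapq


def merge_out(rs_rows, ws_rows, deleted_keys, sort_key):
    """Merge a sorted RS segment with a WS batch into a new sorted RS segment.

    rs_rows:      list of dicts already ordered by sort_key (read store)
    ws_rows:      list of dicts in insertion order (write store)
    deleted_keys: sort-key values of RS records marked as deleted
    """
    # Purge records that were marked deleted on RS.
    live_rs = [r for r in rs_rows if r[sort_key] not in deleted_keys]
    # WS holds records in arrival order, so sort the batch first.
    batch = sorted(ws_rows, key=lambda r: r[sort_key])
    # Merge two sorted runs into one new sorted segment.
    new_rs = list(heapq.merge(live_rs, batch, key=lambda r: r[sort_key]))
    return new_rs, []  # new RS segment, emptied WS


rs = [{"id": 1}, {"id": 4}, {"id": 9}]
ws = [{"id": 7}, {"id": 2}]
new_rs, new_ws = merge_out(rs, ws, deleted_keys={4}, sort_key="id")
```

Writing a fresh segment instead of updating the old one in place keeps the read store sorted and compressed without random writes.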

How to solve read/write conflict
- Situation: one transaction updates record X while another transaction reads X
- Use snapshot isolation
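
The idea behind snapshot isolation can be sketched with versioned records: an update never overwrites a record, it marks the old version as deleted at some epoch and appends a new one, and a reader only sees versions that were valid at the epoch when it started. The class and method names below are hypothetical and the epoch counter is a simplification.

```python
class VersionedStore:
    """Toy multi-version store; not the actual C-store implementation."""

    def __init__(self):
        self.epoch = 0
        # Each version: [key, value, inserted_at, deleted_at or None]
        self.versions = []

    def write(self, key, value):
        """Update = mark the current version deleted, then append a new version."""
        self.epoch += 1
        for version in self.versions:
            if version[0] == key and version[3] is None:
                version[3] = self.epoch
        self.versions.append([key, value, self.epoch, None])

    def snapshot(self):
        """A read-only transaction remembers the epoch at which it started."""
        return self.epoch

    def read(self, key, as_of):
        """Return the value visible at epoch as_of, ignoring later writes."""
        for k, value, inserted_at, deleted_at in self.versions:
            visible = inserted_at <= as_of and (deleted_at is None or deleted_at > as_of)
            if k == key and visible:
                return value
        return None


store = VersionedStore()
store.write("X", 1)
reader_epoch = store.snapshot()   # the reader starts here
store.write("X", 2)               # a concurrent update to X
old_view = store.read("X", reader_epoch)
new_view = store.read("X", store.snapshot())
```

The reader keeps seeing the value of X from its own snapshot, so it needs no read lock and does not block the writer.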

Benefits in query processing
- Selection: has more indices to use
- Projection: some “projections” are already defined
- Join: some projections are materialized joins
- Aggregation: works on the required columns only

Evaluation
- Uses TPC-H (decision support queries)
- Storage
- Query performance
- Query performance when the row store uses materialized views

Summary: the performance gain
- Column representation: avoids reads of unused attributes
- Storing overlapping projections: multiple orderings of a column, more choices for query optimization
- Compression of data: more orderings of a column in the same amount of space
- Query operators operate on the compressed representation
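
To illustrate operating on a compressed representation, the sketch below computes aggregates directly over run-length-encoded (value, position, # items) triples without expanding them. The triple layout follows the read-store encoding described earlier. The function names and sample runs are assumptions.

```python
def rle_sum(runs):
    """SUM over a run-length-encoded column: one multiplication per run."""
    return sum(value * count for value, _start, count in runs)


def rle_count_where(runs, predicate):
    """COUNT of rows satisfying predicate, evaluated once per run instead of once per row."""
    return sum(count for value, _start, count in runs if predicate(value))


# Stands for the uncompressed column [10]*4 + [20]*2 + [35]*3.
runs = [(10, 0, 4), (20, 4, 2), (35, 6, 3)]

total = rle_sum(runs)
at_least_20 = rle_count_where(runs, lambda v: v >= 20)
```

The work is proportional to the number of runs rather than the number of rows, so the operator benefits from compression in CPU time as well as in I/O.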
