
Efficient and Flexible Information Retrieval Using

MonetDB/X100

Sándor Héman, CWI, Amsterdam
Marcin Zukowski, Arjen de Vries, Peter Boncz

January 8, 2007

Background

Process query-intensive workloads over large datasets efficiently within a DBMS

Application Areas:
- Information Retrieval
- Data mining
- Scientific data analysis

MonetDB/X100 Highlights

- Vectorized query engine
- Transparent, light-weight compression

Keyword Search

Inverted index: TD(termid, docid, score)

TopN(
  Project(
    MergeJoin(
      RangeSelect(TD1=TD, TD1.termid=10),
      RangeSelect(TD2=TD, TD2.termid=42),
      TD1.docid = TD2.docid),
    [docid = TD1.docid, score = TD1.scoreQ + TD2.scoreQ]),
  [score DESC], 20)


Vectorized Execution [CIDR’05]

Volcano based iterator pipeline

Each next() call returns a collection of column-vectors of tuples:
- Amortize overheads
- Introduce parallelism
- Stay in CPU cache
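The overhead amortization can be sketched as a vectorized primitive: one function call processes a whole column-vector, so the tight loop pipelines well and the per-tuple interpretation cost disappears. A minimal illustration in C (assumed names, not X100 source):

```c
/* Vectors are sized so they stay resident in the CPU cache;
 * 100-1000 values per vector, per the slides. */
enum { VECTOR_SIZE = 1000 };

/* A vectorized primitive: one call computes a whole vector.
 * The branch-free loop over arrays is loop-pipelinable and easy
 * for the compiler to auto-vectorize. */
void map_add_float(int n, const float *a, const float *b, float *out) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + b[i];   /* no function call per tuple */
}
```

Contrast with tuple-at-a-time Volcano, where every value pays a virtual next() call and interpretation overhead.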

Vectors

Light-Weight Compression

Compressed buffer-manager pages:
- Increase effective I/O bandwidth
- Increase buffer-manager capacity

Favor speed over compression ratio:
- CPU-efficient algorithms (>1 GB/s decompression speed)
- Minimize main-memory overhead

RAM-CPU Cache decompression

Naïve Decompression:
1. Read and decompress page
2. Write back to RAM
3. Read for processing

RAM-Cache Decompression:
1. Read and decompress page at vector granularity, on-demand
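The on-demand scheme can be sketched as follows (an assumption-laden illustration, not X100 code: `decode_one` stands in for whatever per-value decoder the page uses). Instead of decompressing the whole page back to RAM and re-reading it, each next() decompresses just one vector's worth of values into a small cache-resident buffer:

```c
/* Hypothetical per-value decoder for a compressed page slot. */
static int decode_one(int v) { return v + 1; }

/* Decompress values [offset, offset+n) of a compressed page
 * directly into `vec`, a small buffer that stays in the CPU
 * cache, skipping the write-back-to-RAM round trip. */
void next_vector(const int *page, int offset, int n, int *vec) {
    for (int i = 0; i < n; i++)
        vec[i] = decode_one(page[offset + i]);
}
```

Decompressed data thus never leaves the cache before being processed, which is how the >1 GB/s decompression speeds pay off.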

2006 TREC TeraByte Track: X100 compared to custom IR systems
(the other systems prune their index)

System          #CPUs  P@20  Throughput (q/s)  Throughput/CPU
X100               16  0.47               186              13
X100                1  0.47                13              13
Wumpus              1  0.41                77              77
MPI                 2  0.43                34              17
Melbourne Univ      1  0.49                18              18

Thanks!

MonetDB/X100 in Action

Corpus: 25M text documents, 427 GB
docid + score columns: 28 GB uncompressed, 9 GB compressed

Hardware: 3 GHz Intel Xeon, 4 GB RAM, 10-disk RAID (350 MB/s)

MonetDB/X100 [CIDR’05]

Vector-at-a-time instead of tuple-at-a-time Volcano

Vector = Array of Values (100-1000)

Vectorized Primitives:
- Array computations
- Loop-pipelinable, very fast
- Less function-call overhead

Vectors are Cache Resident

RAM considered secondary storage


Vector Size vs Execution Time

Compression

docid: PFOR-DELTA
- Encode deltas as a b-bit offset from an arbitrary base value: deltas within [base, base + 2^b) get encoded
- Deltas outside that range are stored as uncompressed exceptions

score: Okapi -> quantize -> PFOR compress
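The PFOR-DELTA encoding step can be sketched as below (a simplified illustration under stated assumptions, not the X100 implementation: real PFOR bit-packs the b-bit codes, whereas this sketch keeps one code per array slot and a separate exception flag):

```c
#define B     3            /* bits per code word (example value) */
#define RANGE (1 << B)     /* 2^b = 8 encodable offsets */

/* Delta-encode a docid list against `base`. Deltas in
 * [base, base + 2^b) become small b-bit offsets; the rest are
 * stored as uncompressed exceptions. Returns the exception count. */
int pfor_delta_encode(int n, const int *docid, int base,
                      int *code, int *is_exc, int *exc) {
    int nexc = 0, prev = 0;
    for (int i = 0; i < n; i++) {
        int delta = docid[i] - prev;   /* gap to previous docid */
        prev = docid[i];
        if (delta >= base && delta < base + RANGE) {
            code[i] = delta - base;    /* fits in b bits */
            is_exc[i] = 0;
        } else {
            exc[nexc++] = delta;       /* uncompressed exception */
            is_exc[i] = 1;
        }
    }
    return nexc;
}
```

Choosing `base` and `b` per block lets most gaps fit the range, keeping the exception list short.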

Compressed Block Layout

- Forward-growing section of bit-packed b-bit code words
- Backwards-growing exception list

Naïve Decompression

Mark exception positions with a special code word (written MARK below; the original slide uses a symbol):

for (i = 0; i < n; i++) {
    if (in[i] == MARK) {
        out[i] = exc[--j];        /* take next value off the exception list */
    } else {
        out[i] = DECODE(in[i]);   /* decode b-bit code word */
    }
}

Patched Decompression

Link exceptions into a patch-list: the code-word slot at each exception position stores the distance to the next exception, so the decode loop needs no per-value branch.

Decode:
for (i = 0; i < n; i++) {
    out[i] = DECODE(in[i]);
}

Patch:
for (i = first_exc; i < n; i += in[i]) {
    out[i] = exc[--j];
}
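Putting the two loops together gives a runnable sketch (illustrative assumptions: real PFOR bit-packs the code words, and `DECODE` here is a made-up stand-in that adds a base of 1; the backwards-growing exception list matches the block layout above):

```c
#define DECODE(v) ((v) + 1)   /* hypothetical decode: offset + base, base = 1 */

/* Patched decoding: pass 1 decodes every slot branch-free (the
 * values at exception positions are garbage at this point); pass 2
 * walks the patch-list, where in[i] at an exception position holds
 * the distance to the next exception, and overwrites them. */
void patched_decode(int n, const int *in, const int *exc, int nexc,
                    int first_exc, int *out) {
    for (int i = 0; i < n; i++)
        out[i] = DECODE(in[i]);          /* decode loop: no branches */

    int j = nexc;                        /* exceptions stored backwards */
    for (int i = first_exc; i < n; i += in[i])
        out[i] = exc[--j];               /* patch loop */
}
```

Since exceptions are rare, the patch loop touches few positions, and the hot decode loop stays branch-free and loop-pipelinable.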

Patch Bandwidth
