Next-Generation Databases

Miguel Branco on behalf of the RAW team


Trends

• More complex hardware
  – Multicores, GPUs, Cloud, NUMA*, PoP+SoC**, …
• More complex questions
  – “Last month sales” “Next month sales”
• More complex apps
  – Distributed, Service-oriented, Rack-aware, ...
• More data analysts
  – Ease of use, Interactivity, Collaboration, ...
• More data
  – Volume, File Formats, ...

* Non-uniform memory architectures
** Package on Package, System on a Chip


• No data loading
  – No “physical” data copy: support existing file formats
• No database tuning
  – Instead, self-tuned based on actual usage patterns
• Not restricted to tables
  – Add support for trees, vectors, matrices, …
• Not just SQL
  – Instead, enable domain-specific languages


Traditional Database

Data adapts to the query engine

[Diagram labels: DBMS; SQL; CSV, XML, JSON]


RAW

Query engine adapts to the data

[Diagram labels: DBMS; SQL, RAW lang, “DSL”; CSV, XML, JSON]


How RAW adapts to data

Example inputs: a CSV file (… containing “good” run numbers) and a ROOT file (… containing physics events).

    SELECT event.jet…
    FROM csv, root
    WHERE csv.RunNumber = root.RunNumber
      AND root.EF_2mu13 == TRUE
      AND …

[Query plan diagram labels: join, filter, scan csv, scan root]

• Code Generate the Access Paths
• Code Generate the Query
• Build Position and Data Caches

Adapt to format, file instance and query just-in-time
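The slide names position caches without saying what they store. As a rough illustration only, the sketch below assumes a position cache that remembers the byte offset where each row of a newline-delimited file starts, so that a later query can jump to a row instead of re-scanning from the beginning. The names buildPositionCache and rowAt are hypothetical and are not taken from RAW.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical position cache: one byte offset per row of a
// newline-delimited buffer, collected during a first full scan.
std::vector<std::size_t> buildPositionCache(const std::string& data) {
    std::vector<std::size_t> rowStarts;
    std::size_t pos = 0;
    while (pos < data.size()) {
        rowStarts.push_back(pos);
        const std::size_t newline = data.find('\n', pos);
        if (newline == std::string::npos) break;  // last row has no trailing newline
        pos = newline + 1;
    }
    return rowStarts;
}

// Fetches row i (0-based) by jumping to its cached offset.
// Throws std::out_of_range if the row is not in the cache.
std::string rowAt(const std::string& data,
                  const std::vector<std::size_t>& rowStarts,
                  std::size_t i) {
    const std::size_t begin = rowStarts.at(i);
    std::size_t end = data.find('\n', begin);
    if (end == std::string::npos) end = data.size();
    return data.substr(begin, end - begin);
}
```

The cache costs one extra pass the first time a file is touched; whether RAW builds it eagerly or as a side effect of the first query is not stated on the slide.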


Adapting to schema & query

GENERAL-PURPOSE [CSV input]

    col:
      if col needed:
        if col isInt
          readInt();
        if col isFloat
          readFloat();
        if ...
      else:
        skipField();

JUST-IN-TIME

    readInt();
    readInt();
    skipField();
    readFloat();
    skipRestLine();

Remove overhead of generic operators
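To make the contrast concrete, here is a minimal C++ sketch of the two styles for one CSV row. It is not RAW's generated code: the Column descriptor, the function names, and the choice to sum the needed fields are all assumptions made for illustration, and both functions assume well-formed input.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical column descriptor; illustrative only.
enum class ColType { Int, Float };
struct Column { ColType type; bool needed; };

// General-purpose path: every field re-checks "needed?" and the column type.
double sumGeneric(const std::string& line, const std::vector<Column>& schema) {
    double sum = 0.0;
    std::size_t pos = 0;
    for (const Column& col : schema) {
        if (pos > line.size()) break;              // row has fewer fields than the schema
        std::size_t end = line.find(',', pos);
        if (end == std::string::npos) end = line.size();
        if (col.needed) {
            const std::string field = line.substr(pos, end - pos);
            if (col.type == ColType::Int) sum += std::stol(field);
            else                          sum += std::stod(field);
        }                                          // else: skipField()
        pos = end + 1;
    }
    return sum;
}

// What a just-in-time generator could emit for the schema
// (int, int, skipped, float, rest ignored): no per-field branching on metadata.
double sumSpecialized(const char* p) {
    char* next = nullptr;
    double sum = 0.0;
    sum += std::strtol(p, &next, 10);              // readInt()
    p = next + 1;                                  // step over ','
    sum += std::strtol(p, &next, 10);              // readInt()
    p = next + 1;
    while (*p != ',' && *p != '\0') ++p;           // skipField()
    if (*p == ',') ++p;
    sum += std::strtod(p, &next);                  // readFloat()
    return sum;                                    // skipRestLine(): rest is never parsed
}
```

Both functions compute the same value for a matching row; the specialized one simply has the schema decisions baked in, which is the overhead the slide refers to.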


Adapting to format

• Unroll Columns
    col: if col needed: if col isInt ...
    → readInt(); skipField(); readFloat(); skipRest();
• Free navigation in files
  – fieldLength: 10
  – tupleLength: 100
  – Need fields 2 & 5 of 2nd row
    → moveTo(110); readInt(); moveTo(140); readFloat();
• Embedded indexes/existing APIs
  – Bitmaps, R-Trees etc.
  – readNextField() vs. readField(filename,id)
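The offsets in the fixed-width example follow from simple arithmetic: a field starts at (row - 1) * tupleLength + (field - 1) * fieldLength. The sketch below works this out in C++. FixedWidthLayout, fieldOffset and readIntAt are hypothetical names used only for illustration, and the file is modelled as an in-memory string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical description of a fixed-width file; illustrative only.
struct FixedWidthLayout {
    std::size_t fieldLength;  // bytes per field
    std::size_t tupleLength;  // bytes per row (tuple)
};

// Byte offset of a field, with rows and fields numbered from 1 as on the slide.
std::size_t fieldOffset(const FixedWidthLayout& layout,
                        std::size_t row, std::size_t field) {
    return (row - 1) * layout.tupleLength + (field - 1) * layout.fieldLength;
}

// Reads one integer field directly at its offset, without scanning earlier bytes.
// Assumes the field holds digits padded with spaces.
long readIntAt(const std::string& file, const FixedWidthLayout& layout,
               std::size_t row, std::size_t field) {
    return std::stol(file.substr(fieldOffset(layout, row, field), layout.fieldLength));
}
```

With fieldLength 10 and tupleLength 100, fieldOffset gives 110 and 140 for fields 2 and 5 of the 2nd row, matching the moveTo calls on the slide.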


HEP analysis: Data

ROOT - C++

    class Event {
      class Muon {
        float pt, eta;
        …
      };
      class Electron {
        float pt, eta;
        …
      };
      class Jet {
        float pt, eta;
        …
      };
      int runNumber;
      vector<Muon> muons;
      vector<Electron> electrons;
      vector<Jet> jets;
    };

RAW

    Event:    eventID INT, runNumber INT
    Muon:     eventID INT, eta FLOAT, pt FLOAT
    Electron: eventID INT, eta FLOAT, pt FLOAT
    Jet:      eventID INT, eta FLOAT, pt FLOAT
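One way to read the two representations is that each nested vector in the C++ Event becomes its own table keyed by eventID. The sketch below shows that mapping for muons only. It is an assumption about how the views relate, not RAW's implementation: the eventID member on Event, the MuonRow struct and flattenMuons are hypothetical, and RAW is described elsewhere in the deck as avoiding a physical copy, which this sketch does not attempt.

```cpp
#include <vector>

// Simplified stand-ins for the nested ROOT-style classes.
// eventID is assumed here; the slide's class only shows runNumber.
struct Muon { float pt, eta; };
struct Event {
    int eventID;
    int runNumber;
    std::vector<Muon> muons;
};

// Hypothetical flat row matching the Muon table (eventID, eta, pt).
struct MuonRow { int eventID; float eta; float pt; };

// Produces one row per muon, tagged with the event it belongs to.
std::vector<MuonRow> flattenMuons(const std::vector<Event>& events) {
    std::vector<MuonRow> rows;
    for (const Event& event : events) {
        for (const Muon& muon : event.muons) {
            rows.push_back({event.eventID, muon.eta, muon.pt});
        }
    }
    return rows;
}
```

Events without muons contribute no rows, which mirrors how a join between the Event and Muon tables would behave.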


HEP analysis: Queries

“Identify events of interest → Filter out background events → Plot aggregated results in a histogram”

RAW

    SELECT event
    FROM root:/data1/ATLAS/*.root, csv:/data1/ATLAS/events.csv
    WHERE (csv.id = event.id AND event.EF_e24vhi_medium1 OR
           event.EF_e60_medium1 OR event.EF_2e12Tvh_loose1 OR
           event.EF_mu24i_tight OR event.EF_mu36_tight OR event.EF_2mu13)
      AND event.muon.mu_ptcone20 < 0.1 * event.muon.mu_pt
      AND event.muon.mu_pt > 20000.
      AND ABS(event.muon.mu_eta) < 2.4
      AND …..

ROOT - C++ (1000+ lines of C++)

    for (unsigned int imuon = 0; imuon < ((*curr_entries)[jentry].mu_pt)->size(); imuon++) {
      if (((*curr_entries)[jentry].mu_ptcone20)->at(imuon) < 0.1 * ((*curr_entries)[jentry].mu_pt)->at(imuon) &&
          ((*curr_entries)[jentry].mu_pt)->at(imuon) > 20000. &&
          fabs(((*curr_entries)[jentry].mu_eta)->at(imuon)) < 2.4 &&
          …
    }
    ...


RAW vs. the ROOT framework

[Xeon CPU E7-28867 @ 2.13GHz, 1TB HDD - 7200RPM, 192GB RAM]

[Chart: Execution Time (sec), axis ticks from 100 to 10000000, for Query 1 (Cold) and Query 2 to Query 6; series: RAW, ROOT]

ROOT: 900 GB in 127 files

CSV: 1 “table” of IDs

Declarative queries + up to 90x improvement


RAW for High-Energy Physics

• End-users:
  – Performance (JIT, codegen, vectorwise, …)
  – Easy-to-use (declarative) query language
• Infrastructure Providers:
  – Data kept in original location & file format
  – Declarative query language → More optimization opportunities
• “Event” caches

http://dias.epfl.ch/RAW

Thank You!