SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

Outline

Why science folks are unhappy with RDBMS

How we plan to fix that

The details

Why SciDB?

“Big science” very unhappy with RDBMS

Astronomy

HEP

Fusion

Bio

Remote sensing

Why?

Experience of Sequoia 2000 (mid 1990s)

Tried to use Postgres for science databases

Failed badly…

Main science data type is an array – horribly inefficient to simulate arrays on top of tables

Required features absent (provenance, uncertainty, version control)

SQL operations wrong (regrid – not join)
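The cost of simulating arrays on top of tables can be made concrete with a minimal Python sketch. It is not taken from the talk: the names `table_rows`, `dense`, `read_window_table`, and `read_window_dense` are illustrative assumptions. The same 2-D array is stored once as relational-style (i, j, value) rows and once as a flat dense buffer, and the same rectangular window is read from both.

```python
# Hypothetical illustration: one 2-D array, two physical representations.
ROWS, COLS = 4, 5

# Relational simulation: one tuple per cell, with the coordinates stored explicitly.
table_rows = [(i, j, i * COLS + j) for i in range(ROWS) for j in range(COLS)]

# Native array: values only, in row-major order; coordinates are implicit.
dense = [i * COLS + j for i in range(ROWS) for j in range(COLS)]


def read_window_table(rows, i0, i1, j0, j1):
    """Select a window with a filter over every row (a full scan without an index)."""
    hits = [(i, j, v) for (i, j, v) in rows if i0 <= i < i1 and j0 <= j < j1]
    hits.sort()  # a table has no inherent order, so the result must be re-sorted
    return [v for (_, _, v) in hits]


def read_window_dense(buf, ncols, i0, i1, j0, j1):
    """Select the same window by offset arithmetic, touching only the cells needed."""
    out = []
    for i in range(i0, i1):
        start = i * ncols + j0
        out.extend(buf[start:start + (j1 - j0)])
    return out


window_from_table = read_window_table(table_rows, 1, 3, 2, 4)
window_from_dense = read_window_dense(dense, COLS, 1, 3, 2, 4)
```

Both calls return the same four values, but the table form stores two extra integers per cell and inspects all twenty rows to find them.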

Why SciDB?

Net result

Mentality of “roll your own from the ground up” for every new science project

Realization by the science community that this is long-term suicide

Community wants to get behind something better

Great commonality of needs among domains

A Little Context

XLDB-1: genesis of the need

Asilomar conference (March 2008)

Small conference to generate requirements

A Little Context

March 2008 – September 2008

Initial design completed

Fund raising

Recruiting of initial team

Detailed use cases specified

Our Partnership

Science and high-end commercial folks

Who will put up some resources

And review design

DBMS brain trust

Who will design the system, oversee its construction, and perform needed research

Non-profit company

Which will manage the open source project

And support the resulting system

May need long term funding help

Partners – Science (We are recruiting more…)

LSST astronomy project

DBMS work co-ordinated by SLAC

Pacific Northwest National Laboratory (PNNL)

Various bio projects

Lawrence Livermore National Laboratory

Fusion projects

UCSB

Remote sensing

Partners -- DBMS

Mike Stonebraker (MIT)

Dave DeWitt (Wisconsin -> Microsoft)

Jignesh Patel (Wisconsin)

Jennifer Widom (Stanford)

Dave Maier (Portland State)

Stan Zdonik (Brown)

Sam Madden (MIT)

Ugur Cetintemel (Brown)

Magda Balazinska (Washington)

Mike Carey (UCI)

Partners -- Other

E-Bay

Vertica

Microsoft

LSST

SLAC

Will hit up NSF and DOE

The SciDB Data Model

Nothing (e.g. Hadoop, Pig, Hive, …)?

Most of you have schemas

Hadoop is not a good starting point

Slow

No HA

The SciDB Data Model

Tables?

Makes a few of you happy

Used by Sloan Sky Survey

But

PanStarrs (Alex Szalay) wants arrays and scalability

The SciDB Data Model

Arrays?

Superset of tables (a table with a primary key is a 1-D array)

Makes HEP, remote sensing, astronomy, oceanography folks happy

But

Not biology and chemistry (who want networks and sequences)
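The point that a table with a primary key is a 1-D array can be sketched in Python. This is an illustration under assumptions, not SciDB code: the sample rows, the column names, and the helpers `table_to_array` and `lookup` are invented here, and the key is assumed to be a contiguous range of positive integers.

```python
# Hypothetical rows of a table whose primary key is the integer column "id".
rows = [
    {"id": 1, "name": "alpha", "flux": 1.5},
    {"id": 2, "name": "beta", "flux": 2.5},
    {"id": 3, "name": "delta", "flux": 0.5},
]


def table_to_array(rows, key):
    """View a keyed table as a 1-D array: the key becomes the dimension,
    and the remaining columns become the tuple stored in each cell."""
    low = min(r[key] for r in rows)
    high = max(r[key] for r in rows)
    cells = [None] * (high - low + 1)
    for r in rows:
        cells[r[key] - low] = tuple(v for k, v in r.items() if k != key)
    return low, cells


low, cells = table_to_array(rows, "id")


def lookup(key_value):
    """Primary-key lookup becomes plain array indexing."""
    return cells[key_value - low]
```

Here `lookup(2)` indexes straight into the cell for key 2; no search over the rows is needed.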

The SciDB Data Model

Multidimensional grids

Superset of arrays (non-uniform cells)

Makes solid modeling folks happy

But

Complex and slow

SciDB Data Model

Nested multidimensional arrays

Array values are a tuple of values and arrays

Sightings (sid, details) [x, y, z, t]

Objects (type, [sid]) [id]
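One way to read the two declarations above is sketched below in Python. The sample data and the helper `sightings_of` are assumptions made for illustration, and dicts keyed by coordinates stand in for real arrays. `Sightings` is a 4-D array whose cells hold a (sid, details) tuple. `Objects` is a 1-D array whose cells hold a scalar plus a nested 1-D array of sids.

```python
# Sightings (sid, details) [x, y, z, t]: a 4-D array of (sid, details) tuples.
# A dict keyed by coordinates stands in for the array (an assumption for brevity).
sightings = {
    (10, 20, 0, 1): (101, "faint point source"),
    (10, 21, 0, 2): (102, "same source, brighter"),
    (55, 7, 3, 2): (103, "moving object"),
}

# Objects (type, [sid]) [id]: each cell holds a scalar and a nested 1-D array of sids.
objects = {
    1: ("star", [101, 102]),
    2: ("asteroid", [103]),
}


def sightings_of(object_id):
    """Follow the nested sid array of one object back to its sighting cells."""
    _type, sids = objects[object_id]
    wanted = set(sids)
    return sorted(coord for coord, (sid, _details) in sightings.items() if sid in wanted)


star_cells = sightings_of(1)
```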

Basic Arrays

Positive integer dimensions, no gaps

Bounded or unbounded

Enhanced Arrays

“Shape” function

Supports irregular boundary
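The talk does not spell out what a shape function looks like, so the following Python sketch is only one plausible reading. A hypothetical `shape(i)` returns the upper bound of the second dimension for each value of the first, which gives the array a triangular boundary instead of a rectangular one. Indices are 0-based here for convenience.

```python
def shape(i):
    """Hypothetical shape function: the inclusive upper bound of j for a given i.
    The valid region is a lower triangle, so row i holds columns 0..i."""
    return i


N = 4  # bound on the first dimension


def in_bounds(i, j):
    """A cell exists only if it lies inside the irregular boundary."""
    return 0 <= i < N and 0 <= j <= shape(i)


# Store only the cells inside the boundary.
triangle = {(i, j): i * 10 + j for i in range(N) for j in range(shape(i) + 1)}


def get(i, j):
    """Read a cell, rejecting coordinates outside the array's shape."""
    if not in_bounds(i, j):
        raise IndexError(f"cell ({i}, {j}) is outside the array's shape")
    return triangle[(i, j)]
```

Only 10 of the 16 cells in the bounding rectangle are stored, and a read outside the boundary raises an error rather than returning a filler value.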

Enhanced Arrays

Co-ordinate systems

User defined functions that map integers to something else

E.g. Mercator

Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}
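The two access notations can be mimicked in Python as a rough sketch. Everything here is an assumption for illustration: the class `CoordinateArray`, its `at` method (standing in for the brace notation), and the evenly spaced `linear_grid` mapping, which is a stand-in for a real projection such as Mercator.

```python
class CoordinateArray:
    """A 2-D array with integer indices plus a user-defined coordinate mapping."""

    def __init__(self, nrows, ncols, to_index):
        self.ncols = ncols
        self.cells = [0.0] * (nrows * ncols)
        self.to_index = to_index  # user-defined: coordinates -> integer indices

    def __getitem__(self, idx):
        # Integer access, like A[17, 36] on the slide.
        i, j = idx
        return self.cells[i * self.ncols + j]

    def __setitem__(self, idx, value):
        i, j = idx
        self.cells[i * self.ncols + j] = value

    def at(self, x, y):
        # Coordinate access, like A{468.2, 917.6} on the slide.
        return self[self.to_index(x, y)]


def linear_grid(x, y, origin=(400.0, 900.0), step=0.5):
    """Hypothetical coordinate system: an evenly spaced grid, not a real Mercator projection."""
    return (round((x - origin[0]) / step), round((y - origin[1]) / step))


A = CoordinateArray(200, 100, linear_grid)
A[136, 35] = 42.0           # write by integer index
value = A.at(468.2, 917.6)  # read the same cell by coordinate
```

The storage only ever sees integers; swapping in a different mapping function changes the coordinate system without touching the stored cells.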

SciDB Query Language

“Parse-tree” representation of array operations

With a “binding” to:

MatLab

C++

Python

IDL

There may be more….

User extendable operations (Postgres-style)
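A parse-tree representation with a host-language binding might look roughly like the Python sketch below. The node types `Scan`, `Filter`, and `Apply` and the `evaluate` function are hypothetical and are not SciDB's actual operators or API. The idea shown is only that a binding builds a tree of array operations and hands it to an evaluator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scan:
    """Leaf node: read a named array."""
    name: str


@dataclass(frozen=True)
class Filter:
    """Keep cells whose value satisfies a predicate."""
    child: object
    predicate: Callable[[float], bool]


@dataclass(frozen=True)
class Apply:
    """Apply a function to every cell value."""
    child: object
    fn: Callable[[float], float]


def evaluate(node, arrays):
    """Walk the tree bottom-up. Arrays are dicts from coordinate tuples to values."""
    if isinstance(node, Scan):
        return dict(arrays[node.name])
    if isinstance(node, Filter):
        cells = evaluate(node.child, arrays)
        return {c: v for c, v in cells.items() if node.predicate(v)}
    if isinstance(node, Apply):
        cells = evaluate(node.child, arrays)
        return {c: node.fn(v) for c, v in cells.items()}
    raise TypeError(f"unknown node type: {type(node).__name__}")


# A language binding would build this tree; here it is constructed by hand.
tree = Apply(Filter(Scan("temps"), lambda v: v > 0.0), lambda v: v * 2.0)
arrays = {"temps": {(0, 0): -1.0, (0, 1): 2.5, (1, 0): 4.0}}
result = evaluate(tree, arrays)
```

A user-defined operation would be one more node type plus one more branch in the evaluator, which is the Postgres-style extensibility the slide refers to.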

Operations

Standard relational ones (filter, join)

Plus whatever you want (regrid, interpolate, Fourier transform, eigenvalues, …)

Plus add your own (Postgres-style)

We need science input here!!!
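As an example of an array operation with no natural SQL equivalent, here is a minimal Python sketch of a regrid. The talk does not define the operation, so this assumes the simplest interpretation: coarsening a 2-D grid by averaging fixed-size blocks. The function name and signature are illustrative.

```python
def regrid(grid, factor):
    """Coarsen a 2-D grid by averaging each factor-by-factor block of cells.
    Assumes both dimensions are exact multiples of `factor`."""
    nrows, ncols = len(grid), len(grid[0])
    if nrows % factor or ncols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for bi in range(0, nrows, factor):
        row = []
        for bj in range(0, ncols, factor):
            block = [grid[i][j]
                     for i in range(bi, bi + factor)
                     for j in range(bj, bj + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# Hypothetical 4x4 input, reduced to 2x2.
fine = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 8.0, 8.0],
    [2.0, 2.0, 8.0, 8.0],
]
coarse = regrid(fine, 2)
```

The operation depends on which cells are neighbours, something a relational join has to reconstruct from explicit coordinate columns.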

Environment and Storage

Extendable grid (cloud) of Linux machines

With built-in high availability and failover

And built-in disaster recovery

In Situ Processing

Operate on data without loading it

Supported by a SciDB self-describing file format

And some number of adaptors, e.g. HDF-5, NetCDF

Or write your own

Storage Model

Arrays are “chunked” in storage

Chunk size can vary

Chunks are partitioned across the grid

Go for scalability to petabytes
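Chunking and partitioning can be sketched in a few lines of Python. The fixed `CHUNK_SHAPE`, the node names, and the hash-based placement in `node_of` are all assumptions made for illustration. The talk says chunk sizes can vary and does not describe the placement scheme.

```python
import zlib

CHUNK_SHAPE = (1000, 1000)  # assumed fixed chunk size; the talk notes real sizes can vary
NODES = ["node0", "node1", "node2", "node3"]  # hypothetical grid of machines


def chunk_of(i, j):
    """Chunk coordinates that contain cell (i, j)."""
    return (i // CHUNK_SHAPE[0], j // CHUNK_SHAPE[1])


def offset_in_chunk(i, j):
    """Position of cell (i, j) inside its chunk."""
    return (i % CHUNK_SHAPE[0], j % CHUNK_SHAPE[1])


def node_of(chunk):
    """Assign a chunk to a node with a stable hash (one possible partitioning scheme)."""
    key = f"{chunk[0]},{chunk[1]}".encode()
    return NODES[zlib.crc32(key) % len(NODES)]


cell = (2500, 999)
chunk = chunk_of(*cell)
placement = node_of(chunk)
```

Because placement depends only on the chunk coordinates, any machine can work out where a cell lives without consulting a central directory.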

Other Features Which Science Guys Want

(These could be in RDBMS, but aren’t)

Uncertainty

Data has error bars

Which must be carried along in the computation (interval arithmetic)

Will look at more sophisticated error models later
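Carrying error bars through a computation with interval arithmetic can be sketched as follows. The `Interval` class and the `measured` helper are illustrative assumptions, not part of SciDB. The sketch also ignores floating-point rounding, which a careful implementation would round outward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A value known only to lie between lo and hi (its error bars)."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # The smallest difference pairs our low end with the other's high end.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # With mixed signs any corner can be the extreme, so check all four.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


def measured(value, error):
    """Build an interval from a measurement and a symmetric error bar."""
    return Interval(value - error, value + error)


a = measured(10.0, 1.0)   # 10 plus or minus 1
b = measured(4.0, 0.5)    # 4 plus or minus 0.5
total = a + b
difference = a - b
product = a * b
```

Every result is itself an interval, so the uncertainty travels with the data through each operation.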

Other Features

Provenance (lineage)

What calibration generated the data

What was the “cooking” algorithm

In general – repeatability of data derivation

Supported by a command log with query facilities (interesting research problem)

And redo
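A command log with lineage queries and redo could look something like this Python sketch. The `CommandLog` class and the `calibrate` and `cook` steps are hypothetical stand-ins; the talk itself describes the query facilities as an open research problem.

```python
class CommandLog:
    """Append-only log of the commands that derived each array (hypothetical design)."""

    def __init__(self):
        self.entries = []  # (output_name, operation_name, function, input_names)

    def run(self, store, output, op_name, fn, inputs):
        """Execute a command, store its result, and record how it was derived."""
        store[output] = fn(*(store[name] for name in inputs))
        self.entries.append((output, op_name, fn, tuple(inputs)))

    def lineage(self, name):
        """Operations that produced `name`, earliest first (raw inputs have none)."""
        for output, op_name, _fn, inputs in reversed(self.entries):
            if output == name:
                steps = []
                for parent in inputs:
                    steps.extend(self.lineage(parent))
                return steps + [op_name]
        return []

    def redo(self, raw_inputs):
        """Replay every logged command against (possibly corrected) raw inputs."""
        store = dict(raw_inputs)
        for output, _op_name, fn, inputs in self.entries:
            store[output] = fn(*(store[name] for name in inputs))
        return store


def calibrate(raw):
    return [v - 0.5 for v in raw]        # stand-in for a calibration step


def cook(calibrated):
    return [v * 2.0 for v in calibrated]  # stand-in for a "cooking" algorithm


store = {"raw": [1.5, 2.5, 3.5]}
log = CommandLog()
log.run(store, "calibrated", "calibrate", calibrate, ["raw"])
log.run(store, "cooked", "cook", cook, ["calibrated"])

history = log.lineage("cooked")
replayed = log.redo({"raw": [1.5, 2.5, 3.5]})
```

Replaying the log against the same raw input reproduces the derived data, which is the repeatability the slide asks for.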

Other Features

Time travel

Don’t fix errors by overwrite

I.e. keep all of the data

Supported by an extra array dimension (history)

Spatial support

Named versions

Recalibration usually handled this way

Supported by allocating an array for the new version and “diffing” against its parent
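Both mechanisms on this slide can be sketched in Python under stated assumptions. `HistoryArray` models time travel by adding a history index to every write instead of overwriting. `NamedVersion` models a recalibrated version that stores only the cells that differ from its parent. Both classes are illustrative and are not SciDB's implementation.

```python
class HistoryArray:
    """Updates never overwrite: each write adds a cell along a history dimension."""

    def __init__(self):
        self.cells = {}   # (coord, history_index) -> value
        self.clock = 0    # next index along the history dimension

    def write(self, coord, value):
        self.cells[(coord, self.clock)] = value
        self.clock += 1
        return self.clock - 1

    def read(self, coord, as_of=None):
        """Latest value at `coord` written at or before history index `as_of`."""
        limit = self.clock - 1 if as_of is None else as_of
        times = [t for (c, t) in self.cells if c == coord and t <= limit]
        if not times:
            raise KeyError(coord)
        return self.cells[(coord, max(times))]


class NamedVersion:
    """A version stores only the cells that differ from its parent."""

    def __init__(self, parent):
        self.parent = parent  # dict of coord -> value, or another NamedVersion
        self.delta = {}

    def write(self, coord, value):
        self.delta[coord] = value

    def read(self, coord):
        if coord in self.delta:
            return self.delta[coord]
        parent = self.parent
        return parent.read(coord) if isinstance(parent, NamedVersion) else parent[coord]


h = HistoryArray()
t0 = h.write((3, 4), 1.0)
t1 = h.write((3, 4), 1.2)   # a correction; the old value is kept

base = {(0, 0): 10.0, (0, 1): 20.0}
recalibrated = NamedVersion(base)
recalibrated.write((0, 1), 19.5)
```

Reading with `as_of=t0` still returns the original value, and the recalibrated version costs one stored cell while the parent stays untouched.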

Other Features

(Optionally) integration of the real time data capture system

“cooking” inside DBMS

Makes provenance capture easier

Sometimes important

Time Line

Q4/08 start company, begin research activities

Late 2009

Demoware available

Late 2010

V1 ships

Project Organization (Build it for real)

CEO (Andy Palmer -- Vertica)

Project management (Bobbi Heath -- Vertica)

CTO (Stonebraker)

Project Organization (Design and Research)

Overall co-ordination (Stonebraker, DeWitt)

Storage and execution (Madden, Cetintemel)

Query layer and semantics (Zdonik, Maier)

Provenance (Widom, Patel)

Resource management (Balazinska)

Language bindings (Carey)

SciDB Has a Good Chance at Success

Community realizes shared infrastructure is good

“Lighthouse” customers

Strong team

Computation goes inside the DBMS

Easier to share

And reuse

How Can You Help?

Get involved!!!!
