SciDB An Open Source Data Base Project by Michael Stonebraker (and others)


Outline

Why science folks are unhappy with RDBMS

How we plan to fix that

The details

Why SciDB?

“Big science” very unhappy with RDBMS

Astronomy

Fusion

Remote sensing

Why?

Experience of Sequoia 2000 (mid 1990s)

Tried to use Postgres for science databases

Failed badly…

Main science data type is an array – horribly inefficient to simulate arrays on top of tables

Required features absent (provenance, uncertainty, version control)

SQL operations wrong (regrid – not join)

Why SciDB?

Net result

Mentality of “roll your own from the ground up” for every new science project

Realization by the science community that this is long-term suicide

Community wants to get behind something better

Great commonality of needs among domains

A Little Context

XLDB-1 Genesis of the need

Asilomar conference (March 2008)

Small conference to generate requirements

A Little Context

March 2008 – September 2008

Initial design completed

Fund raising

Recruiting of initial team

Detailed use cases specified

Our Partnership

Science and high-end commercial folks

Who will put up some resources

And review design

DBMS brain trust

Who will design the system, oversee its construction, and perform needed research

Non-profit company

Which will manage the open source project

And support the resulting system

May need long term funding help

Partners – Science (We are recruiting more…)

LSST astronomy project

DBMS work co-ordinated by SLAC

Pacific Northwest National Laboratory (PNNL)

Various bio projects

Lawrence Livermore National Laboratory

Fusion projects

Remote sensing

Partners -- DBMS

Mike Stonebraker (MIT)

Dave DeWitt (Wisconsin -> Microsoft)

Jignesh Patel (Wisconsin)

Jennifer Widom (Stanford)

Dave Maier (Portland State)

Stan Zdonik (Brown)

Sam Madden (MIT)

Ugur Cetintemel (Brown)

Magda Balazinska (Washington)

Mike Carey (UCI)

Partners -- Other

E-Bay

Vertica

Microsoft

LSST

SLAC

Will hit up NSF and DOE

The SciDB Data Model

Nothing (e.g. Hadoop, Pig, Hive, …)?

Most of you have schemas

Hadoop is not a good starting point

Slow

No HA

Tables?

Makes a few of you happy

Used by Sloan Sky Survey

But

PanStarrs (Alex Szalay) wants arrays and scalability

Arrays?

Superset of tables (tables with a primary key are a 1-D array)

Makes HEP, remote sensing, astronomy, oceanography folks happy

But

Not biology and chemistry (who want networks and sequences)
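The claim that a table with a primary key is a 1-D array can be illustrated with a minimal Python sketch. The table contents and the `table_to_array` helper are hypothetical, invented here for illustration; they are not part of SciDB.

```python
# Hypothetical table of sky objects with an integer primary key "id".
rows = [
    {"id": 0, "type": "star", "mag": 4.2},
    {"id": 1, "type": "galaxy", "mag": 11.7},
    {"id": 2, "type": "star", "mag": 6.1},
]


def table_to_array(rows, key):
    """Place each row at the array index given by its primary key.

    Python lists are zero-based, so the keys here start at 0.
    """
    size = max(r[key] for r in rows) + 1
    array = [None] * size
    for r in rows:
        # The key becomes the dimension; the other columns become the
        # attributes stored in that cell.
        array[r[key]] = {k: v for k, v in r.items() if k != key}
    return array


objects = table_to_array(rows, "id")
print(objects[1])
```

A lookup by primary key becomes a plain index into the array, which is the sense in which arrays subsume keyed tables.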

Multidimensional grids

Superset of arrays (non-uniform cells)

Makes solid modeling folks happy

But

Complex and slow

SciDB Data Model

Nested multidimensional arrays

Array values are a tuple of values and arrays

Sightings (sid, details) [x, y, z, t]

Objects (type, [sid]) [id]
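One possible reading of the two declarations above, sketched in Python: Sightings is a 4-D array whose cells hold (sid, details), and Objects is a 1-D array whose cells hold a type plus a nested array of sighting ids. The class names, sample values, and the `sightings_of` helper are assumptions made for illustration, not SciDB syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Sighting:
    sid: int
    details: str


@dataclass
class SkyObject:
    type: str
    sids: list = field(default_factory=list)  # nested 1-D array of sighting ids


# Sightings (sid, details) [x, y, z, t]: a sparse 4-D array keyed by coordinates.
sightings = {
    (10, 20, 0, 1): Sighting(sid=100, details="faint"),
    (10, 21, 0, 2): Sighting(sid=101, details="bright"),
}

# Objects (type, [sid]) [id]: a 1-D array whose cells contain a nested array.
objects = {
    1: SkyObject(type="star", sids=[100, 101]),
}


def sightings_of(obj_id):
    """Return the coordinates of every sighting attached to one object."""
    wanted = set(objects[obj_id].sids)
    return sorted(coords for coords, s in sightings.items() if s.sid in wanted)


print(sightings_of(1))
```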

Basic Arrays

Positive integer dimensions, no gaps

Bounded or unbounded

Enhanced Arrays

“Shape” function

Supports irregular boundary
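A shape function can be thought of as a predicate that says which cells exist. The sketch below is only an illustration under that assumption; `ShapedArray` and `upper_triangle` are hypothetical names, and the slides do not specify how SciDB represents shapes.

```python
def upper_triangle(i, j, n=4):
    """Hypothetical shape function: cell (i, j) exists only when 1 <= i <= j <= n."""
    return 1 <= i <= j <= n


class ShapedArray:
    """Sparse array that rejects any access outside its shape function."""

    def __init__(self, shape_fn):
        self.shape_fn = shape_fn
        self.cells = {}

    def __setitem__(self, idx, value):
        if not self.shape_fn(*idx):
            raise IndexError(f"{idx} is outside the array boundary")
        self.cells[idx] = value

    def __getitem__(self, idx):
        if not self.shape_fn(*idx):
            raise IndexError(f"{idx} is outside the array boundary")
        return self.cells.get(idx)  # None for a valid but empty cell


a = ShapedArray(upper_triangle)
a[1, 2] = 3.5
print(a[1, 2])
```

Reading or writing `a[3, 1]` raises `IndexError`, because that cell lies below the diagonal and so outside the irregular boundary.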

Enhanced Arrays

Co-ordinate systems

User defined functions that map integers to something else

E.g. mercator

Use dimension notation to access, e.g. A[17,36] or A{468.2, 917.6}
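The two access forms can be sketched in Python as follows. Python has no `A{...}` syntax, so an `at` method stands in for it. The `CoordArray` class and the simple affine mapping (origin plus index times step) are assumptions chosen for brevity; a real Mercator mapping would be more involved, and the coordinate values below are not the ones on the slide.

```python
class CoordArray:
    """Array addressable by integer index or by user-defined coordinates."""

    def __init__(self, to_index):
        self.to_index = to_index  # user-defined function: coordinates -> integer indices
        self.cells = {}

    def __setitem__(self, idx, value):
        self.cells[idx] = value

    def __getitem__(self, idx):
        # Integer access, analogous to A[17,36].
        return self.cells[idx]

    def at(self, *coords):
        # Coordinate access, analogous to A{...}.
        return self.cells[self.to_index(*coords)]


def affine(x, y):
    """Hypothetical coordinate system: x = 100 + 0.5*i, y = 200 + 0.5*j."""
    return (round((x - 100.0) / 0.5), round((y - 200.0) / 0.5))


A = CoordArray(affine)
A[17, 36] = 42.0
print(A.at(108.5, 218.0))
```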

SciDB Query Language

“Parse-tree” representation of array operations

With a “binding” to:

MatLab

Python

There may be more….

User extendable operations (Postgres-style)
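To make the parse-tree idea concrete, here is a minimal sketch of how a Python binding might build a tree of array operations and hand it to an evaluator. The node classes (`Scan`, `Filter`, `Apply`) and the `evaluate` function are hypothetical; the slides do not define the actual tree format or binding API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scan:
    name: str  # leaf node: read a stored array by name


@dataclass
class Filter:
    child: object
    predicate: Callable  # keep cells whose value satisfies the predicate


@dataclass
class Apply:
    child: object
    fn: Callable  # user-supplied function applied to each cell value


def evaluate(node, arrays):
    """Walk the parse tree bottom-up and return the resulting sparse array."""
    if isinstance(node, Scan):
        return dict(arrays[node.name])
    if isinstance(node, Filter):
        cells = evaluate(node.child, arrays)
        return {i: v for i, v in cells.items() if node.predicate(v)}
    if isinstance(node, Apply):
        cells = evaluate(node.child, arrays)
        return {i: node.fn(v) for i, v in cells.items()}
    raise TypeError(f"unknown node type: {type(node).__name__}")


arrays = {"A": {(1, 1): 2.0, (1, 2): -1.0, (2, 1): 4.0}}
tree = Apply(Filter(Scan("A"), lambda v: v > 0), lambda v: v * 10)
result = evaluate(tree, arrays)
print(result)
```

Because the tree is plain data, another host language could build the same structure, and a user-defined operation amounts to one more node type plus its evaluation rule.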

Operations

Standard relational ones (filter, join)

Plus whatever you want (regrid, interpolate, fourier transform, eigenvalues, …)

Plus add your own (Postgres-style)

We need science input here!!!
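As an example of an array operation with no natural SQL equivalent, the sketch below coarsens a 2-D grid by averaging non-overlapping blocks. This is one common meaning of "regrid" and is an assumption here; the slides do not define the operator's exact semantics.

```python
def regrid(grid, factor):
    """Coarsen a 2-D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [
                grid[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


fine = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse = regrid(fine, 2)
print(coarse)  # [[3.5, 5.5], [11.5, 13.5]]
```

The operation depends on cell adjacency in index space, which is why it is awkward to express as a join over tables.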

Environment and Storage

Extendable grid (cloud) of Linux machines

With built-in high availability and failover

And built-in disaster recovery

In Situ Processing

Operate on data without loading it

Supported by a SciDB self-describing file format

And some number of adaptors, e.g. HDF-5, NetCDF

Or write your own

Storage Model

Arrays are “chunked” in storage

Chunk size can vary

Chunks are partitioned across the grid

Go for scalability to petabytes
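A minimal sketch of chunking and partitioning, assuming fixed-size rectangular chunks and a simple hash placement across nodes. Both choices are simplifications made for illustration: the slides say chunk size can vary and do not describe the placement policy. The function names are hypothetical.

```python
def chunk_of(cell, chunk_shape):
    """Return the chunk coordinates that contain a given cell."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def node_of(chunk, num_nodes):
    """Assign a chunk to a node with a simple deterministic hash (an assumption)."""
    h = 0
    for c in chunk:
        h = h * 31 + c
    return h % num_nodes


cell = (17, 36)
chunk = chunk_of(cell, chunk_shape=(10, 10))
node = node_of(chunk, num_nodes=4)
print(chunk, node)  # (1, 3) 2
```

Neighbouring cells land in the same chunk, so an operation over a small region touches few chunks, while spreading chunks over nodes lets large scans run in parallel.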

Other Features Which Science Guys Want

(These could be in RDBMS, but aren’t)

Uncertainty

Data has error bars

Which must be carried along in the computation (interval arithmetic)

Will look at more sophisticated error models later
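Carrying error bars through a computation with interval arithmetic can be sketched as below. The `Interval` class is a hypothetical illustration that ignores floating-point rounding direction; it is not SciDB's uncertainty type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @classmethod
    def from_error(cls, value, err):
        """Build an interval from a measurement and its symmetric error bar."""
        return cls(value - err, value + err)

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        ]
        return Interval(min(products), max(products))


a = Interval.from_error(10.0, 0.5)   # 10.0 ± 0.5
b = Interval.from_error(2.0, 0.25)   # 2.0 ± 0.25
print(a + b)  # Interval(lo=11.25, hi=12.75)
print(a * b)  # Interval(lo=16.625, hi=23.625)
```

Every result is guaranteed to contain the true value if the inputs do, at the cost of bounds that can grow pessimistically. That pessimism is one reason to look at more sophisticated error models later.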

Other Features

Provenance (lineage)

What calibration generated the data

What was the “cooking” algorithm

In general – repeatability of data derivation

Supported by a command log with query facilities (interesting research problem)

And redo
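The command-log idea can be sketched as a list of recorded operations that supports two things: querying the lineage of a derived array, and replaying the log to rederive it. The `CommandLog` class, the `OPS` table, and the operation names are hypothetical and stand in for whatever calibration and "cooking" steps a project uses.

```python
# Hypothetical "cooking" operations; real pipelines would register their own.
OPS = {
    "calibrate": lambda data, gain: [x * gain for x in data],
    "clip": lambda data, limit: [min(x, limit) for x in data],
}


class CommandLog:
    def __init__(self):
        self.entries = []

    def run(self, store, op, src, dst, **params):
        """Execute an operation and record it so it can be queried and redone."""
        store[dst] = OPS[op](store[src], **params)
        self.entries.append({"op": op, "src": src, "dst": dst, "params": params})

    def lineage(self, name):
        """Return the commands that produced `name`, oldest first."""
        by_dst = {e["dst"]: e for e in self.entries}
        chain = []
        while name in by_dst:
            chain.append(by_dst[name])
            name = by_dst[name]["src"]
        return list(reversed(chain))

    def redo(self, raw):
        """Replay the whole log against raw inputs to rederive every array."""
        store = dict(raw)
        for e in self.entries:
            store[e["dst"]] = OPS[e["op"]](store[e["src"]], **e["params"])
        return store


store = {"raw": [1.0, 2.0, 3.0]}
log = CommandLog()
log.run(store, "calibrate", "raw", "cal", gain=2.0)
log.run(store, "clip", "cal", "cooked", limit=5.0)
print([e["op"] for e in log.lineage("cooked")])  # ['calibrate', 'clip']
print(log.redo({"raw": [1.0, 2.0, 3.0]})["cooked"])  # [2.0, 4.0, 5.0]
```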

Other Features

Time travel

Don’t fix errors by overwrite

I.e. keep all of the data

Supported by an extra array dimension (history)

Spatial support

Named versions

Recalibration usually handled this way

Supported by allocating an array for the new version and “diffing” against its parent
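The time-travel and named-version bullets above can be sketched together. In this illustration a history dimension is an extra integer in the cell key, and a named version stores only the cells that differ from its parent. `HistoryArray` and `NamedVersion` are hypothetical names, and the storage layout is an assumption rather than SciDB's actual design.

```python
class HistoryArray:
    """Updates append along a history dimension instead of overwriting."""

    def __init__(self):
        self.cells = {}  # (index, history) -> value
        self.h = 0       # current position on the history dimension

    def update(self, idx, value):
        self.h += 1
        self.cells[(idx, self.h)] = value
        return self.h

    def read(self, idx, as_of=None):
        """Read the latest value at or before history position `as_of`."""
        as_of = self.h if as_of is None else as_of
        versions = [h for (i, h) in self.cells if i == idx and h <= as_of]
        return self.cells[(idx, max(versions))] if versions else None


class NamedVersion:
    """A version holds only its differences from the parent version."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.delta = {}

    def __setitem__(self, idx, value):
        self.delta[idx] = value

    def __getitem__(self, idx):
        version = self
        while version is not None:
            if idx in version.delta:
                return version.delta[idx]
            version = version.parent  # fall back to the parent's value
        raise KeyError(idx)


a = HistoryArray()
h1 = a.update((1, 1), 5.0)
h2 = a.update((1, 1), 5.5)      # a correction; the old value is kept
print(a.read((1, 1)), a.read((1, 1), as_of=h1))

base = NamedVersion("base")
base[(1, 1)] = 5.0
base[(1, 2)] = 7.0
recal = NamedVersion("recal", parent=base)
recal[(1, 1)] = 5.2             # only the recalibrated cell is stored
print(recal[(1, 1)], recal[(1, 2)], base[(1, 1)])
```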

Other Features

(Optionally) integration of the real time data capture system

“cooking” inside DBMS

Makes provenance capture easier

Sometimes important

Time Line

Q4/08 start company, begin research activities

Late 2009

Demoware available

Late 2010

V1 ships

Project Organization (Build-it for real)

CEO (Andy Palmer -- Vertica)

Project management (Bobbi Heath -- Vertica)

CTO (Stonebraker)

Project Organization (Design and Research)

Overall co-ordination (Stonebraker, DeWitt)

Storage and execution (Madden, Cetintemel)

Query layer and semantics (Zdonik, Maier)

Provenance (Widom, Patel)

Resource management (Balazinska)

Language bindings (Carey)

SciDB Has a Good Chance at Success

Community realizes shared infrastructure is good

“Lighthouse” customers

Strong team

Computation goes inside the DBMS

Easier to share

And reuse

How Can You Help?

Get involved!!!!
