Scientific Databases: the story behind the scenes Martin Kersten Milena Ivanova M.Kersten Mar 2010 DIR Edinburgh


Scientific Databases: the story behind the scenes

Martin Kersten, Milena Ivanova

M.Kersten Mar 2010 DIR Edinburgh


Departure for a journey

• CWI Database Architecture Group
• Core business:
  • To research efficient and effective database technology
  • To deploy this technology in real-life application settings
  • To disseminate this knowledge as open-source software
• Key research issues:
  • What is the ultimate (virtual) machine architecture and software stack for database processing?


The Big Data Bang


Outline

• Departure for a journey
• Mapping unknown territory
• Crossing the Great Divide
  • Stepping stone 1: Multimedia Dimension
  • Stepping stone 2: Geometric Dimension
  • Stepping stone 3: Lineage Dimension
  • Stepping stone 4: Heterogeneous Databases
  • Stepping stone 5: Semantic Search
  • Stepping stone 6: Wireless sensor databases
  • Stepping stone 7: Distributed Databases
• Arrival and outlook
• SciDB and SciLens ambitions
• Teaming up and making it a success


SkyServer provides public access to SDSS for astronomers, students, and the wider public

A project to make a map of a large part of the Universe

• 230 million object images
• 1 million spectra
• 4 TB catalog data
• 9 TB images


SkyServer Schema

446 columns, >370 million rows

Vertical fragment of 100+ popular columns

Materialized join of Photo and Spectra


Initial exploration


Mapping unknown territory

[Diagram labels: Multimedia Images; Geometric Mapping; Features Space; Annotations; Modelling (Atlas). Application domains: Astronomy, Neuroscience, Geophysics, Biosciences, …]


One size fits all?


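[Figure: database systems placed along two axes. Scale axis: pico scale to mega scale. Data axis: structured, semi-structured, documents, images. Systems shown: Oracle, MS SQL Server, DB2, Vertica, MonetDB, PostgreSQL, MySQL, MariaDB, SQLite, MongoDB, LucidDB, NoSQL]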


We have to weather the storm


Stepping stone 1: Multimedia Dimension

• Storage challenges:
  • Large volumes (>Tbyte, >Pbyte) of raw data
  • Partitioning based on image and video segmentation
  • Indexing based on feature vectors
• Query challenges:
  • Proximity- and probability-based search
  • CPU-intensive, user-defined predicates
  • Content-based information retrieval


Stepping stone 1: Multimedia Dimension

• The database consists of 100,000 images.
• From each image we extract 25 patches.
• For each patch a 14-dimensional feature vector is derived, giving 2,500,000 feature vectors in total.
• Challenge: find similar images based on Euclidean distance with sub-second response time.
• Solution: novel database algorithms to solve k-nearest neighbours (k-NN) search.
• Lesson: start from generative models.
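The k-NN search named on this slide can be illustrated with a brute-force baseline. This is only a sketch for orientation: it scans every vector, whereas the slide refers to novel database algorithms that are not described here. The function names `euclidean` and `knn` are hypothetical.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, vectors, k):
    """Return the k nearest vectors as (distance, index) pairs, nearest first.

    Linear scan over all vectors: O(n * d) work per query. An index or a
    smarter algorithm would be needed for sub-second answers at scale.
    """
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)
```

With 2,500,000 vectors of 14 dimensions, one query in this form touches 35 million values, which shows why a plain scan is a reference point rather than a solution.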


Stepping stone 1: Multimedia Dimension

• Alternative scheme: determine the probability that an image can be generated by a limited number of Gaussian mixture components.

• Fix a small number of components for the Gaussian mixture model (GMM) and use an Expectation-Maximization algorithm to fit the model to the image.

• Search for similar images by comparing the GMM parameters.
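The fitting step can be sketched with a small Expectation-Maximization loop. The version below is an assumption-laden simplification: it works on one-dimensional samples rather than the 14-dimensional patch vectors of the slides, and the names `gauss_pdf` and `fit_gmm` are hypothetical.

```python
import math


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm(xs, means, iters=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with len(means) components by EM.

    `means` holds the initial component means. Returns a list of
    (weight, mean, variance) triples, one per component.
    """
    k = len(means)
    weights = [1.0 / k] * k
    means = list(means)
    variances = [1.0] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in xs:
            p = [w * gauss_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean, and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(min_var, spread)  # guard against collapse
    return list(zip(weights, means, variances))
```

The fitted triples are the model parameters that would be stored per image and compared during search.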


Probabilistic Image Dimension

• Query:

• Which of the models is most likely to generate these 24 samples?
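One way to read this query is as a maximum-likelihood choice among stored models. The sketch below assumes each model is a one-dimensional Gaussian mixture given as (weight, mean, variance) triples; the names `log_likelihood` and `most_likely_model` and the dictionary layout are hypothetical.

```python
import math


def log_likelihood(samples, model):
    """Log-likelihood of 1-D samples under a mixture of (weight, mean, variance)."""
    total = 0.0
    for x in samples:
        p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in model)
        total += math.log(max(p, 1e-300))  # avoid log(0) on underflow
    return total


def most_likely_model(samples, models):
    """Return the key of the model with the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(samples, models[name]))
```

Summing log-densities instead of multiplying densities keeps the comparison numerically stable as the number of samples grows.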


Stepping stone 2: Geometric Dimension

• Any geometric abstraction of reality provides a good navigational map.

• Database storage and indexing support for 2D is mature:
  • R-trees and Quad-trees
  • Commercial database vendors do ‘not like them’

• An open research issue is support for 2D query embedding:
  • Scaling out towards 3 and 4 dimensions and temporal support

• Examples: researched extensively in Geographical Information Systems; Google Maps and OpenGIS are omnipresent.

• Lessons: avoid an abundance of reference models; baroque data structures do not necessarily scale.
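As a reminder of what the 2D index structures mentioned above do, here is a minimal point quad-tree with a rectangular range query. It is a teaching sketch, not the structure used by any particular database system; the class name, the node capacity, and the minimum cell size are assumptions.

```python
class QuadTree:
    """Point quad-tree over the half-open square [x0, x0+size) x [y0, y0+size)."""

    def __init__(self, x0, y0, size, capacity=4):
        self.x0, self.y0, self.size = x0, y0, size
        self.capacity = capacity
        self.points = []
        self.children = None  # four sub-quadrants once this node is split

    def insert(self, x, y):
        inside = (self.x0 <= x < self.x0 + self.size and
                  self.y0 <= y < self.y0 + self.size)
        if not inside:
            return False  # point lies outside this node
        if self.children is None:
            # Very small cells stop splitting so duplicate points cannot recurse forever.
            if len(self.points) < self.capacity or self.size <= 1e-9:
                self.points.append((x, y))
                return True
            self._split()
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        h = self.size / 2
        self.children = [QuadTree(self.x0 + dx, self.y0 + dy, h, self.capacity)
                         for dx in (0, h) for dy in (0, h)]
        for px, py in self.points:
            any(c.insert(px, py) for c in self.children)
        self.points = []

    def query(self, xmin, ymin, xmax, ymax):
        """Return all points with xmin <= x <= xmax and ymin <= y <= ymax."""
        if (xmax < self.x0 or xmin >= self.x0 + self.size or
                ymax < self.y0 or ymin >= self.y0 + self.size):
            return []  # query box does not overlap this node
        if self.children is None:
            return [(x, y) for x, y in self.points
                    if xmin <= x <= xmax and ymin <= y <= ymax]
        found = []
        for c in self.children:
            found.extend(c.query(xmin, ymin, xmax, ymax))
        return found
```

The range query prunes whole quadrants that do not overlap the query box, which is the property that makes such structures attractive in two dimensions and harder to carry over to higher ones.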


Stepping stone 3: Lineage Dimension

• A problem encountered in many scientific databases is ensuring data lineage: the ability to travel back in time to understand, redo, and judge the derivations.

• How to keep track of the complete context?
  • Data, software, parameter settings, …

• How to redo part of the analysis?
• How to store and remember the lineage trails?

• Example: the AstroWise project in Groningen keeps track of a complete workflow for telescope data analysis in a large Oracle database. All derivations are 5-line Python programs.

• Lesson: don’t be afraid of storage cost; be an accountant.
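The bookkeeping asked for above (data, software, parameter settings per derivation) can be sketched as an append-only log. This is an illustration only and does not reflect how AstroWise stores lineage; `fingerprint`, `LineageLog`, and the record fields are hypothetical names.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 digest of any JSON-serialisable value."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class LineageLog:
    """Append-only log of derivations: input, code version, parameters, output."""

    def __init__(self):
        self.records = []

    def derive(self, func, inputs, params, code_version):
        """Run func(inputs, **params) and record what was needed to redo it."""
        output = func(inputs, **params)
        self.records.append({
            "step": func.__name__,
            "code_version": code_version,
            "params": params,
            "input_hash": fingerprint(inputs),
            "output_hash": fingerprint(output),
        })
        return output

    def trail(self, output):
        """Walk back from an output to the ordered records that produced it."""
        wanted, chain = fingerprint(output), []
        for rec in reversed(self.records):
            if rec["output_hash"] == wanted:
                chain.append(rec)
                wanted = rec["input_hash"]
        return list(reversed(chain))
```

Storing hashes keeps each record small; an accountant-style system would also keep the data itself so that any step can be replayed.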


Stepping stone 4: Heterogeneous Databases

• A key problem is to share heterogeneous information:
  • Use a commonly approved vocabulary and standard syntax
  • XML is the language inter-galactica for self-descriptive data and its exchange between software systems
  • RDF claims to be the next king

• The database community was actively working on XML, XQuery, and XUpdate database engines, but it is not easy!
  • Challenges: how to scale to large XML stores? How to efficiently search components? How to realize structural information retrieval?

• The RDF world brings in graph algorithms.

• Lesson: science is done, jewels are captured by bandits.


fBIRN pipeline “big picture”

Database and Informatics Working Group, FBIRN 2005 – David Keator

[Diagram components: MR scanner; scanner- or software-specific file formats; XML-based events file; XML-based image header; image pre-processing; event analysis]


Stepping stone 5: Semantic search

• Ontology integration is one of the most pressing challenges for the semantic web to take off.

• Integration of technology with databases is still immature.

• RDF and OWL are the leading paradigms; SPARQL is an attempt to bridge the gap between traditional database management and semantic web technology.

• Lesson: not a technological issue, but an educational and cultural issue.

• http://e-culture.multimedian.nl/demo/search


Stepping stone 6: Sensor Databases

• Database management functionality can be downscaled to the level of small sensor-enabled devices. They can form ad hoc networks and provide a straightforward SQL interface for aggregation. The focus is on network-based aggregation under severe energy limitations.

• Embedded database systems are not up to the job. Positive case studies include TinyDB on TinyOS (Berkeley).

• The DataCell project at CWI (and Philips) aims to provide a more expressive query language and application interface.
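Network-based aggregation can be pictured as each node merging partial results from its children before forwarding a single small record upstream. The sketch below computes an average this way over a hypothetical routing tree; it does not use the TinyDB or DataCell interfaces, and `aggregate_avg` plus the dictionary layout are assumptions.

```python
def aggregate_avg(tree, readings, node):
    """Return the partial state (sum, count) for the subtree rooted at `node`.

    `tree` maps a node to its list of children; `readings` maps a node to
    its local sensor value. Each node forwards one (sum, count) pair to its
    parent instead of all raw readings, which saves radio traffic.
    """
    total, count = readings[node], 1
    for child in tree.get(node, []):
        child_total, child_count = aggregate_avg(tree, readings, child)
        total += child_total
        count += child_count
    return total, count
```

At the base station the pair is turned into the answer of a query such as `SELECT AVG(temp) FROM sensors` by dividing the sum by the count.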


Research World Perspective

[Diagram labels, on an axis from Past to Future: sensor cluster; mobile; stationary; distributed; sensor net; mobile sensor cluster; integrated management; distributed management; PC-less sensor net; AmbientDB; Semantic Sensors]


Stepping stone 7: MR/DDBMS

• HPC … Grids … Clouds …
  • Grids target high-performance computing, with a focus on Authentication-Authorization-Access and data shipping over wide-area networks.

• Map-reduce technology is a re-invention of re-scaled distributed database technology and distributed programming.

• Data distribution, replication, and parallel query processing have been well studied over the last three decades!

• Lesson: application programmers are infected by “not-written-by-me” hype bacteria.
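The claim that map-reduce revisits distributed database ideas can be made concrete by noting that a map, shuffle, reduce pass has the same shape as a grouped aggregate. The single-process sketch below is illustrative only; `map_reduce` and its callback signatures are hypothetical and no real framework API is implied.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map, shuffle, reduce pass over `records` in a single process.

    `mapper(record)` yields (key, value) pairs; `reducer(key, values)`
    folds all values of one key into a result.
    """
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):   # map phase
            groups[key].append(value)    # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}
```

A mapper that emits `(region, magnitude)` and a reducer that averages the values computes what SQL expresses as `SELECT region, AVG(magnitude) ... GROUP BY region`, a plan shape that parallel database engines have long partitioned across nodes.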


MonetDB in the large

• MonetDB/Map-reduce
  • Pure map-reduce approach driven by query streams, leading to a self-organising distributed database.

• MonetDB/Octopus
  • Dynamic partial replication of databases with an economic model for reallocation and recycler technology

• MonetDB/Datacyclotron
  • Let the database hotset flow like a stream or particles through a large and fast ring of connected machines, e.g. a data collider


Get our hands dirty


Toys, Tools & Techniques


The MonetDB product family

[Diagram components: MonetDB kernel; SQL; XQuery; MAPI protocol; C-mapi lib; JDBC; ODBC; Perl; PHP; Python; RoR; End-user application]


The MonetDB Software Stack

An advanced column-oriented DBMS

[Diagram layers: XQuery; SQL 03; SQL/XML; SOAP; GIS; Open-GIS; Optimizers; compile; MonetDB 4; MonetDB 5; MonetDB kernel]


The MonetDB Software Stack

An advanced column-oriented DBMS

[Diagram layers: SQL 03; Optimizers; Extensions; MonetDB 5; MonetDB kernel]

• Orthogonal extension of SQL03
• Clear computational semantics
• Minimal extension to MonetDB


Slide source: M. Ivanova, M. L. Kersten, N. Nes, R. Goncalves, An Architecture for Recycling Intermediates, SIGMOD'09, Providence, RI, 30/06/2009

MonetDB Recycler Architecture

[Diagram components: SQL; XQuery; MonetDB Server; Tactical Optimizer; Recycler Optimizer; MAL; Run-time Support; MonetDB Kernel; Recycle Pool; Admission & Eviction]

MAL plan shown in the diagram:

    function user.s1_2(A0:date, ...):void;
        X5 := sql.bind("sys","lineitem",...);
        X10 := algebra.select(X5,A0);
        X12 := sql.bindIdx("sys","lineitem",...);
        X15 := algebra.join(X10,X12);
        X25 := mtime.addmonths(A1,A2);
        ...
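The recycle pool with admission and eviction named on this slide can be approximated by a cache of operator results keyed by the operator and its arguments. The sketch below is an assumption throughout: it admits every result and evicts the least recently used one, which is a stand-in policy and not the policy of the MonetDB recycler; `RecyclePool` and `run` are hypothetical names.

```python
from collections import OrderedDict


class RecyclePool:
    """Cache of intermediate results keyed by (operator name, arguments)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.pool = OrderedDict()
        self.hits = 0

    def run(self, op_name, func, *args):
        """Return a cached result when available, otherwise compute and keep it.

        Arguments must be hashable (for example tuples instead of lists).
        """
        key = (op_name, args)
        if key in self.pool:
            self.hits += 1
            self.pool.move_to_end(key)         # mark as most recently used
            return self.pool[key]
        result = func(*args)
        self.pool[key] = result                # admission: keep every new result
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)      # eviction: drop least recently used
        return result
```

A repeated `algebra.select` over the same column and predicate, as in the plan above, would then be answered from the pool instead of being recomputed.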


SciDB and SciLens projects

• Design and implement a database management system better geared to the requirements of scientific applications

• SciDB vision (http://www.scidb.org)
  • Array data model is missing
  • Distributed, map-reduce processing from the start
  • No-cost loading of data
  • … redo all the hard work from the ground up

• SciLens
  • Multi-paradigm software layer
  • Database summarisation is the key
  • … build on the shoulders of the MonetDB team


Teaming up and making it a success

Crossing the Great Divide is challenging and rewarding iff
• Building the bridge starts from both ends
• Parties recognize and respect each other's core business

Open-source database technology provides a sound basis to manage sizeable scientific databases
• To capitalize on and steer expertise development

The database community can provide knowledge on modelling, query processing, algorithms, data structures, scalability, persistency, … and flexible database systems

The MonetDB team seeks new frontiers in scalable structured database management
