A Metadata Binding Store for Distributed Scientific Data Yin Chen, Malcolm Atkinson, Stuart Aitken Dec. 2009 UK e-Science All Hands Meeting 2009, Oxford, 08 Dec. 2009

Page 1: A Metadata Binding Store for Distributed Scientific Data

A Metadata Binding Store for Distributed Scientific Data

Yin Chen, Malcolm Atkinson, Stuart Aitken Dec. 2009

UK e-Science All Hands Meeting 2009, Oxford, 08 Dec. 2009

Page 2: A Metadata Binding Store for Distributed Scientific Data

MOTIVATION

Scientific data and metadata are generated at great speed and in high volume

Data and metadata are often created independently

Metadata are the key to data access, discovery, preservation, provenance, and interpretation

We view the relationship between data and metadata as a binding

Hypothesis: a binding service is useful for serving distributed scientific data at various scales

Page 3: A Metadata Binding Store for Distributed Scientific Data

IS BINDING A PROBLEM?

EurExpress Project, EU funded under FP6, 2005-2009

Aims to capture >20,000 genes via RNA in situ hybridization (ISH)

Generates a digital ‘transcriptome atlas’

Nov. 2009: 19,411 assays, 15,715 annotations, ~5 TB data

[Figure: EurExpress data pipeline. Labels: 14.5 days mouse embryo; section slides; automatic ISHs (8 EU bio labs); Genepaint Robotics; high-resolution gene expression images; ISH management (LIME system); annotation (FIATAS), Alicante; template images; metadata; gene expression data repository]

Page 4: A Metadata Binding Store for Distributed Scientific Data

REAL WORLD OBSERVATIONS

Information inconsistency

1) The numbers of probe genes mismatched with the template design

2) The numbers of gene expression images without metadata

Significant human operating errors

Consistency checking became more difficult as data increased

The bindings have to be efficiently managed!
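One way to picture the second inconsistency (images without metadata) is as a set difference between known image URIs and the subjects of existing bindings. The sketch below is illustrative only: the function name, the pair-based binding layout, and the example.org URIs are hypothetical and are not taken from EurExpress or the binding store.

```python
def images_without_metadata(image_uris, bindings):
    """Return the image URIs that are not the subject of any binding.

    `bindings` is an iterable of (subject_uri, object_uri) pairs; this
    layout is an assumption made for illustration only.
    """
    bound = {subject for subject, _ in bindings}
    return sorted(set(image_uris) - bound)


# Hypothetical example data.
images = ["http://example.org/img/1", "http://example.org/img/2"]
bindings = [("http://example.org/img/1", "http://example.org/meta/1")]
unbound = images_without_metadata(images, bindings)
```

A check of this kind stays cheap as data grow only if the bindings are held in one queryable place, which is the motivation for managing them explicitly.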

Page 5: A Metadata Binding Store for Distributed Scientific Data

DESIGN PRINCIPLES

A binding system manages bindings

Generic approach, independent of data resources

Federate references to data and metadata

Data warehousing approach is no longer feasible: data become too large, too dynamic, too unwieldy to copy; no permission to copy; freshness

Allow binding sharing among user communities, scalable

Can be combined with other services

Design principle: Simple

Minimize internal complexity: no conflict

Maximize external integrity: less overlap

Page 6: A Metadata Binding Store for Distributed Scientific Data

A SIMPLE BINDING STORE

Binding Data Model

Binding ID – UUID, need no central registration authority, unlimited

Binding subject/object – URIs, used by most web accessible data resources

Binding description – Tags, efficient, flexible

Binding APIs

Manipulation operations

Discovery operations

Delivery operations
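The data model and the three API groups above can be sketched in a few lines. This is a minimal in-memory sketch, not the OGSA-DAI implementation: the class names, method names, and the reading of "delivery" as returning a stored binding to the caller are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Binding:
    subject_uri: str                  # URI of the data item
    object_uri: str                   # URI of the metadata item
    tags: frozenset = frozenset()     # free-form descriptive tags
    # Random UUID: generated locally, no central registration authority.
    binding_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class BindingStore:
    """In-memory stand-in for a binding store (hypothetical API)."""

    def __init__(self):
        self._bindings = {}

    # Manipulation operations
    def create(self, subject_uri, object_uri, tags=()):
        binding = Binding(subject_uri, object_uri, frozenset(tags))
        self._bindings[binding.binding_id] = binding
        return binding

    def delete(self, binding_id):
        return self._bindings.pop(binding_id, None) is not None

    # Discovery operations
    def find_by_tag(self, tag):
        return [b for b in self._bindings.values() if tag in b.tags]

    def find_by_subject(self, subject_uri):
        return [b for b in self._bindings.values()
                if b.subject_uri == subject_uri]

    # Delivery operations (assumed here to mean: hand back the binding)
    def get(self, binding_id):
        return self._bindings.get(binding_id)
```

The store holds only references (URIs), never the data or metadata themselves, which matches the federation principle on the previous slide.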

Page 7: A Metadata Binding Store for Distributed Scientific Data

IMPLEMENTATION

Grid technology: OGSA-DAI

OGSA-DAI server activities

OGSA-DAI client activities

OGSA-DAI client toolkits

Service Proxy APIs, programmable interface for users

Command-line UI

Not included in current work

Page 8: A Metadata Binding Store for Distributed Scientific Data

EVALUATION

Use workload modelling and simulation method

No available binding data

Observations from wwPDB, BADC, EurExpress, NanoCMOS, Flickr

Creation patterns, access patterns, and content patterns are observed

Simulation of the real-world observations

Page 9: A Metadata Binding Store for Distributed Scientific Data

WORKLOAD MODELLING

Creation Workloads

[Plots; y-axes: Number of Annotation per day; New PDB Structure per Month; Number of Data File per day]

Access Workloads

[Plot; y-axis: Number of Access per day]

Tag Behaviours

Page 10: A Metadata Binding Store for Distributed Scientific Data

WORKLOAD SIMULATION

Zipf’s Dist. α=0.9

Zipf’s Dist. α=0.4

Hidden Markov Model

Two Poisson Processes, Two Uniform Dist.

Uniform Dist.: Trend:

Zipf’s Dist. α=0.2

Weibull Dist.

Poisson Process:

[Plot; y-axis: Probability of the intervals occurrence]
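Two of the distributions named on this slide can be sampled with the Python standard library alone. The sketch below is not the authors' simulator (the setup slide lists SSJ and Colt for that); the function names, item count, rate, and seed are hypothetical. A finite Zipf distribution is built from explicit weights because exponents below 1, such as α=0.9, only normalise over a finite number of items.

```python
import random


def zipf_weights(n_items, alpha):
    """Unnormalised Zipf weights: rank r gets weight 1 / r**alpha."""
    return [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]


def sample_accesses(n_items, alpha, n_samples, rng):
    """Draw item ranks (1 = most popular) from a finite Zipf distribution."""
    ranks = list(range(1, n_items + 1))
    return rng.choices(ranks, weights=zipf_weights(n_items, alpha), k=n_samples)


def poisson_arrival_times(rate, n_events, rng):
    """Arrival times of a Poisson process: exponential gaps with mean 1/rate."""
    times, now = [], 0.0
    for _ in range(n_events):
        now += rng.expovariate(rate)
        times.append(now)
    return times


rng = random.Random(42)  # fixed seed so runs are repeatable
accesses = sample_accesses(n_items=1000, alpha=0.9, n_samples=10_000, rng=rng)
arrivals = poisson_arrival_times(rate=5.0, n_events=100, rng=rng)
```

Lowering `alpha` toward 0.2 flattens the popularity curve, so accesses spread more evenly across items.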

Page 11: A Metadata Binding Store for Distributed Scientific Data

EXPERIMENT SETUP

Intel(R) Core2 2.66 GHz, 7 GB RAM, 144 GB HD, 100 Mbps network connection, Red Hat 4.1, Tomcat 5.5, OD 3.1, MySQL 6.0, R 2.9

SSJ, Colt, benchmark script

10 runs per configuration; collected means, SEs, 95% CIs
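The per-configuration statistics can be computed as follows. This is a sketch under an assumption: the slides do not say how the 95% CIs were derived, so a Student's t interval over the 10 runs is used here, and the function name is hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (10 runs); a standard table value, hard-coded to avoid extra dependencies.
T_CRIT_DF9 = 2.262


def summarise(run_times):
    """Return (mean, standard error, 95% CI) for exactly 10 measurements."""
    if len(run_times) != 10:
        raise ValueError("T_CRIT_DF9 is only valid for 10 runs")
    mean = statistics.mean(run_times)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    se = statistics.stdev(run_times) / math.sqrt(len(run_times))
    half_width = T_CRIT_DF9 * se
    return mean, se, (mean - half_width, mean + half_width)
```

For a different number of runs the critical value changes, so the constant would need replacing with the value for the matching degrees of freedom.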

Page 12: A Metadata Binding Store for Distributed Scientific Data

EXPERIMENT RESULTS

Robust to different types of workloads

Robust to small- to large-scale workloads

Robust to both independent and combined workloads

Stressed by ultra-scale workloads

Page 13: A Metadata Binding Store for Distributed Scientific Data

FUTURE WORK

A Scalable Binding Store

Cloud computing promises to be scalable

Our evaluation of Hadoop

Page 14: A Metadata Binding Store for Distributed Scientific Data

BINDING APPLICATIONS

The Web moves to Web 3.0

Binding index

Combine with metadata management tools

Mashup applications

Page 15: A Metadata Binding Store for Distributed Scientific Data

ACKNOWLEDGEMENT

National e-Science Center: research group, support team, middleware team

MRC HGU Biomedical Statistical Analysis Section: Prof Richard Baldock, Dr Duncan Davidson

Newcastle HDBR: Prof Susan Lindsay, Steven N. Lisgo

EDINA Geo Research & Data Library: Chris Higgins, Dr David Medyckyj-Scott

Data resources: DGEMap, EurExpress: Prof Richard Baldock, Lalit Kumar; NanoCMOS: Dr Clive Davenhall, Prof Richard Sinnott

Technical support: OGSA-DAI team

Research materials: COBrA-CT, OntoGrid: Prof Carole Goble, Dr Oscar Corcho; MyGrid: Dr Phillip Lord