A Metadata Binding Store for Distributed Scientific Data
Yin Chen, Malcolm Atkinson, Stuart Aitken Dec. 2009
UK e-Science All Hands Meeting 2009, Oxford, 08 Dec. 2009
MOTIVATION
Scientific data and metadata are generated at great speed and in high volume
Data and Metadata are often created independently
Hypothesis: a binding service is useful for serving distributed scientific data at various scales
Metadata are the key to data access, discovery, preservation, provenance, interpretation
We view the relationship between data and metadata as a binding
EurExpress Project, EU funded under FP6, 2005-2009.
Aims to capture >20,000 genes via RNA in situ hybridization (ISH).
Generates a digital ‘transcriptome atlas’
IS BINDING A PROBLEM?
Nov. 2009: 19,411 assays, 15,715 annotations, ~5 TB of data
[Figure: EurExpress data pipeline. Labels: 14.5-day mouse embryo; section slides; automatic ISHs (8 EU bio labs); Genepaint robotics; high-resolution gene expression images; ISH management (LIME system); template images; annotation (FIATAS), Alicante; metadata; Gene Expression Data Repository]
REAL WORLD OBSERVATIONS
Information inconsistency
1) The number of probe genes mismatched with the template design
2) The number of gene expression images without metadata
Significant human operating errors
Consistency checking became more difficult as data increased
The bindings have to be efficiently managed!
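The two inconsistencies listed above can be detected mechanically once the data and metadata references are held side by side. The following is a minimal sketch, not the EurExpress tooling: the function name, the gene symbols, the image paths, and the dictionary layout are all hypothetical and chosen only for illustration.

```python
def check_consistency(template_genes, probe_genes, image_uris, metadata_by_image):
    """Report probe/template mismatches and images that lack metadata.

    All arguments are illustrative stand-ins: two collections of gene
    names, a collection of image references, and a dict keyed by image.
    """
    template = set(template_genes)
    probes = set(probe_genes)
    return {
        # Probes that were run but do not appear in the template design.
        "probes_not_in_template": sorted(probes - template),
        # Template entries for which no probe was found.
        "template_genes_without_probe": sorted(template - probes),
        # Images that have no metadata record bound to them.
        "images_without_metadata": sorted(
            uri for uri in set(image_uris) if uri not in metadata_by_image
        ),
    }


report = check_consistency(
    template_genes=["Shh", "Pax6", "Sox2"],
    probe_genes=["Shh", "Pax6", "Wnt1"],
    image_uris=["img/001.tif", "img/002.tif"],
    metadata_by_image={"img/001.tif": {"gene": "Shh"}},
)
print(report)
```

A check like this is cheap for small collections, which fits the observation that the difficulty comes from scale rather than from the logic itself.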
DESIGN PRINCIPLES
A binding system manages bindings
Generic approach, independent from data resources
Federate references of data and metadata
Data warehousing approach is no longer feasible
Data become too large, too dynamic, too unwieldy to copy
No permission to copy
Freshness
Allow binding sharing among user communities, scalable
Can be combined with other services
Design principle: Simple
Minimize internal complexity: no conflict
Maximize external integrity: less overlap
A SIMPLE BINDING STORE
Binding Data Model
Binding ID – UUID, needs no central registration authority, effectively unlimited
Binding subject/object – URIs, used by most web accessible data resources
Binding description – Tags, efficient, flexible
Binding APIs
Manipulation operations
Discovery operations
Delivery operations
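To make the data model and the three API groups concrete, here is a minimal in-memory sketch. It is not the OGSA-DAI implementation described in the talk: the class names, method names, and example URIs are hypothetical, and treating the subject as the data item and the object as the metadata item is an assumption made only for this example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Binding:
    subject_uri: str            # assumed here to reference the data item
    object_uri: str             # assumed here to reference the metadata item
    tags: frozenset = frozenset()
    # A random UUID needs no central registration authority.
    binding_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class BindingStore:
    """Hypothetical in-memory store; holds references only, never the data."""

    def __init__(self):
        self._bindings = {}

    # Manipulation operations
    def create(self, subject_uri, object_uri, tags=()):
        binding = Binding(subject_uri, object_uri, frozenset(tags))
        self._bindings[binding.binding_id] = binding
        return binding

    def delete(self, binding_id):
        return self._bindings.pop(binding_id, None) is not None

    # Discovery operations
    def find_by_subject(self, subject_uri):
        return [b for b in self._bindings.values() if b.subject_uri == subject_uri]

    def find_by_tag(self, tag):
        return [b for b in self._bindings.values() if tag in b.tags]

    # Delivery operation
    def get(self, binding_id):
        return self._bindings.get(binding_id)


store = BindingStore()
binding = store.create(
    "http://example.org/images/001.tif",
    "http://example.org/annotations/001.xml",
    tags=["ISH", "mouse-embryo"],
)
print(store.find_by_tag("ISH"))
```

Because the store keeps only URIs and tags, the underlying data and metadata stay at their original resources, in line with the federated-reference principle.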
IMPLEMENTATION
Grid technology: OGSA-DAI
OGSA-DAI server activities
OGSA-DAI client activities
OGSA-DAI client toolkits
Service Proxy APIs, programmable interface for users
Command-line UI
Not included in current work
EVALUATION
Use workload modelling and simulation method
No available binding data
Observations from wwPDB, BADC, EurExpress, NanoCMOS, Flickr
Creation patterns, access patterns, and content patterns are observed
Simulation of the real-world observations
WORKLOAD MODELLING
Creation Workloads
[Plots: number of annotations per day; new PDB structures per month; number of data files per day]
Access Workloads
[Plot: number of accesses per day]
Tag Behaviours
WORKLOAD SIMULATION
[Figure panels: Zipf's distribution (α=0.9); Zipf's distribution (α=0.4); Hidden Markov Model; two Poisson processes and two uniform distributions; uniform distribution; trend; Zipf's distribution (α=0.2); Weibull distribution; Poisson process. Axis label: probability of the interval's occurrence]
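As a rough illustration of how distributions of this kind can drive a synthetic workload, the sketch below samples item popularity from a finite Zipf distribution, event times from a homogeneous Poisson process, and a few values from a Weibull distribution. It uses only the Python standard library rather than the SSJ and Colt libraries used in the experiments, and every parameter value other than α=0.9 (item count, rate, duration, Weibull scale and shape) is an assumption chosen for the example.

```python
import random


def zipf_weights(n_items, alpha):
    """Unnormalised Zipf weights: the item of rank k gets weight 1 / k**alpha."""
    return [1.0 / (rank ** alpha) for rank in range(1, n_items + 1)]


def sample_zipf(rng, n_items, alpha, n_samples):
    """Draw 1-based item ranks from a finite Zipf distribution."""
    ranks = list(range(1, n_items + 1))
    return rng.choices(ranks, weights=zipf_weights(n_items, alpha), k=n_samples)


def poisson_arrivals(rng, rate, duration):
    """Event times of a homogeneous Poisson process on [0, duration).

    Inter-arrival gaps are exponentially distributed with mean 1 / rate.
    """
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times


rng = random.Random(42)  # fixed seed so repeated runs are comparable
accessed = sample_zipf(rng, n_items=1000, alpha=0.9, n_samples=10_000)
arrivals = poisson_arrivals(rng, rate=5.0, duration=100.0)
# weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
weibull_values = [rng.weibullvariate(1.0, 1.5) for _ in range(5)]
print(len(accessed), len(arrivals), weibull_values)
```

A finite Zipf sampler is written by hand here because the exponents shown on the slide are below 1, where the infinite-support Zipf distribution is not normalisable.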
EXPERIMENT SETUP
Intel(R) Core2 2.66 GHz, 7 GB RAM, 144 GB HD, 100 Mbps network connection, Red Hat 4.1, Tomcat 5.5, OD 3.1, MySQL 6.0, R 2.9.
SSJ, Colt, benchmark script
10 runs per configuration; collected means, SEs, 95% CIs
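For reference, the summary statistics named above can be computed from ten runs as in the sketch below. The timing values are made up for illustration, and the use of a Student's t interval (critical value 2.262 for 9 degrees of freedom) is an assumption; the slides do not say how the 95% CIs were derived.

```python
import math
import statistics

# Two-sided 95% Student's t critical value for 9 degrees of freedom (10 runs).
T_CRIT_95_DF9 = 2.262


def summarise(runs, t_crit=T_CRIT_95_DF9):
    """Return (mean, standard error, 95% confidence interval) for the runs.

    The default t_crit is only valid for 10 runs; pass another value otherwise.
    """
    n = len(runs)
    mean = statistics.mean(runs)
    se = statistics.stdev(runs) / math.sqrt(n)  # sample standard deviation / sqrt(n)
    half_width = t_crit * se
    return mean, se, (mean - half_width, mean + half_width)


# Hypothetical response times in milliseconds for one configuration.
timings_ms = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.6]
mean, se, ci = summarise(timings_ms)
print(f"mean={mean:.2f} ms, SE={se:.3f} ms, 95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```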
EXPERIMENT RESULTS
Robust to different types of workloads
Robust to small- to large-scale workloads
Robust to both independent and combined workloads
Stressed by ultra-scale workloads
FUTURE WORK
A Scalable Binding Store
Cloud computing promises to be scalable
Our evaluation of Hadoop
BINDING APPLICATIONS
Web moving to Web 3.0
Binding index
Combine with metadata management tools
Mashup applications
ACKNOWLEDGEMENT
National e-Science Center, research group, support team, middleware team
MRC HGU Biomedical Statistical Analysis Section: Prof Richard Baldock, Dr Duncan Davidson
Newcastle HDBR: Prof Susan Lindsay, Steven N. Lisgo
EDINA Geo Research & Data Library: Chris Higgins, Dr David Medyckyj-Scott
Data resources: DGEMap, EurExpress Prof Richard Baldock, Lalit Kumar, NanoCMOS Dr Clive Davenhall, Prof Richard Sinnott
Technical support: OGSA-DAI team
Research materials: COBrA-CT, OntoGrid Prof Carole Goble, Dr Oscar Corcho, MyGrid Dr Phillip Lord