Discovery Net: A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services
Patrick Wendel, Imperial College London
Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003
Why Discovery Net?
- Data Challenge: distributed, heterogeneous and large-scale data sets; novel and real-time data sources
- Resource Challenge: novel specialised data analysis components/services continually being published and made available; computational resources provided
- Information Challenge: data cleaning, normalisation and calibration; new data needs to be related to existing data
- Knowledge Challenge: collaborative, interactive and people-intensive; result interpretation and validation in relation to existing knowledge; knowledge sharing is key
What is Discovery Net
Goal: construct an infrastructure for global-wide knowledge discovery services.

Key technologies:
- Grid and distributed computing
- Workflow and service composition
- Data mining and visualisation
- Data access and information structuring
- High-throughput screening devices (real-time)
Discovery Net: Unifying the World’s Knowledge
- Data Integration: dynamic real-time construction of "data grids"
- Application Integration: component- and service-based integration
- People Integration: global-wide discovery groupware
- Knowledge Integration: multi-subject and multi-modality integrative analysis to cross-validate and annotate related discovery work
What is Discovery Net

- Using distributed resources
- Real-time integration
- Dynamic application integration
- Workflow construction
- Interactive visual analysis
Discovery Net Layer Model (Life Science Application)

- High-performance and Grid-enabled transfer protocols (GSI-FTP, DSTP, ...)
- Grid-enabled infrastructure (GSI)
- Deployment: Web/Grid services, OGSA
- D-Net Clients: end-user applications and user interfaces allowing scientists to construct and drive knowledge discovery activities
- D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources
- Computation & Data Resources: distributed databases, compute servers and scientific devices
A Knowledge Grid based on D-Net Servers
[Figure: a knowledge Grid built from interconnected DNet servers. Each DNet server layers data access & storage, InfoGrid components, computation and deployment behind the DNet API. Servers connect knowledge discovery services, computational services and data sources (WWW, RDBMS) across the Internet to DNet clients, DNet participating clients and thin Web clients; workflows are exchanged as DPML and data as XML.]
- Several types of clients for different usages, from thin Web client to participating client
- Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
- Deployment and API access already go through standard Web/Grid services
Goal: Plug & Play Data Sources, Analysis Components and Knowledge Discovery Processes
Discovery Process Management
- Workflow-based service composition: the data-flow approach fits the knowledge discovery process and allows scientists to develop processes themselves
- Towards a standard workflow representation for discovery informatics, the Discovery Process Markup Language (DPML):
  - Contains component data-flow graphs
  - Records collaboration information (user, changes)
  - Records execution constraints (location, parameterisation)
  - Becomes key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms
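As a rough illustration of the kind of document DPML describes (the actual DPML schema is not shown in these slides, so every element and attribute name below is invented), a component data-flow graph with collaboration and execution metadata can be serialised as XML:

```python
import xml.etree.ElementTree as ET

# Hypothetical DPML-like document; element/attribute names are
# illustrative assumptions, not the real DPML schema.
workflow = ET.Element("workflow", name="genome-annotation")

# Component nodes of the data-flow graph
for name in ("fetch-sequence", "blast-search", "annotate"):
    ET.SubElement(workflow, "component", id=name)

# Data-flow edges between components
ET.SubElement(workflow, "link", source="fetch-sequence", target="blast-search")
ET.SubElement(workflow, "link", source="blast-search", target="annotate")

# Collaboration and execution metadata recorded alongside the graph
meta = ET.SubElement(workflow, "metadata")
ET.SubElement(meta, "author").text = "scientist@example.org"
ET.SubElement(meta, "constraint", component="blast-search",
              location="compute-server-1")

dpml = ET.tostring(workflow, encoding="unicode")
print(dpml)
```

Because the process is a plain XML document, it can be stored, shared, audited and redeployed, which is exactly what makes it reusable intellectual property.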
D-Net Workflow for Genome Annotation: 16 services executing across the Internet
InfoGrid: Dynamic Data Integration
Integrative Analysis
[Figure: integrative analysis across heterogeneous data categories:
- Gene: sequence, expression, function, ...
- Protein/Targets: sequence, structure, location, function, ...
- Biological Screening: activity, protocols, toxicology, metabolic pathways, ...
- Chemistry: structures, libraries, catalogues, synthetic pathways, ...
- Journals: journals, project reports, patents, ...
- Clinical: trials, patients, ...]
Dynamic Data Integration = On-demand access to heterogeneous data sources + information structuring
Towards a Dynamic Information Integration Methodology:

- Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources.
- On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC, HTTP/XML/XPath, Web Service).
- Easy maintenance: wrappers/drivers for new data sources can be added through a clean API.
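The wrapper/driver idea behind that "clean API" can be sketched as follows; `SourceRegistry`, its methods and the example scheme names are illustrative assumptions, not the actual InfoGrid API:

```python
from typing import Callable, Dict

class SourceRegistry:
    """Register, locate and connect to specialised information sources."""

    def __init__(self) -> None:
        self._drivers: Dict[str, Callable[[str], object]] = {}

    def register(self, scheme: str, driver: Callable[[str], object]) -> None:
        # A driver wraps one access protocol (e.g. JDBC, HTTP/XML, Web Service)
        self._drivers[scheme] = driver

    def connect(self, url: str) -> object:
        scheme = url.split("://", 1)[0]
        if scheme not in self._drivers:
            raise KeyError(f"no wrapper registered for {scheme!r}")
        return self._drivers[scheme](url)

registry = SourceRegistry()
registry.register("http-xml", lambda url: f"connected to {url}")
print(registry.connect("http-xml://swissprot.example/query"))
```

Adding support for a new source type then only means registering one more driver, without touching existing integrations.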
Dynamic Application Integration Services
Dynamic Application Integration = On-demand access and composition of remote analysis components
Towards Dynamic Component Integration:

- Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type).
- Execution service: allows users to control the execution of components in distributed environments.
- Easy maintenance: new components can be added through a clean API.
Example components: regression, clustering, classification, gene function prediction, homology search, promoter prediction.
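The register/locate/execute pattern described above can be sketched like this; the `Component` and `ComponentService` names are illustrative assumptions, and the "remote" execution is simulated by a local call:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Component:
    name: str
    location: str            # resource where the component runs
    run: Callable[..., Any]  # stands in for a Java interface or WS port type

class ComponentService:
    """Register, locate and execute analysis components."""

    def __init__(self) -> None:
        self._components: Dict[str, Component] = {}

    def register(self, component: Component) -> None:
        self._components[component.name] = component

    def locate(self, name: str) -> Component:
        return self._components[name]

    def execute(self, name: str, *args: Any) -> Any:
        # A real execution service would dispatch to the remote host;
        # here the call happens locally for illustration.
        return self.locate(name).run(*args)

service = ComponentService()
service.register(Component("clustering", "compute-server-1",
                           lambda data: sorted(data)))
print(service.execute("clustering", [3, 1, 2]))
```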
D-NET API
Discovery Deployment
[Figure: a discovery process in DPML deployed as a discovery service, batch processing job, report or discovery component]
Discovery Deployment = On-demand rapid application construction and publishing
Towards Dynamic Deployment of Knowledge Discovery Procedures:

- Deployment engine: allows users to build and publish applications, based on DPML code coordinating remotely executed components, as Web pages, Web/Grid services or command line tools.
- Easy maintenance: new discovery procedures are described in DPML, a standardised representation of "composed" discovery procedures.
- Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports.
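A minimal sketch of publishing one stored procedure in several forms, in the spirit of the deployment engine; all names are assumptions, and `procedure` is a stand-in for executing a real DPML-described process:

```python
from typing import Dict, List

def procedure(sequence: str) -> str:
    """Stand-in for a DPML-described discovery procedure."""
    return f"annotation({sequence})"

# The same procedure, published in different forms:

def as_command_line(argv: List[str]) -> str:
    # e.g. `annotate ATGC` on the command line
    return procedure(argv[1])

def as_web_service(request: Dict[str, str]) -> Dict[str, str]:
    # e.g. a Web/Grid service endpoint receiving a request document
    return {"result": procedure(request["sequence"])}

print(as_web_service({"sequence": "ATGC"}))
```

The point is that the procedure itself is written once (in DPML) and only the thin publication wrappers differ.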
Knowledge Integration & Interpretation
Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge
Towards a Knowledge Integration Framework: Multi-subject data analysis
- Specialised client interfaces: interactive analysis and dynamic component interaction
- Result annotation, structuring and storage: information source query, result browsing, sharing and markup

Example analyses: sequence analysis, text mining, genetic analysis, pathway analysis.
Life science example application
Workflow execution
Component execution location resolution:
- Uses the user's list of known resources
- A component can explicitly require execution on a particular resource
- A component can choose from a proposed set of resources (and could use Grid resource information systems and network weather information to determine where to go)
- For unconstrained components, a simple "near the data" execution policy: if there is a single input data location, execute there; otherwise fall back to the original execution location

This allows usual DPKD workflows to be designed, and handles data management and transfer (serialisation: Java-based, FTP-based).
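The resolution policy above can be written down directly; the `Component` data model here is an assumption for illustration only:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    name: str
    required_resource: Optional[str] = None    # explicit placement request
    candidate_resources: List[str] = field(default_factory=list)
    input_locations: List[str] = field(default_factory=list)
    original_location: str = "local"

def resolve_location(c: Component) -> str:
    if c.required_resource:           # explicit requirement wins
        return c.required_resource
    if c.candidate_resources:         # choose among the proposed set; a real
        return c.candidate_resources[0]  # policy could consult GRIS/NWS here
    locations = set(c.input_locations)
    if len(locations) == 1:           # "near the data" policy
        return locations.pop()
    return c.original_location        # fallback: original execution location

print(resolve_location(Component("blast", input_locations=["server-a"])))
```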
Discovery Net and Grid technologies
- Cluster/Campus Grid level: partial or complete workflow execution on Condor/SGE; task farming on subsets of the workflow
- Global Grid: GSI integration (Java CoG Kit); GSI-FTP transfer functionality (Java CoG Kit); OGSA Grid Service access to functionalities (GT3); potential use of GRIS or NWS in component implementations
- Open questions: Globus scheduler? Unicore? SRB?
Discovery Net Application Testbeds
- Life science testbed: gene sequencing, protein chips. High-throughput real-time genome annotation: analyse and interpret new sequences using existing distributed bioinformatics tools and databases.
- Environmental modelling: pollution sensors (GUSTO) measuring SO2, benzene, ... High-throughput real-time pollution monitoring: analyse and interpret time-resolved correlations among remote stations and with other environmental data sets.
- Geo-hazard prediction: multi-spectral, multi-temporal satellite imagery. Real-time geo-hazard prediction: analyse and interpret satellite images together with other data sets to generate thematic knowledge.
GUSTO UNITS with wireless connectivity
Case Study:SC2002 HPC Challenge
[Figure: SC2002 genome annotation pipeline spanning 15 databases and 21 applications, fed by high-throughput sequencers:
- Nucleotide-level annotation: blast, genscan, RepeatMasker, grail and E-PCR identify genes, gene markers, tRNAs, rRNAs, non-translated RNAs, regulatory regions, repetitive elements, segmental duplications, SNP variations, literature references, ...
- Protein-level annotation: blast, 3D-PSSM, motif search, PFAM, DSC, Predator, InterPro, SMART and SWISS-PROT identify homologues, functional characterisation, domain 3-D structure, fold prediction, secondary structure and literature references, and classify proteins into protein families
- Process-level annotation: ontologies, pathway maps, gene maps, AmiGO, GenNav and a virtual chip relate the organism's DNA and chromosomes to cell cycle, metabolism, drugs and biological processes (cell death, embryogenesis, ...), with literature references
- Databases: NCBI, EMBL, TIGR, SNP, GO, CSNDB, GK, KEGG]
D-Net based Global Collaborative Real-Time Genome Annotation
Nucleotide Annotation Workflows
How It Works
1. Download sequence from the reference server
2. Execute the distributed annotation workflow (against NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO and KEGG)
3. Save results to the Distributed Annotation Server
4. Explore them in the interactive editor and visualisation
1,800 clicks, 500 Web accesses, 200 copy/paste operations and 3 weeks of work, replaced by 1 workflow with a few seconds of execution.
Conclusion and Future Work

- Towards an open integration platform that enables scientists to conduct their knowledge discovery activities
- Several levels of integration are required
- Enable use of available resources
- Future work: evolution towards cost-model integration (performance, value, QoS); semantics-based service retrieval and composition; other useful standards (OGSA-DAI?)