Discovery Net: A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services
Patrick Wendel, Imperial College London
Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003
Why Discovery Net?
- Data Challenge: distributed, heterogeneous and large-scale data sets; novel and real-time data sources
- Resource Challenge: novel specialised data analysis components/services continually being published and made available; computational resources provided
- Information Challenge: data cleaning, normalisation and calibration; new data needs to be related to existing data
- Knowledge Challenge: collaborative, interactive and people-intensive; result interpretation and validation in relation to existing knowledge; knowledge sharing is key
What is Discovery Net
Goal: construct an infrastructure for global-wide knowledge discovery services.

Key technologies:
- Grid and distributed computing
- Workflow and service composition
- Data mining and visualisation
- Data access and information structuring
- High-throughput screening devices (real-time)
Discovery Net: Unifying the World’s Knowledge
- Data Integration: dynamic real-time construction of "data grids"
- Application Integration: component- and service-based integration
- People Integration: global-wide discovery groupware
- Knowledge Integration: multi-subject and multi-modality integrative analysis to cross-validate and annotate related discovery work
What is Discovery Net

- Using distributed resources
- Real-time integration
- Dynamic application integration
- Workflow construction
- Interactive visual analysis
Discovery Net Layer Model (Life Science Application)

- High-performance and Grid-enabled transfer protocols (GSI-FTP, DSTP, ...)
- Grid-enabled infrastructure (GSI)
- Deployment: Web/Grid services, OGSA
- D-Net Clients: end-user applications and user interfaces allowing scientists to construct and drive knowledge discovery activities
- D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources
- Computation & Data Resources: distributed databases, compute servers and scientific devices
A Knowledge Grid based on D-Net Servers
[Figure: a knowledge Grid built from interconnected DNet servers. Each DNet server layers data access & storage, InfoGrid components, computation and deployment behind the DNet API. Servers connect knowledge discovery services, computational services and data sources (WWW, RDBMS) across the Internet to DNet clients, DNet participating clients and thin Web clients; workflows are exchanged as DPML and data as XML.]
- Several types of clients for different usages, from thin Web client to participating client
- Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
- Deployment and API access already go through standard Web/Grid services
Goal: Plug & Play Data Sources, Analysis Components and Knowledge Discovery Processes
Discovery Process Management
- Workflow-based service composition: the data-flow approach fits the knowledge discovery process and allows scientists to develop processes themselves
- Towards a standard workflow representation for discovery informatics, the Discovery Process Markup Language (DPML):
  - Contains component data-flow graphs
  - Records collaboration information (user, changes)
  - Records execution constraints (location, parameterisation)
  - Becomes key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms
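As a rough illustration of the kind of document DPML describes (the actual DPML schema is not shown in these slides, so every element and attribute name below is invented), a component data-flow graph with collaboration and execution metadata can be serialised as XML:

```python
import xml.etree.ElementTree as ET

# Hypothetical DPML-like document; element/attribute names are
# illustrative assumptions, not the real DPML schema.
workflow = ET.Element("workflow", name="genome-annotation")

# Component nodes of the data-flow graph
for name in ("fetch-sequence", "blast-search", "annotate"):
    ET.SubElement(workflow, "component", id=name)

# Data-flow edges between components
ET.SubElement(workflow, "link", source="fetch-sequence", target="blast-search")
ET.SubElement(workflow, "link", source="blast-search", target="annotate")

# Collaboration and execution metadata recorded alongside the graph
meta = ET.SubElement(workflow, "metadata")
ET.SubElement(meta, "author").text = "scientist@example.org"
ET.SubElement(meta, "constraint", component="blast-search",
              location="compute-server-1")

dpml = ET.tostring(workflow, encoding="unicode")
print(dpml)
```

Because the process is a plain XML document, it can be stored, shared, audited and redeployed, which is exactly what makes it reusable intellectual property.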
D-Net Workflow for Genome Annotation: 16 services executing across the Internet
InfoGrid: Dynamic Data Integration
Integrative Analysis
[Figure: integrative analysis across heterogeneous data categories:
- Gene: sequence, expression, function, ...
- Protein/Targets: sequence, structure, location, function, ...
- Biological Screening: activity, protocols, toxicology, metabolic pathways, ...
- Chemistry: structures, libraries, catalogues, synthetic pathways, ...
- Journals: journals, project reports, patents, ...
- Clinical: trials, patients, ...]
Dynamic Data Integration = On-demand access to heterogeneous data sources + information structuring
Towards a Dynamic Information Integration Methodology:

- Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources.
- On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC, HTTP/XML/XPath, Web Service).
- Easy maintenance: wrappers/drivers for new data sources can be added through a clean API.
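The wrapper/driver idea behind that "clean API" can be sketched as follows; `SourceRegistry`, its methods and the example scheme names are illustrative assumptions, not the actual InfoGrid API:

```python
from typing import Callable, Dict

class SourceRegistry:
    """Register, locate and connect to specialised information sources."""

    def __init__(self) -> None:
        self._drivers: Dict[str, Callable[[str], object]] = {}

    def register(self, scheme: str, driver: Callable[[str], object]) -> None:
        # A driver wraps one access protocol (e.g. JDBC, HTTP/XML, Web Service)
        self._drivers[scheme] = driver

    def connect(self, url: str) -> object:
        scheme = url.split("://", 1)[0]
        if scheme not in self._drivers:
            raise KeyError(f"no wrapper registered for {scheme!r}")
        return self._drivers[scheme](url)

registry = SourceRegistry()
registry.register("http-xml", lambda url: f"connected to {url}")
print(registry.connect("http-xml://swissprot.example/query"))
```

Adding support for a new source type then only means registering one more driver, without touching existing integrations.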
Dynamic Application Integration Services
Dynamic Application Integration = On-demand access and composition of remote analysis components
Towards Dynamic Component Integration:

- Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type).
- Execution service: allows users to control the execution of components in distributed environments.
- Easy maintenance: new components can be added through a clean API.
Example components: regression, clustering, classification, gene function prediction, homology search, promoter prediction.
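The register/locate/execute pattern described above can be sketched like this; the `Component` and `ComponentService` names are illustrative assumptions, and the "remote" execution is simulated by a local call:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Component:
    name: str
    location: str            # resource where the component runs
    run: Callable[..., Any]  # stands in for a Java interface or WS port type

class ComponentService:
    """Register, locate and execute analysis components."""

    def __init__(self) -> None:
        self._components: Dict[str, Component] = {}

    def register(self, component: Component) -> None:
        self._components[component.name] = component

    def locate(self, name: str) -> Component:
        return self._components[name]

    def execute(self, name: str, *args: Any) -> Any:
        # A real execution service would dispatch to the remote host;
        # here the call happens locally for illustration.
        return self.locate(name).run(*args)

service = ComponentService()
service.register(Component("clustering", "compute-server-1",
                           lambda data: sorted(data)))
print(service.execute("clustering", [3, 1, 2]))
```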
D-NET API
Discovery Deployment
[Figure: a discovery process in DPML deployed as a discovery service, batch processing job, report or discovery component]
Discovery Deployment = On-demand rapid application construction and publishing
Towards Dynamic Deployment of Knowledge Discovery Procedures:

- Deployment engine: allows users to build and publish applications, based on DPML code coordinating remotely executed components, as Web pages, Web/Grid services or command line tools.
- Easy maintenance: new discovery procedures are described in DPML, a standardised representation of "composed" discovery procedures.
- Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports.
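A minimal sketch of publishing one stored procedure in several forms, in the spirit of the deployment engine; all names are assumptions, and `procedure` is a stand-in for executing a real DPML-described process:

```python
from typing import Dict, List

def procedure(sequence: str) -> str:
    """Stand-in for a DPML-described discovery procedure."""
    return f"annotation({sequence})"

# The same procedure, published in different forms:

def as_command_line(argv: List[str]) -> str:
    # e.g. `annotate ATGC` on the command line
    return procedure(argv[1])

def as_web_service(request: Dict[str, str]) -> Dict[str, str]:
    # e.g. a Web/Grid service endpoint receiving a request document
    return {"result": procedure(request["sequence"])}

print(as_web_service({"sequence": "ATGC"}))
```

The point is that the procedure itself is written once (in DPML) and only the thin publication wrappers differ.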
Knowledge Integration & Interpretation
Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge
Towards a Knowledge Integration Framework: Multi-subject data analysis
- Specialised client interfaces: interactive analysis and dynamic component interaction
- Result annotation, structuring and storage: information source query, result browsing, sharing and markup

Example analyses: sequence analysis, text mining, genetic analysis, pathway analysis.
Life science example application
Workflow execution
Component execution location resolution:
- Uses the user's list of known resources
- A component can explicitly require execution on a particular resource
- A component can choose from a proposed set of resources (and could use Grid resource information systems and network weather information to determine where to go)
- For unconstrained components, a simple "near the data" execution policy: if there is a single input data location, execute there; otherwise fall back to the original execution location

This allows usual DPKD workflows to be designed, and handles data management and transfer (serialisation: Java-based, FTP-based).
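The resolution policy above can be written down directly; the `Component` data model here is an assumption for illustration only:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    name: str
    required_resource: Optional[str] = None    # explicit placement request
    candidate_resources: List[str] = field(default_factory=list)
    input_locations: List[str] = field(default_factory=list)
    original_location: str = "local"

def resolve_location(c: Component) -> str:
    if c.required_resource:           # explicit requirement wins
        return c.required_resource
    if c.candidate_resources:         # choose among the proposed set; a real
        return c.candidate_resources[0]  # policy could consult GRIS/NWS here
    locations = set(c.input_locations)
    if len(locations) == 1:           # "near the data" policy
        return locations.pop()
    return c.original_location        # fallback: original execution location

print(resolve_location(Component("blast", input_locations=["server-a"])))
```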
Discovery Net and Grid technologies
- Cluster/Campus Grid level: partial or complete workflow execution on Condor/SGE; task farming on subsets of the workflow
- Global Grid: GSI integration (Java CoG Kit); GSI-FTP transfer functionality (Java CoG Kit); OGSA Grid Service access to functionalities (GT3); potential use of GRIS or NWS in component implementations
- Open questions: Globus scheduler? Unicore? SRB?
Discovery Net Application Testbeds
- Life science testbed: gene sequencing, protein chips. High-throughput real-time genome annotation: analyse and interpret new sequences using existing distributed bioinformatics tools and databases.
- Environmental modelling: pollution sensors (GUSTO) measuring SO2, benzene, ... High-throughput real-time pollution monitoring: analyse and interpret time-resolved correlations among remote stations and with other environmental data sets.
- Geo-hazard prediction: multi-spectral, multi-temporal satellite imagery. Real-time geo-hazard prediction: analyse and interpret satellite images together with other data sets to generate thematic knowledge.
GUSTO UNITS with wireless connectivity
Case Study:SC2002 HPC Challenge
[Figure: SC2002 genome annotation pipeline spanning 15 databases and 21 applications, fed by high-throughput sequencers:
- Nucleotide-level annotation: blast, genscan, RepeatMasker, grail and E-PCR identify genes, gene markers, tRNAs, rRNAs, non-translated RNAs, regulatory regions, repetitive elements, segmental duplications, SNP variations, literature references, ...
- Protein-level annotation: blast, 3D-PSSM, motif search, PFAM, DSC, Predator, InterPro, SMART and SWISS-PROT identify homologues, functional characterisation, domain 3-D structure, fold prediction, secondary structure and literature references, and classify proteins into protein families
- Process-level annotation: ontologies, pathway maps, gene maps, AmiGO, GenNav and a virtual chip relate the organism's DNA and chromosomes to cell cycle, metabolism, drugs and biological processes (cell death, embryogenesis, ...), with literature references
- Databases: NCBI, EMBL, TIGR, SNP, GO, CSNDB, GK, KEGG]
D-Net based Global Collaborative Real-Time Genome Annotation
Nucleotide Annotation Workflows
How It Works
1. Download sequence from the reference server
2. Execute the distributed annotation workflow (against NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO and KEGG)
3. Save results to the Distributed Annotation Server
4. Explore them in the interactive editor and visualisation
1,800 clicks, 500 Web accesses, 200 copy/paste operations and 3 weeks of work, replaced by 1 workflow with a few seconds of execution.
Conclusion and Future Work

- Towards an open integration platform that enables scientists to conduct their knowledge discovery activities
- Several levels of integration are required
- Enable use of available resources
- Future work: evolution towards cost-model integration (performance, value, QoS); semantics-based service retrieval and composition; other useful standards (OGSA-DAI?)