
BioMAS: A Multi-Agent System for Automated Genomic Annotation

Keith Decker
Department of Computer and Information Sciences

University of Delaware

Salim Khan, Ravi Makkena, Gang Situ
Computer & Information Sciences

Dr. Carl Schmidt, Heebal Kim
Animal & Food Sciences

Outline

General class of problems and MAS solution approach

BioMAS: Automated Genomic Annotation
  HVDB: HerpesVirus Database
  ChickDB: Gallus gallus Database

GOFigure!
CoPrDom
Signal Transduction Pathway Discovery

What problems are we addressing?

Huge, dynamic “Primary Source” Databases
  Highly distributed, overlapping
  Heterogeneous content, structure, curation
Multitude of analysis algorithms
  Different interfaces, output formats
  Create contingent process plans chaining many analyses together
Individual PIs, working on non-model organisms
  Learn, then hand-navigate sea of DBs and analysis tools
  Easily overwhelmed by new sequence and EST data
  Struggle to make results available usefully to others

Approach: Multi-Agent Information Gathering

Software agents for information retrieval, filtering, integration, analysis, and display
Embody heterogeneous database technology (wrappers, mediators, …)
Deal with dynamic data and changing data sources
Efficient and robust distributed computation (for both info retrieval and analysis)
Deal with issues of data organization and ownership
Natural approach to providing integrated information
  To humans via web
  To other agents via semantic markup [XML/OIL/DAML]

Example: Multi-Agent System for Automated Herpesvirus Annotation

Input: raw sequence data
Output: an annotated database that allows fairly complex queries
  BLAST homologs
  Motifs
  Protein domains [ProDomain records]
  PSORT sub-cellular location predictions
  GO [Gene Ontology] electronic annotation

“Show me all the genes in Marek’s Disease virus with a tyrosine phosphorylation motif and a transmembrane domain value ≥ 2”
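A query like the one above amounts to a filter over annotated gene records. The following is a minimal sketch in Python, assuming a hypothetical record layout: the class `GeneAnnotation`, its field names, and the placeholder rows are all illustrative assumptions, not the BioMAS knowledgebase schema or real annotation data.

```python
from dataclasses import dataclass, field


@dataclass
class GeneAnnotation:
    """Hypothetical annotation record; field names are illustrative assumptions."""
    gene: str
    organism: str
    motifs: set = field(default_factory=set)
    tm_domains: int = 0  # predicted transmembrane domain value


def query(records, organism, motif, min_tm):
    """Return names of genes matching the organism, motif, and transmembrane threshold."""
    return [
        r.gene
        for r in records
        if r.organism == organism and motif in r.motifs and r.tm_domains >= min_tm
    ]


# Made-up placeholder rows, not real annotation data.
records = [
    GeneAnnotation("geneA", "Marek's Disease virus", {"tyrosine phosphorylation"}, 3),
    GeneAnnotation("geneB", "Marek's Disease virus", {"tyrosine phosphorylation"}, 1),
    GeneAnnotation("geneC", "Marek's Disease virus", {"N-glycosylation"}, 4),
    GeneAnnotation("geneD", "Other virus", {"tyrosine phosphorylation"}, 2),
]

matches = query(records, "Marek's Disease virus", "tyrosine phosphorylation", 2)
print(matches)
```

Only `geneA` satisfies all three conditions in this toy data set; the other rows each fail exactly one condition.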

How does this help?

Automates collection of information from various primary source databases
  If the info changes, can be updated automatically. PI can be notified.
Allows various analyses to be done automatically
  Can encode complex (contingent) sequences of info retrieval and linked analyses, report interesting results only
New data sources, annotation, analyses can be applied as they are developed, automatically (open system)
Made available on internet to others, or private data
Much more sophisticated queries than keyword search
  Dynamic menu of keys
  Concept hierarchies (“ontology”) allow more concise queries
  Query planning (e.g., time, resource usage)
Can search across multiple databases (i.e., from other researchers)

How does it work?

[Figure: RETSINA-style Multi-Agent Organization]
Interface Agents: Sequence Addition Applet, User Query Applet
Information Extraction Agents: GenBank Info Extraction Agent, ProDomain Info Extraction Agent, SwissProt/ProSite Info Extraction Agent, Psort Analysis Wrapper
Local Knowledgebase Management Agent (with Local Knowledgebase)
Task Agents: Annotation Agent, Sequence Source Processing Agent, Query Processing Agent
Domain-Independent Task Agents: Matchmaker Agent, Agent Name Server Agent, Proxy Agent

DECAF: A multi-agent system toolkit

Focus on programming agents, not designing internal architecture
Programming at the multi-agent level
Value-added architecture
Support for persistent, flexible, robust actions

Focus on programming agents, not designing internal architecture

Avoiding the API approach
  DECAF as agent “operating system”, programmers have strictly limited access
  Communication, planning, scheduling, [coordination], execution
Graphical dataflow plan editor

DECAF Programming at the multi-agent level

Standardized, domain-independent, reusable “middle agents”
  Agent Name Server (white pages)
  Matchmaker (yellow pages/directory service)
  Brokers (managers)
  Information extraction (learning [STALKER] + knowledgebase [PARKA])
  Proxy (web interfaces)
  [Agent Management Agent (debugging, demos, external control)]

Note: heterogeneous architectures are OK!
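The white pages / yellow pages distinction between the Agent Name Server and the Matchmaker can be illustrated with two small registries. This is a sketch only: the class and method names, the agent names, and the addresses are hypothetical, and it does not reproduce DECAF's actual interfaces or its KQML/FIPA message formats.

```python
class AgentNameServer:
    """White pages: maps an agent's name to its network address."""

    def __init__(self):
        self._addresses = {}

    def register(self, name, address):
        self._addresses[name] = address

    def unregister(self, name):
        self._addresses.pop(name, None)

    def lookup(self, name):
        return self._addresses.get(name)


class Matchmaker:
    """Yellow pages: maps an advertised capability to the agents that offer it."""

    def __init__(self):
        self._providers = {}

    def advertise(self, name, capability):
        self._providers.setdefault(capability, []).append(name)

    def find(self, capability):
        return list(self._providers.get(capability, []))


# Hypothetical agents, capabilities, and addresses, for illustration only.
ans = AgentNameServer()
mm = Matchmaker()
ans.register("GenBankIEA", ("localhost", 9001))
ans.register("PsortWrapper", ("localhost", 9002))
mm.advertise("GenBankIEA", "sequence-record")
mm.advertise("PsortWrapper", "subcellular-location")

# A requester asks the Matchmaker who can do the job, then the ANS where they are.
providers = mm.find("subcellular-location")
addresses = [ans.lookup(name) for name in providers]
print(providers, addresses)
```

Keeping the two lookups separate means a requester only needs to know what service it wants, not which agent provides it or where that agent currently runs.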

Value-added architecture

Taking care of details (social/individual)
  ANS registration/dereg (eventually MM)
  Standard behaviors (AMA, error, FIPA, libraries)
  Message dispatching (ontology, conversation)
  Coordination (GPGP)
Efficient use of computational resources
  Highly threaded: internally + domain actions
  Memory efficient (ran systems for weeks, hundreds of thousands of messages)

Support for persistent, flexible, robust actions

HTN-style programming
  Task alternatives and contingencies
RETSINA-style dataflow
  Provisions/Parameters determine task activation
  Multiple outcomes, Loops
TÆMS-style task network annotations
  Dynamic overall utility: Quality, cost, duration task characteristics
  Explicit representation of non-local tasks
  Example: Time/Quality tradeoff
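The dataflow idea, in which an action becomes runnable once its provisions are filled and its outcome decides which downstream action is fed, can be sketched as follows. This is a simplified illustration under stated assumptions: the `Action` class, the `execute` loop, and the `fake_search` stand-in are hypothetical, each action runs at most once (so loops are not modeled), and none of it is DECAF's plan-file format or scheduler.

```python
class Action:
    """A plan step that becomes runnable once all of its provisions are filled."""

    def __init__(self, name, provisions, run):
        self.name = name
        self.needed = set(provisions)
        self.values = {}
        self.run = run  # callable(values) -> (outcome_name, result)

    def provide(self, provision, value):
        if provision in self.needed:
            self.values[provision] = value

    def ready(self):
        return self.needed == set(self.values)


def execute(actions, links, initial):
    """Run actions as provisions arrive.

    links maps (action_name, outcome) to a list of (target_action, provision).
    initial is a list of (action_name, provision, value) triples.
    """
    by_name = {a.name: a for a in actions}
    for target, provision, value in initial:
        by_name[target].provide(provision, value)
    trace, pending, progressed = [], list(actions), True
    while progressed:
        progressed = False
        for action in list(pending):
            if action.ready():
                pending.remove(action)
                outcome, result = action.run(action.values)
                trace.append((action.name, outcome))
                # The outcome selects which downstream provisions get filled.
                for target, provision in links.get((action.name, outcome), []):
                    by_name[target].provide(provision, result)
                progressed = True
    return trace


def fake_search(values):
    """Stand-in for a real homology search; the rule here is arbitrary."""
    seq = values["sequence"]
    return ("hit", "homolog-of-" + seq) if seq.startswith("ATG") else ("no_hit", seq)


def make_actions():
    return [
        Action("search", ["sequence"], fake_search),
        Action("annotate", ["homolog"], lambda v: ("ok", v["homolog"])),
        Action("flag_novel", ["sequence"], lambda v: ("ok", v["sequence"])),
    ]


links = {
    ("search", "hit"): [("annotate", "homolog")],
    ("search", "no_hit"): [("flag_novel", "sequence")],
}
trace = execute(make_actions(), links, [("search", "sequence", "ATGCCC")])
print(trace)
```

Because `search` reports the `hit` outcome for this input, only `annotate` receives its provision; `flag_novel` never becomes ready and is never run.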

DECAF Architecture

[Figure: DECAF agent architecture]
Inputs: Plan file, Incoming KQML/FIPA messages
Outputs: Outgoing KQML/FIPA messages
Modules: Agent Initialization; Dispatcher, Planner, Scheduler, Executor [concurrent]
Queues and stores: Incoming Message Queue, Objectives Queue, Task Queue, Agenda Queue, Pending Action Queue, Action Results Queue, Task Templates Hash Table, Domain Facts and Beliefs
Action Modules
Plan Editor

Expanding the Genomic Annotation System

[Figure: expanded BioMAS organization]
Suborganizations: Basic Sequence Annotation, Functional Annotation, Query, EST Processing
Applets: Functional Annotation Applet, User Query Applet, Sequence Addition Applet, EST Entry [Chromatograph/FASTA]
Agents: Proxy Agents, Annotation Agent, Sequence Source Processing Agent, Query Processing Agent, Ontology Agent, Ontology Reasoning Agent, Consensus Sequence, Chromatograph Processing, SNP-Finder
Knowledgebase management: Sequence LKBMA, EST LKBMA
Information extraction agents (IEA): GenBank, Mouse Genome DB, SGD (yeast), Flybase, SwissProt/ProSite, PSort, ProDomain

Functional Annotation Suborganization

Gene Ontology Consortium (www.geneontology.org)
• Biological process
• Molecular function
• Cellular component

Co-present Domain Networks (CoPrDom)

Proteins can be viewed as conserved sets of domains
Vertex = domain, edge = co-present in some protein, edge weight = # of proteins in which the two domains are co-present
Network constructed from InterPro domain markup of proteins in 10 species (human, Drosophila, C. elegans, and S. cerevisiae among them)
Functional characterization via InterPro to GO mapping
Network constructed per organism per functional group, e.g., apoptosis regulation in human
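The network definition above (vertex = domain, edge = co-presence, weight = number of proteins sharing both domains) can be sketched in a few lines of Python. The function name `build_coprdom_network` and the protein and domain identifiers are placeholders chosen for illustration; real input would come from InterPro domain markup, which is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def build_coprdom_network(protein_domains):
    """Return {(domain_a, domain_b): weight} from a protein -> domains mapping.

    The weight of an edge is the number of proteins in which both domains occur.
    """
    weights = Counter()
    for domains in protein_domains.values():
        # Sort so each unordered pair always appears with the same orientation.
        for a, b in combinations(sorted(set(domains)), 2):
            weights[(a, b)] += 1
    return weights


# Placeholder identifiers, not real InterPro data.
proteins = {
    "P1": ["D1", "D2", "D3"],
    "P2": ["D1", "D2"],
    "P3": ["D3", "D1"],
    "P4": ["D4"],  # single-domain protein: contributes no edges
}

network = build_coprdom_network(proteins)
for edge, weight in sorted(network.items()):
    print(edge, weight)
```

Building one such network per organism and per functional group would then only require filtering the `proteins` mapping before calling the function.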

Uses for CoPrDom

Functional characterization of unknown domains
Identification of core domains/groups in a functional group
Tracking domain evolution through species evolution
Predicting protein-protein interaction by identifying evolutionary merging of domain groups

Biological Pathway Discovery thru AI Planning Techniques

AI planning is a computational method for developing complex plans of action from a representation of the initial state, the actions that manipulate that state, and the goal states to be achieved.

Initial States: The initial state representation of objects in the "plan world"

Actions: Logical descriptions of preconditions and effects

Goals: The end states desired

HTN (Hierarchical Task Network) Planning proceeds by decomposition of task networks, and a successful plan is one that satisfies a task network.
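Task decomposition can be illustrated with a toy expander that rewrites a compound task into primitive actions, trying alternative methods in order. This is a deliberately reduced sketch: the task names are hypothetical labels, and unlike a real HTN planner it tracks no world state, preconditions, or ordering constraints.

```python
def decompose(task, methods, primitives):
    """Expand a task into primitive actions using the first method that fully expands.

    methods maps a compound task to a list of alternative subtask lists.
    Returns None when no alternative can be expanded down to primitives.
    """
    if task in primitives:
        return [task]
    for subtasks in methods.get(task, []):
        plan = []
        for sub in subtasks:
            expanded = decompose(sub, methods, primitives)
            if expanded is None:
                break  # this alternative fails; try the next one
            plan.extend(expanded)
        else:
            return plan
    return None


# Hypothetical task names, for illustration only.
primitives = {"bind_ligand", "phosphorylate_receptor", "bind_adaptor", "exchange_gdp_gtp"}
methods = {
    "activate_ras": [["bind_ligand", "recruit_adaptor", "exchange_gdp_gtp"]],
    "recruit_adaptor": [
        ["use_unmodeled_adaptor"],  # no method and not primitive: fails
        ["phosphorylate_receptor", "bind_adaptor"],
    ],
}

plan = decompose("activate_ras", methods, primitives)
print(plan)
```

The first alternative for `recruit_adaptor` cannot be expanded, so the expander falls back to the second, which is how task alternatives reduce to a concrete action sequence.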

Uses of the Signal Transduction Planner

To produce computer interpretable plans capturing relevant qualitative information regarding signal transduction pathways.

To produce testable hypotheses regarding gaps in knowledge of the pathway, and drive future signal transduction research in an ordered manner.

To identify key nodes: points where many pathways are regulated through a single functional protein that serves as a critical checkpoint.

To perform in silico experiments of hyper expression and deletion mutation.

To enable pathway visualization tools by providing human- and machine-readable pathway descriptions.

Advantages of Planning

Operator schema: Abstracted axiomatic definitions of sub-cellular processes, understandable to human + computer

Task abstraction: Decomposition of complex task into simpler, interchangeable actions.
  Reduces search space, conflicts
  Modeling of pathways at different levels of biochemical detail

Search conducted in Plan Space: Most planners perform bi-directional search (vs. Pathway Tools, Prolog implementations, etc.)

Partial-order Planning: Succinct representation of multiple pathways helps identify key causal relationships

Advantages of Planning (contd.)

Conditional effects can be used to model special cases ("exceptions") when applying operator schema

Resource Utilization can be used to model quantitative aspects such as amplification of a signal, feedback and feed-forward loops

Plan re-use: Old plans can be successfully inserted into new ones (if initial and final conditions are met) without additional computation

(ontologically driven) Operator Schema Example: Transport

(:action transport
  :parameters (?mol - macromolecule
               ?compfrom ?compto - compartment)
  :precondition (and (in ?mol ?compfrom)
                     (open ?compfrom ?compto))
  :effect (and (in ?mol ?compto)
               (not (in ?mol ?compfrom))))
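To make the semantics of the transport schema concrete, the sketch below applies it to a state represented as a set of ground facts, in the usual STRIPS style: check the preconditions, then delete and add facts. The function name, the fact encoding as tuples, and the example molecule and compartments are assumptions made for illustration; this is not the planner used in the talk.

```python
def transport(state, mol, comp_from, comp_to):
    """Apply the transport operator; return the new state, or None if inapplicable."""
    preconditions = {("in", mol, comp_from), ("open", comp_from, comp_to)}
    if not preconditions <= state:
        return None
    # Effects: the molecule leaves the source compartment and enters the target.
    return (state - {("in", mol, comp_from)}) | {("in", mol, comp_to)}


# Hypothetical initial state, for illustration only.
state = frozenset({
    ("in", "molX", "nucleus"),
    ("open", "nucleus", "cytoplasm"),
})

after = transport(state, "molX", "nucleus", "cytoplasm")
print(sorted(after))

# Applying the operator again fails: molX is no longer in the nucleus.
print(transport(after, "molX", "nucleus", "cytoplasm"))
```

Returning a new set rather than mutating the old one keeps earlier states intact, which is convenient when a planner needs to backtrack.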

RTK-MAPK pathway

Activation of Ras following binding of a hormone (e.g., EGF) to a receptor

RTK-MAPK pathway step: O-Plan Output

Phosphorylation of GRB2 at domain Sh2 by the RTK receptor

Summary

Bioinformatics has many features amenable to a multi-agent information gathering approach

BioMAS: Automated Analysis: EST processing to functional annotation ontologies

DECAF / RETSINA / TÆMS

GOFigure! and electronic GO annotation
CoPrDom: Co-Present Domain Analysis
Signal Transduction Pathway Discovery

BioMAS Future Work

Sophisticated queries are possible, but how to make them available to biologists?
  “Show me all glycoproteins in Marek’s Disease virus with a tyrosine phosphorylation motif and a transmembrane domain value ≥ 2 that are expressed in feather follicles”
Robustness, efficiency, scale, data materialization issues
Automating and integrating more complex analysis processes (using existing software!)
  Estimating physical location of genes by synteny
Integrate new data sources
  Microarray and other gene expression data
  And thus, more analyses: QTL mapping, metabolic pathway learning
New off-site organism databases and analysis agents

http://www.cis.udel.edu/~decaf/
http://udgenome.ags.udel.edu/
