Célia Ghedini Ralha Department of Computer Science
University of Brasília [email protected] - www.cic.unb.br/~ghedini/
Agenda
Research Project Overview
◦ Problem
◦ Hypothesis
◦ Results - BioAgents
Current Challenge
◦ ncRNA classification
◦ Methodology
◦ ncRNA-Agents - Architecture & Prototype
Conclusions & Future Work
Brazilian Program Science Without Borders
Personal Introduction
PhD thesis (1993-96)
◦ Prof. Anthony G. Cohn, Professor of Automated Reasoning, University of Leeds, England
◦ QSR Group - Applied Region Connection Calculus to Web information
◦ Title: A Framework for Dynamic Structuring of Information
Academic career (2002)
◦ Associate Professor - Intelligent Information Systems, Computer Science Department - University of Brasília, Brazil
◦ Research Group Leader: InfoKnow - Computer Systems for Information and Knowledge Treatment - registered with Brazil’s National Council for Scientific & Technological Development (CNPq) http://dgp.cnpq.br/buscaoperacional/detalhegrupo.jsp?grupo=024010360AHR2C
◦ member of the Communication Network Laboratory (COMNET) (http://comnet.cic.unb.br/)
◦ research focus – information & knowledge treatment, agent-based modeling, agent simulation
Senior internship (Sept 2012-Mar 2013), Capes funding
◦ Prof. Gerd Wagner, Chair of Internet Technology, Brandenburgische Technische Universität in Cottbus, Germany
◦ Agent-Object-Relationship (AOR) Modeling Language
◦ scientific director of Simurena - startup company specialized in web-based simulation and games
Research Problem
Challenge – the enormous volumes of DNA & RNA sequences of organisms continuously being discovered by genome projects around the world
Annotation - a key activity:
1. automatic: executed by computer programs
2. manual: done by biologists, who use the results of the automatic annotation together with their knowledge and experience to predict the function of each DNA sequence
The Hypothesis
The definition and implementation of an annotation system based on a multi-agent approach can help during the complex annotation process
Why use a multi-agent approach?
adequate/direct way to implement agent-based models (ABM)
agent-based modeling is a natural metaphor to represent the interaction of human agents in a real environment
reasoning & knowledge treatment
ABM is a vital technique for studying Complex Adaptive Systems (CAS), as evidenced by the growing body of literature spanning disciplines from Biology and Social Sciences to Computer Science
◦ e.g. the call for papers for the inaugural special issue of Springer's Complex Adaptive Systems Modeling (CASM), which seeks key papers documenting multidisciplinary methods & applications for ABM of CAS (submission due: November 30, 2012)
◦ papers documenting successful ABM methodologies & application case studies in areas including: Life sciences (such as Ecology, Biology, Biochemistry, Cancer and Epidemiology); Social sciences; Economics; Cloud computing; Multi-agent systems; Verification, validation, and accreditation of agent-based models; Methods for development and analysis of agent-based models
Research Project
Agent-based project (2006)
• Multi-agent System for Manual Annotation on Genome Sequencing Projects - BioAgents
Collaboration
• Institute of Biological Sciences – University of Brasília: Prof. Marcelo M. Brígido
• Bioinformatics Group – Universität Leipzig: Prof. Peter Stadler
BIOFOCO III - Finep/MCTI (2008-2012)
• Goal: Development of software for genomic analysis in a cooperative and distributed computational environment in the Midwest Region of Brazil (http://www.biofoco.org/biofoco3/)
• Institutions: UnB, UFG, UFMS, Embrapa
• Sub-Project: GENOALGO
Research Project (Cont.)
Students
◦ Bachelor
  Hugo W. Schneider & Anderson G. Frazzon (Dec 2006) – Multi-agent System Prototype for Manual Annotation in Genome Sequencing Projects (Sanger Sequencer)
  Daniel S. Souza (Nov 2012) – A Multi-agent Tool for Biological Sequence Annotation
◦ Master Degree
  Richardson S. Lima (Jun 2007) – BioAgents: MAS for manual annotation in genome sequencing projects
  Hugo W. Schneider (Dec 2010) – A Reinforcement Learning Method for BioAgents
◦ Registered for PhD
  Wosley C. Arruda (started 2010) & Hugo W. Schneider (started 2012)
Publications
1. RALHA, Célia Ghedini; SCHNEIDER, H. W.; WALTER, M. E. T.; BRIGIDO, M. M. A Multi-agent Tool to Annotate Biological Sequences. ICAART (2) 2011: 226-231.
2. RALHA, Célia Ghedini; SCHNEIDER, H. W.; WALTER, M. E. T.; BAZZAN, A. L. C. Reinforcement Learning Method for BioAgents. SBRN 2010: 109-114.
3. RALHA, Célia Ghedini; SCHNEIDER, H. W.; FONSECA, L. O.; WALTER, M. E. T.; BRIGIDO, M. M. Using BioAgents for Supporting Manual Annotation on Genome Sequencing Projects. BSB 2008: 127-139.
4. LIMA, R. S.; RALHA, Célia Ghedini; WALTER, M. E. T.; SCHNEIDER, H. W.; PEREIRA, A. G. F.; BRIGIDO, M. M. BioAgents: Um Sistema Multi-agente para Anotação Manual em Projetos de Seqüênciamento de Genomas. ENIA 2007: 1302-1310.
5. International Workshop on Genomic Databases (IWGD), 2007 & 2005
BioAgents – 1st Version (2006)
Sanger technology - thousands of sequences (1990)
◦ submission: biologists send sequences to be processed on computers - graphics transformed into strings
◦ assembly: sequences are assembled to reconstruct the original fragment
◦ annotation:
1. automatic: computational programs (e.g. BLAST, BLAT) infer a biological function for each sequence by comparing it with sequences previously stored in public databases (e.g. GenBank, SwissProt, TrEMBL)
2. manual: biologists guarantee the accuracy and correctness of each sequence function - using their knowledge to analyze and correct the function suggested by automatic annotation
Computational Pipeline
Architecture (1st Version - 2006)
BioAgents – 1st Version – Case Studies
Paullinia cupana (Guaraná fruit)
Paracoccidioides brasiliensis (Pb fungus)
Anaplasma marginale (rickettsia)
BioAgents – 2nd Version (2010)
High-throughput technology - millions of sequences
◦ 454/Roche: 1/1.5 million sequences/run (200/600 bp length)
◦ Illumina/Solexa: 15/20 million sequences/run (30/100 bp length)
◦ SOLiD/ABI: > 2 million sequences/run (35/75 bp length)
submission: short sequences generated by automatic sequencers are "ready" to be processed
mapping: short sequences are mapped onto a reference genome, since it would be almost impossible to assemble them directly (resequencing)
assembly: de novo sequencing, or assembly of nearby sequences previously mapped
automatic annotation: done by computers
Pipeline for high-throughput technology
BioAgents – 2nd Version
New sequencing technologies produce billions of bases in a short time
New bioinformatics challenges - store & analyze the data volume
AI can help with many techniques - Machine Learning (ML)
Main problems when applying ML methods to the annotation scenario:
◦ huge amount of data
◦ lack of examples for training purposes (specificity of each organism)
◦ choice of method: supervised (statistical classification, Bayesian), unsupervised (association rule, clustering), reinforcement learning (RL)
◦ Q-learning (RL) learns an action-value function that gives the expected utility of an action in a given state when following a fixed policy, Q: S × A → ℝ (open question: how to define the set of actions (A) per state (S)?)
Didn’t try: (i) delayed Q-learning, which brings probably approximately correct learning bounds to Markov decision processes; (ii) learning automata; (iii) temporal difference learning
Work Objective - propose & implement a reinforcement learning method for BioAgents
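As a rough illustration of the action-value function Q: S × A → ℝ mentioned above, the following is a minimal sketch of the standard tabular Q-learning update. It is not the method implemented in BioAgents: the state names, action names, reward value, and the parameters `alpha` (learning rate) and `gamma` (discount factor) are all hypothetical and chosen only for the example.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning step and return the new Q(state, action).

    Update rule: Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    # Best value reachable from the next state under the current estimates.
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

def greedy_action(q, state, actions):
    """Return the action with the highest current Q-value in `state`."""
    return max(actions, key=lambda a: q[(state, a)])

# Hypothetical annotation setting: decide whether to accept a suggested function.
ACTIONS = ["accept_suggestion", "reject_suggestion"]
q_table = defaultdict(float)  # unseen (state, action) pairs start at 0.0

# Assume a biologist confirmed the suggestion, so the agent receives reward 1.0.
new_value = q_update(q_table, "high_score_hit", "accept_suggestion",
                     reward=1.0, next_state="annotated", actions=ACTIONS)
best = greedy_action(q_table, "high_score_hit", ACTIONS)
```

The open question on the slide (how to define A per state S) shows up here as the `actions` argument: in this sketch the same action set is assumed for every state, which a real annotation agent would likely need to refine.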
Architecture (2nd Version - 2010)
2nd Version - Results
Architecture (3rd Version for Web - 2012)
The focus of BioAgents in this version is to predict a protein function for DNA or RNA sequences in genome sequencing projects
The New Challenge
Motivation
◦ The classification of non-coding RNA (ncRNA) is a big challenge!!
◦ Experts need lots of knowledge & reasoning to classify ncRNA
Problem
◦ There are many tools and databases that help to identify and annotate ncRNAs, applying different techniques, e.g., BLAST, INFERNAL, tRNAscan-SE, SVM-Portrait, Vienna, NONCODE, RNAdb, miRBase, snoRNA Database, snoRNA for Plants, fRNAdb, Rfam…
◦ But there is no software that can recommend an annotation from the results of all these tools together!!
Methodology – three different approaches
Homology
◦ Alignment of Pairs: snoRNA, RNAdb, NONCODE, miRBase, Plant snoRNA
◦ Multiple Alignment: Infernal, tRNAscan-SE
Class prediction
◦ SVM-Portrait (Supervised Learning)
De novo
◦ Vienna Package: RNAfold
◦ RNAmmer
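To make the idea of recommending an annotation from several tools concrete, here is a minimal sketch of one possible baseline: a simple majority vote over the class each tool predicts. This is an assumption for illustration only and not the conflict resolution mechanism of ncRNA-Agents. The function name `recommend_class`, the dictionary layout, and the example predictions are hypothetical.

```python
from collections import Counter

def recommend_class(predictions):
    """Recommend an ncRNA class by majority vote.

    `predictions` maps a tool name to the class it predicted,
    or to None when the tool reported no hit.
    Returns (label, support), where support is the fraction of
    voting tools that agree with the recommended label.
    """
    votes = Counter(label for label in predictions.values() if label is not None)
    if not votes:
        return None, 0.0  # no tool produced a prediction
    label, count = votes.most_common(1)[0]
    return label, count / sum(votes.values())

# Hypothetical results for a single candidate sequence.
example = {
    "Infernal": "snoRNA",
    "tRNAscan-SE": None,
    "BLAST vs. NONCODE": "snoRNA",
    "SVM-Portrait": "miRNA",
}
label, support = recommend_class(example)
```

A plain vote treats all tools as equally reliable. Weighting each tool by approach (homology, class prediction, de novo) or by alignment score would be a natural refinement, at the cost of having to choose those weights.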
Architecture ncRNA-Agents
Prototype
Conclusions & Future Work
What we have done so far:
◦ Defined and implemented three versions of BioAgents (annotation, protein)
◦ Studied the new challenge of ncRNA detection to define an agent-based model
◦ Implemented the ncRNA-Agents prototype using a multi-agent approach
What is missing?
◦ Define the “Seeker” Agent reasoning: WebBlast, snoRNA, miRNA
◦ Define the conflict resolution mechanism (most challenging part!)
◦ Improve the rationality of agents: other formal logics for MAS (rule-based reasoning), machine learning, data mining
◦ Validate ncRNA-Agents with a real genome project (e.g., Pb fungus – wet lab)
We believe the Leipzig Bioinformatics Group can help with this new challenge!!!
Brazilian Program Science Without Borders
What is it?
◦ A large-scale nationwide scholarship program primarily funded by the Brazilian federal government to strengthen and expand the initiatives of science and technology, innovation and competitiveness through international mobility of undergraduate and graduate students and researchers. The program also stimulates the visit of highly qualified young researchers and senior visiting professors to Brazil.
Primary Goal:
◦ Qualify 75 thousand Brazilian students and researchers in top-ranked universities worldwide by 2014
Types of Scholarships:
◦ Undergraduate study abroad
◦ Full-Time PhD or PhD internships abroad
◦ Postdoc
◦ Professional Education
◦ Senior Fellowships
◦ Visiting Researchers/Scholars
Fields and Sectors of Interest:
◦ Chemistry, Biology, Engineering, Computing & Information Technology, etc…
Célia Ghedini Ralha [email protected]