12/16/2002
Nik Kasabov - Evolving Connectionist Systems
Chapter 8: Data Analysis, Modelling and Knowledge Discovery in Bioinformatics
Prof. Nik Kasabov
http://www.kedri.info
Overview
• Bioinformatics - an area of information growth and emergence of knowledge
• Dynamic DNA and RNA sequence data analysis and knowledge discovery
• Gene expression data analysis, rule extraction, and disease profiling
• Fuzzy evolving clustering of genes according to their time-course expression
• Protein secondary structure prediction
• Dynamic cell modelling
Biology Basics
• DNA (deoxyribonucleic acid) is a chemical chain, present in the nucleus of each cell of an organism
• The whole process of DNA transcription, gene translation, and protein production is continuous and evolves over time
• RNA (ribonucleic acid) has a structure similar to that of DNA, but uses ribose as its sugar and the base uracil (U) in place of thymine (T)
• Genes are complex chemical structures that cause dynamic transformation of one substance into another during the whole life of an individual, as well as over the life of the human population across many generations
• Modelling these interactions, learning about them, and extracting knowledge is a major goal of bioinformatics
Bioinformatics
• The first draft of the human genome is complete; the challenge now is to process the vast amount of dynamic information and to create intelligent systems for prediction and knowledge discovery at different levels of life, from cells to whole organisms and species.
• Bioinformatics is concerned with applying the methods of the information sciences to the analysis, modelling, and knowledge discovery of biological processes in living organisms.
Bioinformatics
• The central dogma of molecular biology states that DNA is transcribed into RNA, which is translated into proteins.
[Fig 8.1: A schematic representation of the central dogma of molecular biology: from DNA to RNA (transcription) and from RNA to proteins (translation).]
Life-long Learning & Evolution
• Through evolution, genes are slowly modified over many generations of populations of individuals by selection processes (e.g. natural selection).
• Evolutionary processes imply the development of generations of populations of individuals, where crossover, mutation, and selection of individuals based on fitness criteria are applied in addition to the learning processes of each individual.
• A biological system evolves its structure and functionality through both the life-long learning of an individual and the evolution of populations of many such individuals.
Computational Modelling in Molecular Biology
• There are five main phases of information processing and problem solving in most bioinformatics systems:
1. Data collection, e.g. collecting biological samples and processing them
2. Feature analysis and feature extraction
3. Modelling the problem
4. Knowledge discovery in silico
5. Verifying the discovered knowledge in vitro and in vivo
Computational Modelling in Molecular Biology
• Some of the modelling techniques (decision trees, KBNN) allow knowledge, e.g. rules, to be extracted from the models and used for explanation or for knowledge discovery.
• For large data sets and for continuously incoming data streams that require the model and the system to adapt rapidly to new data, it is more appropriate to use on-line, knowledge-based techniques, and ECOS in particular, as demonstrated in this chapter.
• Many problems in bioinformatics require solutions in the form of a dynamic, learning, knowledge-based system
• An ultimate task for bioinformatics would be predicting the development of an organism from its DNA code
Dynamic DNA & RNA Sequence Analysis
• Analysing a DNA sequence and identifying promoter regions
• Identifying splice junctions: exon/intron (E/I), intron/exon (I/E), or none
On-line learning of ribosome binding site data (Fig 8.3)
[Figure: two panels over a horizontal axis running from 0 to 1600. Top panel: desired and actual output (vertical axis -0.5 to 1). Bottom panel: number of rule nodes (vertical axis 0 to 100).]
Identify intron/exon splice junction
EXTRACTION OF RULES:
Rule1: if ----------------------------AGGT-AG------------------------- then [EI]
Rule8: if ------------------T------T-CAG------------------------------ then [IE]
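The extracted rules above read as position-wise templates: a letter must match the base at that position, and "-" places no constraint on it. A minimal sketch of applying such templates to a sequence window follows. The two patterns are shortened, hypothetical stand-ins for Rule1 and Rule8 (not the full-length rules), and the function names are illustrative.

```python
# Shortened, hypothetical stand-ins for the extracted rules; "-" means "any base".
RULES = [
    ("----AGGT-AG----", "EI"),  # exon/intron boundary template
    ("--T--T-CAG-----", "IE"),  # intron/exon boundary template
]


def matches(pattern: str, window: str) -> bool:
    """True if every non-wildcard position of the pattern equals the window base."""
    if len(pattern) != len(window):
        return False
    return all(p == "-" or p == w for p, w in zip(pattern, window))


def classify(window: str, rules=RULES, default: str = "None") -> str:
    """Return the label of the first rule whose template matches the window."""
    for pattern, label in rules:
        if matches(pattern, window):
            return label
    return default


if __name__ == "__main__":
    for window in ("CCCCAGGTAAGCCCC", "GGTCCTCCAGGGGGG", "AAAAAAAAAAAAAAA"):
        print(window, classify(window))
```

Crisp matching is used here for brevity; rules extracted from a fuzzy model would normally carry membership degrees as well.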
Gene Expression Data: Biological Perspective
• Microarray equipment is widely used at present to evaluate the level of gene expression in a tissue or in a living cell.
• Each point (pixel, cell) in a microarray represents the level of expression of a single gene.
• Because diseases are heterogeneous, microarray analysis might not identify a unique marker (e.g. a single gene) of clinical utility, but predicting the biological state of a disease is likely to be more sensitive when based on clusters of gene expression (profiles).
• Gene expression clustering has been used to distinguish normal colon samples from tumours within a 6,500-gene set.
• Another example of profiling developed in this chapter is the distinction between two subtypes of leukaemia, namely AML and ALL.
Gene Expression Data Analysis
• A gene profile is a pattern of expression of a number of genes that is typical for all, or for some, of the known samples of a particular disease.
• A disease profile would look like:
» IF (gene g1 is highly expressed) AND (gene g37 is low expressed) AND (gene g134 is very highly expressed) THEN most probably this is cancer type C (123 out of the 130 available samples have this profile)
• A patient's profile can be matched against existing gene profiles and, based on similarity, it can be predicted with a certain probability whether the patient is in an early phase of a disease or at risk of developing the disease in the future.
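A profile of this kind can be evaluated as a fuzzy rule: each condition yields a membership degree, and the degrees are combined with a fuzzy AND. The sketch below is only an illustration under stated assumptions: expression values are taken to be normalised to [0, 1], the membership functions are simple linear ramps chosen for the example, "very high" is modelled as the square of "high", and AND is taken as the minimum. None of these choices comes from the slides.

```python
def clip01(v: float) -> float:
    """Clamp a value to the interval [0, 1]."""
    return max(0.0, min(1.0, v))


# Assumed membership functions over normalised expression in [0, 1].
def low(x: float) -> float:
    return clip01(1.0 - 2.0 * x)   # 1 at x = 0, falls to 0 at x = 0.5


def high(x: float) -> float:
    return clip01(2.0 * x - 1.0)   # 0 at x = 0.5, rises to 1 at x = 1


def very_high(x: float) -> float:
    return high(x) ** 2            # "very" modelled as squaring (an assumption)


# The example profile: g1 high AND g37 low AND g134 very high.
PROFILE_C = [("g1", high), ("g37", low), ("g134", very_high)]


def match_degree(sample: dict, profile) -> float:
    """Degree to which a sample matches a profile, using min as the fuzzy AND."""
    return min(membership(sample[gene]) for gene, membership in profile)


if __name__ == "__main__":
    patient = {"g1": 0.9, "g37": 0.1, "g134": 0.95}
    print(match_degree(patient, PROFILE_C))
```

The resulting degree is a similarity score between a sample and a profile, not a calibrated probability.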
Gene expression data analysis, modelling and knowledge discovery
• Goal: identify a gene or a group of genes associated with the state of the cell (tissue), e.g. cancer.
• A large number of genes (approx. 30,000) are expressed in a microarray (in vitro) from a single tissue.
• It is difficult to find consistent patterns of gene expression for a class of tissue.
• After all, microarray data is just a snapshot, a few microseconds long, of what is happening in the cell.
• Genes interact: how do we find out about that?
• Growing number of examples and complexity.
Fuzzy representation of gene expression data
Gene Profiling Methodology
• Phases:
1. Microarray data pre-processing.
2. Selecting a set of significant differentially expressed genes across the classes.
3. Finding subsets of (a) under-expressed genes and (b) over-expressed genes among those selected in the previous step.
4. Clustering the gene sets from (3) to reveal preliminary profiles of jointly over-expressed/under-expressed genes across the classes.
5. Building a classification model and extracting rules that define the profiles for each class.
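Phases 2 and 3 can be sketched as a ranking step followed by a split on the sign of the score. The example below uses a signal-to-noise score, (mean_A - mean_B) / (sd_A + sd_B), which is one common choice for two-class microarray data; the slides do not say which statistic is used, so treat the score, the function names, and the toy data as assumptions.

```python
from statistics import mean, stdev


def signal_to_noise(a, b):
    """Signal-to-noise score between two groups of expression values.

    Assumes each group has at least two samples and non-zero spread.
    """
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def select_genes(expr, labels, top_k):
    """Rank genes by |score| and split the top_k into over- and under-expressed.

    expr:   dict mapping gene name -> list of expression values, one per sample
    labels: list of 1 (class A) / 0 (class B), aligned with the samples
    """
    scores = {}
    for gene, values in expr.items():
        class_a = [v for v, y in zip(values, labels) if y == 1]
        class_b = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = signal_to_noise(class_a, class_b)
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:top_k]
    over = [g for g in ranked if scores[g] > 0]    # higher in class A
    under = [g for g in ranked if scores[g] < 0]   # lower in class A
    return over, under


if __name__ == "__main__":
    # Toy data: three hypothetical genes, six samples (three per class).
    LABELS = [1, 1, 1, 0, 0, 0]
    EXPR = {
        "gA": [9, 10, 11, 1, 2, 3],
        "gB": [1, 2, 3, 9, 10, 11],
        "gC": [5, 6, 4, 5, 4, 6],
    }
    print(select_genes(EXPR, LABELS, top_k=2))
```

The two returned lists correspond to subsets (b) and (a) of phase 3 and would feed the clustering in phase 4.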
Case Study: Gene Profiling of Colon Cancer using EFuNN
• Rule 1: IF M24902 (High 0.988) and H13238 (Low 0.991) and H16758 (High 0.995) and X90908 (Low 0.992) and T55255 (Low 0.998) THEN COLON CANCER (High 1.0) (receptive field 0.5; examples explained by the rule 23/40)
• Rule 2: IF T71662 (Low 0.984) and X76383 (High 0.985) and X54938 (Low 0.989) and H88522 (Low 0.987) and H92523 (High 0.989) THEN NORMAL TISSUE (High 1.0) (receptive field 0.19; examples explained by this rule 13/22; thresholds used: 0.98 for the condition membership degrees and 0.95 for the conclusion membership degrees)
• These are two of the 12 extracted rules that reveal some conditions for colon cancer against normal tissue. Each rule represents a sub-class (cluster) of one of the two classes.
Disease Profiling Through Rule Extraction from EFuNN
Rule extraction from EFuNNs:
» Input space restricted to genes with high significance (e.g. 98 genes for the colon cancer data set (Alon et al))
» Rule extraction after learning in an EFuNN
» Rules represent disease profiles
» Proper visualization for a better understanding
Dynamic modeling and knowledge discovery from gene expression data of 14 cancer types
• A continuous flow of data
• An adaptive “mother model” is created and updated over time: new data; new genes; new classes
• At any time, an “optimal simple model” is extracted and analyzed
• Rules are extracted and genes are analyzed
• Example: Ramaswamy’s data (PNAS, January 2002) of 14 types of cancer
• Future work: dynamic modeling of gene interaction networks and cell development prognosis
Using Evolving Self-organising Maps (ESOM) for clustering of time-course gene expression data
• On-line clustering of time-course gene expression data by ESOMs (Da Deng and N. Kasabov, 2002, Neurocomputing)
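The general idea of on-line, evolving clustering can be illustrated with a much simpler scheme than ESOM itself: process the time-course profiles one at a time, create a new prototype node when a profile is far from all existing nodes, and otherwise nudge the nearest node towards it. The sketch below is that simplified scheme only. It omits ESOM's topology (connections between nodes) and neighbourhood updates, and the distance threshold and learning rate are illustrative parameters.

```python
import math


def evolving_cluster(profiles, dist_threshold, learning_rate=0.1):
    """One-pass clustering: grow a node when no existing node is close enough.

    profiles: iterable of equal-length expression profiles (one per gene)
    Returns (nodes, assignments), where assignments[i] is the node index
    chosen for the i-th profile at the time it was presented.
    """
    nodes = []
    assignments = []
    for x in profiles:
        if nodes:
            dists = [math.dist(x, node) for node in nodes]
            best = min(range(len(nodes)), key=dists.__getitem__)
        if not nodes or dists[best] > dist_threshold:
            nodes.append(list(x))                 # evolve: create a new node
            assignments.append(len(nodes) - 1)
        else:
            # Move the winning node a small step towards the new profile.
            nodes[best] = [n + learning_rate * (xi - n)
                           for n, xi in zip(nodes[best], x)]
            assignments.append(best)
    return nodes, assignments


if __name__ == "__main__":
    # Four hypothetical genes measured at three time points.
    genes = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (1.0, 1.0, 1.0), (0.9, 1.0, 1.1)]
    nodes, assignments = evolving_cluster(genes, dist_threshold=0.5)
    print(len(nodes), assignments)
```

Because each profile is seen once, the result depends on presentation order, which is a general property of one-pass clustering schemes.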
Amino Acid codons
[Table 8.6: The codons of each of the 20 amino acids. The first column represents the first base in the triplet, the first row the second base, and the last column the last base.]
Protein Structure Prediction
• The mRNA is translated by ribosomes into proteins
• A protein is a sequence of amino acids, each of them defined by a group of 3 nucleotides (a codon)
• 20 amino acids altogether (A, C-H, I, K-N, P-T, V, W, Y)
• Initiation and stop codons
• Proteins have complex structures:
» Primary (linear)
» Secondary (3D, defining functionality)
» Tertiary (high-level energy minimisation packing)
» Quaternary (interaction between molecules)
• The Protein Data Bank (www.rcsb.org): 100,000 hits a day on average
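The codon-to-amino-acid mapping described above can be sketched in a few lines. The example builds the standard genetic code from the conventional 64-letter amino-acid string (codons ordered U, C, A, G at each position; "*" marks a stop codon), transcribes a DNA string, and translates from the first AUG to the first stop codon. It assumes the DNA input is the coding (sense) strand, so transcription reduces to replacing T with U; the function names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def transcribe(coding_dna: str) -> str:
    """DNA coding strand -> mRNA (assumption: input is the sense strand)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG (initiation) up to the first stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # stop codon: UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCCTGGTAA")
    print(mrna, translate(mrna))
```

The table has 61 codons for the 20 amino acids plus 3 stop codons, which is why several codons map to the same amino acid.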
Protein Structure Prediction
• Predicting the secondary structure from the primary structure
• Segments of a protein can have different shapes:
» Helix
» Sheet
» Coil (loop)
• An ANN is trained on existing data to predict the shape of an arbitrary new segment, using a window of 13 amino acids
• 273 inputs, 3 outputs; 18,000 examples for training
• Research done mainly by Mike Watts in collaboration with Natural Selection Inc., based in La Jolla, California.
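The slide does not say how a 13-residue window becomes 273 inputs. One encoding consistent with that number is a one-hot vector of 21 symbols per position (the 20 amino acids plus one padding symbol for positions beyond the ends of the chain), since 13 × 21 = 273. The sketch below assumes that encoding; it is an illustration, not a description of the network actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 one-letter codes
ALPHABET = AMINO_ACIDS + "-"           # "-" pads positions outside the chain
WINDOW = 13


def encode_window(sequence: str, centre: int) -> list:
    """One-hot encode the WINDOW residues centred on position `centre`.

    Returns a flat list of WINDOW * len(ALPHABET) = 273 zeros and ones.
    """
    half = WINDOW // 2
    vector = []
    for pos in range(centre - half, centre + half + 1):
        symbol = sequence[pos] if 0 <= pos < len(sequence) else "-"
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(symbol)] = 1
        vector.extend(one_hot)
    return vector


if __name__ == "__main__":
    inputs = encode_window("MKTAYIAKQR", centre=0)   # hypothetical sequence
    print(len(inputs), sum(inputs))
```

Each window would be paired with a 3-element target (helix, sheet, coil) for the residue at its centre.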
Towards comprehensive EI for bioinformatics applications
• Hybrid models
• Using all available information (gene expression, biological, clinical, etc.) → comprehensive simulation systems
[Figure: an evolving model of a cell. Labels: cell parameters; system parameters; DNA data of a living cell; RNA data; protein data; existing databases (DNA, genes, proteins, metabolic networks); new knowledge extracted; output information.]
Dynamic Cell Modelling
• “The cell is never conquered until its total behaviour is understood, and the total behaviour of the cell is never understood until it is modelled and simulated.” (Tomita, 2001)
• Computer modelling of processes in living cells is an extremely difficult task.
» The processes in a cell are dynamic and depend on many variables, some of them related to a changing environment.
» The processes of DNA transcription and protein translation are not fully understood.
• Several cell models have been created and experimented with
• A starting point for dynamic modelling of a cell would be dynamic modelling of a single gene regulation process
• The next step in dynamic cell modelling would be to try to model the regulation of more genes, hopefully a large set of genes
Genetic networks and reverse engineering
• Genetic networks (GN) describe the regulatory interactions between genes
• Reverse engineering: from gene expression data to GN
• It is assumed that gene expression data reflects the underlying genetic regulatory network
• Genes co-expressed over time: either one regulates the other, or both are regulated by the same other genes
• What is the time unit?
• Appropriate data needed
• Validation procedure
• Correct interpretation of the models may generate new biological knowledge
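A common first step in reverse engineering is to find pairs of genes that are co-expressed over time. The sketch below scores each pair with the Pearson correlation of their time courses and keeps pairs whose absolute correlation exceeds a threshold. The measure, the threshold, and the toy series are assumptions made for illustration; the slides do not prescribe a particular co-expression measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (assumes non-constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def co_expressed_pairs(series, threshold=0.9):
    """Return (gene, gene, r) for every pair with |r| >= threshold."""
    genes = sorted(series)
    pairs = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson(series[g], series[h])
            if abs(r) >= threshold:
                pairs.append((g, h, round(r, 3)))
    return pairs


if __name__ == "__main__":
    # Hypothetical time courses at five time points.
    SERIES = {
        "g1": [1, 2, 3, 4, 5],
        "g2": [2, 4, 6, 8, 10],   # rises with g1
        "g3": [5, 4, 3, 2, 1],    # falls as g1 rises
        "g4": [1, 3, 1, 3, 1],    # unrelated to g1
    }
    for pair in co_expressed_pairs(SERIES):
        print(pair)
```

As the bullet on co-expression notes, a high correlation does not say which gene regulates which, or whether both respond to a third gene, so such pairs are candidates for validation rather than network edges.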
Evolving fuzzy neural networks for GRN modeling
G(t) → EFuNN → G(t+dt)
• On-line, incremental learning of a GN
• Adding new inputs/outputs (new genes)
• The rule nodes capture clusters of input genes that are related to the output genes
• Rules can be extracted that explain the relationship between G(t) and G(t+dt), e.g.:
» IF g13(t) is High (0.87) and g23(t) is Low (0.9) THEN g87(t+dt) is High (0.6) and g103(t+dt) is Low
• Varying the threshold gives stronger or weaker patterns of relationship
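Two mechanical pieces of this setup can be sketched without the network itself: turning a time course into (G(t), G(t+dt)) training pairs, and applying a threshold to the membership degrees of a rule's conditions. The helper names and the extra condition on g45 are hypothetical, and the pruning shown is a plain filter rather than the full EFuNN rule-extraction procedure.

```python
def lagged_pairs(time_course, lag=1):
    """Pair each expression vector G(t) with the vector `lag` steps later."""
    return [(time_course[t], time_course[t + lag])
            for t in range(len(time_course) - lag)]


def prune_rule(conditions, threshold):
    """Keep only the conditions whose membership degree reaches the threshold."""
    return {cond: degree for cond, degree in conditions.items()
            if degree >= threshold}


if __name__ == "__main__":
    # Hypothetical time course: two genes at three time points.
    course = [[0.1, 0.9], [0.4, 0.6], [0.8, 0.2]]
    print(lagged_pairs(course))

    # Conditions of one rule; the g45 entry is made up for the example.
    rule = {"g13(t) is High": 0.87, "g23(t) is Low": 0.9, "g45(t) is High": 0.3}
    print(prune_rule(rule, threshold=0.8))   # keeps the two strong conditions
    print(prune_rule(rule, threshold=0.2))   # a lower threshold keeps all three
```

A higher threshold leaves fewer, stronger conditions in each rule; a lower one admits weaker relationships.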
DENFIS: Dynamic, evolving neuro-fuzzy inference systems for GN modeling
(IEEE Trans. FS, April, 2002)
• G(t) -> gj(t+dt)
• Dynamic partitioning of the input space
• Takagi-Sugeno fuzzy rules, e.g.:
if G1 is (0.63 0.70 0.76) and
   G2 is (0.71 0.77 0.84) and
   G3 is (0.71 0.77 0.84) and
   G4 is (0.59 0.66 0.72)
then Gy = 1.84 - 1.26 X1 - 1.22 X2 + 0.58 X3 - 0.03 X4
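The rule above can be evaluated numerically under a few assumptions that the slide does not spell out: each triple is read as a triangular membership function (left, centre, right), X1 to X4 are the input values of G1 to G4, the firing strength is the product of the four memberships, and the outputs of several rules are combined by a firing-strength-weighted average. The sketch below follows those assumptions and is not a full DENFIS implementation.

```python
def triangular(x, left, centre, right):
    """Triangular membership: 0 outside (left, right), 1 at the centre."""
    if x <= left or x >= right:
        return 0.0
    if x <= centre:
        return (x - left) / (centre - left)
    return (right - x) / (right - centre)


# The rule from the slide: antecedent triples, then the linear consequent.
RULE = {
    "antecedent": [(0.63, 0.70, 0.76), (0.71, 0.77, 0.84),
                   (0.71, 0.77, 0.84), (0.59, 0.66, 0.72)],
    "bias": 1.84,
    "coeffs": [-1.26, -1.22, 0.58, -0.03],
}


def firing_strength(rule, x):
    """Product of the membership degrees of all antecedent conditions."""
    strength = 1.0
    for value, (left, centre, right) in zip(x, rule["antecedent"]):
        strength *= triangular(value, left, centre, right)
    return strength


def consequent(rule, x):
    """First-order Takagi-Sugeno output: bias + sum(coeff_i * x_i)."""
    return rule["bias"] + sum(c * v for c, v in zip(rule["coeffs"], x))


def infer(rules, x):
    """Weighted average of rule outputs; None if no rule fires."""
    weights = [firing_strength(r, x) for r in rules]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * consequent(r, x) for w, r in zip(weights, rules)) / total


if __name__ == "__main__":
    x = [0.70, 0.77, 0.77, 0.66]   # inputs placed at the membership centres
    print(firing_strength(RULE, x), infer([RULE], x))
```

With a single rule the weighted average reduces to that rule's consequent; with several rules, each contributes in proportion to how strongly the input matches its antecedent.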
Summary
• Modelling biological processes aims to create models that trace these processes over time.
• The models should reveal the steps of development, the metamorphoses that occur at different points in time, and the “trajectories” of the developed patterns.
• Biological processes evolve dynamically and require appropriate techniques, such as evolving connectionist systems.
Further Readings
• Computational Molecular Biology (Pevzner, 2001).
• Applications of neural network methods, mainly multilayer perceptrons and self-organising maps, in the general area of genome informatics (Wu and McLarty, 2000).
• Microarray gene technologies (Schena, 2000).
• Data mining in biotechnology (Persidis, 2000).
• Application of the theory of complex systems for dynamic gene modelling (Bar-Yam, 1997).
• Computational modelling of genetic and biochemical networks (Bower and Bolouri, 2001).
• Dynamic modelling of the regulation of a large set of genes (Somogyi et al, 2001; D’haeseleer et al, 2000).
• Methodology for gene expression profiling (Futschik et al, 2002; Futschik, 2002).
• Using fuzzy neural networks and evolving fuzzy neural networks in bioinformatics (Kasabov, Futschik and Middlemiss, 2000).
• Fuzzy clustering for gene expression analysis (Futschik and Kasabov, 2002).
• Artificial neural filters for pattern recognition in protein sequences (Schneider and Wrede, 1993).
• Dynamic models of the cell (Schaff and Loew, 1999; Tomita et al, 1999; Kohn and Dimitrov, 2000).