From Biological Systems to
Computer Sciences and Back
Pedro Pablo González
e-mail: [email protected]
ALMA MATER STUDIORUM UNIVERSITA’ DI BOLOGNA
DEIS, SEDE DI CESENA, ITALIA
Course and Lab
Second part: Bioinformatics and Systems Biology
Bioinformatics and Systems Biology
The many results from the fields of computer
science, information science and engineering have
met with biology, leading to new, emerging
disciplines such as:
• Bioinformatics, and
• Systems biology
Bioinformatics: an overview
[Diagram: cellular and molecular biology, computer sciences, information sciences and engineering converge in bioinformatics, which supports data visualization, modeling and simulation, inference, connection and prediction.]
Bioinformatics: an overview
By applying techniques such as
• Machine learning,
• Self-organizing maps,
• Statistical algorithms,
• Clustering algorithms, and
• Multi-agent systems
to modern biology, we can
• Model and simulate some functions of the cell (e.g. protein-protein interaction, protein structure prediction, gene expression and gene regulation),
• Make inferences from molecular biology databases,
• Make connections among biological data, and
• Derive useful predictions
Bioinformatics: an overview
Bioinformatics involves the application of novel techniques and computational tools in:
• Bio-information processing systems
• Data integration
• Protein interaction pathways and responses
• Protein structure and modeling
• Gene networks
• Microarrays and gene expression patterns
• Sequence analysis
• Gene regulation
• Motif finding
Proteome and Proteomics
The term Proteome refers to the complete set of the
proteins produced from all the genes of a genome.
The Proteome is dynamic: its state can change continually.
Proteomics is the study of the structure and function
of proteins as well as protein-protein interaction
within a cell. The term Proteomics also refers to the
quantity and distribution of proteins within the cell or
organism.
Proteins
[Figure: a polypeptide (protein) drawn as a chain of amino acids: Pro, Ser, Leu, Lys, Arg, Asp, Ala, Val.]
Amino acids and nucleotides are simple organic molecules that form the
two most important constituents of living organisms: the polypeptides
(proteins) and the polynucleotides (genes), respectively.
Proteins are built from a standard set of 20 amino acids. Proteins
are considered the most basic building blocks of life. The function of a
protein is determined by its particular physical structure and chemical properties.
Types of proteins
Proteins can be classified by biological function into the
following categories:
• Enzymes
• Structural proteins
• Transport proteins
• Antibodies
• Hormones
• Proteins that execute mechanical work
• Gene regulatory proteins
Protein-protein interactions
• In multicellular organisms, decisions about survival,
secretion of some molecule type, growth, differentiation,
senescence or death, are made on the basis of signals.
• Each cell in an organism receives specific combinations
of chemical signals generated by other cells.
• The extracellular stimuli include growth factors,
hormones, cytokines, neuropeptides, ions and cell-cell
adhesion signals.
• The cell must integrate information from multiple sources
in order to follow its development program and respond
appropriately to a wide range of conditions, thereby
ensuring the adaptability and consequent survival of the
organism.
Protein-protein interactions
In biological signal transmission, signaling proteins interact
with one another and with other molecules, including substrates,
products and regulators (e.g. second messengers). The
interaction occurs in varied ways, and signal transmission
includes several general mechanisms, such as:
• Regulation by protein-protein interaction
• Protein phosphorylation
• Protein dephosphorylation
• Production of second messengers
• Cell surface signal recognition systems
[Figure: scheme of intracellular signal transmission. Compartments: extracellular milieu, membrane, cytoplasm, nucleus (DNA). Components: first messenger, receptor, transducer protein, effector enzyme, second messenger, protein kinases, proteins, transcription factors. Events: activation, phosphorylation.]
A specific set of cell surface receptors allows cells to perceive changes in the extracellular environment. Once the signals bind to the receptors, different processes are activated, generating complex intracellular signaling networks. A protein processing network mediates the transmission of extracellular signals to their intracellular targets.
Protein-protein
interaction
[Figure: protein-protein interaction network starting at EGF and its receptor EGFR. Labeled components include Src, Shc, Grb2, Gab2, Sos, Ras, Rap, Cdc42, Rac, Rho, SHP1, SHP2, PLCg, PI3K, PAK, Raf, JAK, PKC, PDK, AKT, MEK, MEKK, RSK, MKP1, ERK1, ERK2, JNKK, JNK, and nuclear targets such as c-Fos, c-Jun, JunB, JunD, FosB, SRF, CREB, AP1, STAT and ATF2.]
Protein-protein
interaction
The MAPK cascade is a central signal transduction pathway that is activated by growth factors, and is known to be involved in diverse cellular functions, such as cell proliferation, differentiation, senescence and apoptosis.
Protein-protein interaction: modeling,
simulation and visualization
Models in signaling pathways have been designed using different points
of view, often proposed according to the perspective that each research
group has of a given pathway. This involves the types of information
processing present at cellular level, such as:
• Sequential
• Parallel
• Distributed
• Concurrent
but also the emergent cognitive capabilities exhibited by certain
signaling pathways, such as:
• Memory
• Learning
• Pattern recognition
• Handling fuzzy data
Protein-protein interaction: modeling,
simulation and visualization
Models of the multiple protein interactions found in
signal transduction make it possible to visualize
the network as a whole. They also make it possible
to predict the effects, at the cellular level, of
changes or perturbations to the system. The
perturbations of interest include protein mutations
and lesions to sections of the signaling pathway.
Summary of different computational approaches for the modeling of protein-protein interaction

Boolean networks
Idea behind the approach: The cell can be modeled as a network of two-state components that interact with one another. The state of each component in the network depends on a particular Boolean function.
Cognitive capabilities / types of information processing: Boolean logic computation / parallel processing

Expert systems
Idea behind the approach: The interactions (activation, phosphorylation, etc.) between signaling network components are modeled using production rules.
Cognitive capabilities / types of information processing: Knowledge-based inference and deduction / sequential and parallel processing
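The Boolean network idea can be sketched in a few lines of Python. This is a minimal illustration, not a model taken from the slides: the component names and the wiring (`ligand`, `receptor`, `kinase`, `phosphatase`, `tf`) are hypothetical, and all components are updated synchronously from the previous state.

```python
# Toy Boolean network: each component is ON (True) or OFF (False).
# The wiring below is hypothetical and only illustrates the update scheme.

def step(state):
    """Synchronously update every component from the previous state."""
    return {
        "receptor": state["ligand"],
        "kinase": state["receptor"] and not state["phosphatase"],
        "tf": state["kinase"],
        "ligand": state["ligand"],            # external input, held constant
        "phosphatase": state["phosphatase"],  # external input, held constant
    }

def run(state, n_steps):
    """Return the trajectory of states, including the initial one."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory

initial = {"ligand": True, "receptor": False, "kinase": False,
           "tf": False, "phosphatase": False}
for t, s in enumerate(run(initial, 3)):
    print(t, {name: int(value) for name, value in s.items()})
```

With the ligand present and no phosphatase, the signal reaches the transcription factor after three updates. Setting `phosphatase` to `True` in the initial state blocks the kinase, so the transcription factor stays off.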
Summary of different computational approaches for the modeling of protein-protein interaction

Cellular automata
Idea behind the approach: The interaction between cells or molecules is modeled as a matrix, where the state of an element of the matrix depends on the states of the neighboring elements.
Cognitive capabilities / types of information processing: None / parallel processing

Petri nets
Idea behind the approach: The cell is seen as a connected graph with two types of nodes. One type represents elements, such as signaling molecules and proteins, whereas the other type represents transitions, such as activations.
Cognitive capabilities / types of information processing: None / sequential and concurrent processing
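As a rough sketch of the Petri net view, the code below keeps a marking (tokens per place) and fires transitions that consume and produce tokens. The place and transition names (`ligand`, `receptor`, `complex`, `kinase`, `kinase_P`, `bind`, `activate`) are assumptions made for illustration only.

```python
# Minimal Petri net: places hold tokens, transitions consume and produce them.
# Place and transition names are hypothetical.

marking = {"ligand": 1, "receptor": 1, "complex": 0, "kinase": 1, "kinase_P": 0}

# Each transition: (tokens consumed per place, tokens produced per place).
transitions = {
    "bind": ({"ligand": 1, "receptor": 1}, {"complex": 1}),
    "activate": ({"complex": 1, "kinase": 1}, {"complex": 1, "kinase_P": 1}),
}

def enabled(marking, name):
    """A transition is enabled when every input place holds enough tokens."""
    consumed, _ = transitions[name]
    return all(marking[p] >= n for p, n in consumed.items())

def fire(marking, name):
    """Return the new marking after firing an enabled transition."""
    if not enabled(marking, name):
        raise ValueError(f"transition {name!r} is not enabled")
    consumed, produced = transitions[name]
    new = dict(marking)
    for p, n in consumed.items():
        new[p] -= n
    for p, n in produced.items():
        new[p] += n
    return new

m1 = fire(marking, "bind")      # ligand + receptor -> complex
m2 = fire(m1, "activate")       # complex acts as a catalyst on the kinase
print(m2)
```

The `activate` transition is not enabled in the initial marking because no complex exists yet, which is how the net encodes the order of events in the pathway.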
Summary of different computational approaches for the modeling of protein-protein interaction

Artificial neural networks (ANN)
Idea behind the approach: The proteins in signaling networks are seen as artificial neurons in an ANN. Like an artificial neuron, a protein receives weighted inputs, produces an output, and has an activation value.
Cognitive capabilities / types of information processing: Memory, learning, pattern recognition, etc. / distributed, parallel, emergent processing

Multi-agent systems (MAS)
Idea behind the approach: The cell is seen as a collection of agents working in parallel. The agents communicate with one another through messages.
Cognitive capabilities / types of information processing: Memory, learning, pattern recognition, handling fuzzy data, adaptive action selection / distributed, parallel, emergent processing
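The analogy between a protein and an artificial neuron can be illustrated as follows. This is only a sketch: the sigmoid activation, the weights and the two upstream signals (a kinase and a phosphatase) are assumptions chosen for the example, not values from any published model.

```python
import math

def protein_activation(inputs, weights, bias=0.0):
    """Sigmoid of the weighted sum of upstream signals (value between 0 and 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Hypothetical node: activated by a kinase (+), inhibited by a phosphatase (-).
weights = [4.0, -4.0]
print(protein_activation([1.0, 0.0], weights))  # kinase only: strongly active
print(protein_activation([1.0, 1.0], weights))  # both present: inputs cancel
```

Positive weights play the role of activating interactions and negative weights the role of inhibitory ones; learning in such a model would consist of adjusting these weights.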
Summary of different computational approaches for the modeling of protein-protein interaction

Continuous models
Idea behind the approach: The behavior of protein populations depends on physicochemical variables such as concentration and affinity. The dynamics of the model are represented by the changes in these variables with respect to continuous time, implemented through differential equations.
Cognitive capabilities / types of information processing: None / distributed, parallel, emergent processing
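A continuous model can be sketched with a single reversible binding reaction, A + B <-> AB, integrated with the explicit Euler method. The rate law d[AB]/dt = k_on·[A]·[B] − k_off·[AB] is standard mass-action kinetics; the function name, the rate constants and the initial concentrations below are illustrative assumptions.

```python
def simulate_binding(a0, b0, k_on, k_off, dt=0.001, t_end=10.0):
    """Euler integration of d[AB]/dt = k_on*[A]*[B] - k_off*[AB]."""
    a, b, ab = a0, b0, 0.0
    for _ in range(int(round(t_end / dt))):
        rate = k_on * a * b - k_off * ab  # net rate of complex formation
        a -= rate * dt
        b -= rate * dt
        ab += rate * dt
    return a, b, ab

a, b, ab = simulate_binding(a0=1.0, b0=1.0, k_on=1.0, k_off=0.5)
print(a, b, ab)
```

For these values the concentrations settle near [A] = [B] = [AB] = 0.5, the point where the forward and reverse rates balance. A real pathway model would couple many such equations and would normally use a proper ODE solver instead of fixed-step Euler.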
Cellulat: an agent-based intracellular signaling model
González, P.P., Cárdenas, M., Camacho, D., Franyuti, A., Rosas, O. and Lagúnez J.
Instituto de Química, Universidad Nacional Autónoma de México, Ciudad Universitaria, 04510
Mexico, D.F., Mexico
Protein structure prediction
Protein structure determines protein function.
The primary structure of a protein is its linear sequence of amino acids from the beginning to the end of the molecule.
The secondary structure of a protein refers to the local folding patterns of the polypeptide chain, such as alpha helices and beta sheets, which are stabilized by hydrogen bonds between backbone atoms.
The tertiary structure of a protein is its three-dimensional shape.
Protein structure prediction
Protein structure prediction is the prediction of protein tertiary
structure from primary structure. The three-dimensional
structure of a protein is determined by its sequence of amino
acids. For each particular sequence of amino acids there is a
unique three-dimensional native conformation.
Protein structure prediction
While the primary structure of a protein can be determined
directly from the DNA sequence of the gene that encodes it,
predicting the three-dimensional structure of a protein from
its primary structure is one of the great open
problems in molecular cell biology.
Knowledge of the three-dimensional structure of proteins
contributes greatly to the functional characterization of genes
and the proteins they encode, and to the consequent
development of pharmaceutical products that affect that
function.
Protein structure prediction
Computational models for protein structure
prediction
• Bayesian models
• Hidden Markov models
• Neural networks (for secondary protein
structure prediction)
• Genetic algorithms
• Genetic programming
• Multi-agent systems
Genomics
The term “genome” refers to the complete set of chromosomes and genes of an organism. The DNA of higher organisms is organized into chromosomes. A normal human cell contains 23 chromosome pairs.
[Diagram: a genome consists of chromosomes, a chromosome consists of genes, and a gene consists of DNA base pairs.]
Genes
Genes are made of DNA. They are segments of DNA that encode
proteins through the intermediate action of mRNA. A gene coding
for a particular protein corresponds to a sequence of nucleotides
along a region of a DNA molecule.
[Figure: a gene and its surrounding genomic region on the DNA; a polynucleotide (DNA or RNA) drawn as a chain of nucleotides: C, C, A, T, C, A, A, A, G, C, G, T.]
Gene expression
All manifestations of life are orchestrated by the regulated expression of
genes, through the production of specific proteins.
Gene expression involves a complex series of processes. Among these are:
1. The transcription of the corresponding genes from DNA
into mRNA transcripts, and
2. The translation of the mRNA molecules into proteins
Some of the main strategies for regulating gene expression include the control of:
1. The available amount of mRNA,
2. The rate of translation of the mRNA molecules into proteins,
and
3. The activity of the proteins produced
Gene expression
[Figure: a gene and its surrounding genomic region on the DNA, including untranscribed sequences, is transcribed into an mRNA transcript; the transcript, which contains untranslated regions, is translated into a protein.]
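The two steps of gene expression, transcription and translation, can be illustrated with a toy example. The codon assignments below are from the standard genetic code, but the table is deliberately partial, and the gene sequence and function names are hypothetical; untranslated regions, introns and regulation are ignored.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "AUG": "Met", "CCU": "Pro", "UCU": "Ser", "CUG": "Leu", "AAA": "Lys",
    "CGU": "Arg", "GAU": "Asp", "GCU": "Ala", "GUU": "Val", "UAA": "Stop",
}

def transcribe(coding_strand):
    """The mRNA has the coding strand's sequence with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Read codons from the first AUG until a stop codon.

    Raises KeyError for codons missing from the partial table above.
    """
    start = mrna.find("AUG")
    if start < 0:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide

gene = "ATGCCTTCTCTGAAACGTGATGCTGTTTAA"  # hypothetical toy gene
mrna = transcribe(gene)
print(mrna)
print("-".join(translate(mrna)))
```

The toy gene encodes a start codon followed by the residues Pro, Ser, Leu, Lys, Arg, Asp, Ala and Val, and ends with a stop codon.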
Measuring gene expression levels
Microarray technology
Gene expression data can come from techniques
such as microarrays, expressed sequence tag
(EST) and serial analysis of gene expression
(SAGE) (Liang, 2002). Of these techniques, the
first is now the most developed (Moreau et al.,
2002; Murphy, 2002). A microarray contains
ordered sets of oligonucleotides that permit the
determination of the expression level of genes. A
typical microarray data set includes the
expression level of genes exposed to many
experimental physiological conditions or diverse
tissues.
Gene expression analysis
Several computational tools have been developed for the
analysis of gene expression data including techniques such
as:
• Clustering algorithms (Eisen et al., 1998;
Moreau et al., 2002)
• Statistical methods
• Self-organizing maps (Tamayo et al., 1999)
• Machine learning (Kim et al., 2000)
Gene expression analysis:
clustering algorithms
Clustering means grouping together objects that are similar to each
other. Many clustering techniques have been applied to the analysis of
gene expression patterns. These techniques are useful when they are
applied correctly to a set of gene expression profiles.
Clustering techniques can be classified as:
• Hierarchical clustering
• Non-hierarchical clustering
• Divisive clustering
• Agglomerative clustering
• Supervised clustering
• Unsupervised clustering
Gene expression analysis:
hierarchical clustering algorithms
In hierarchical clustering, each gene expression profile is initially
assigned to its own cluster. The following iterative process is then
executed, in which the distance between every pair of clusters is
calculated and the two closest clusters are grouped together:
1. A distance between every pair of clusters is calculated according
to a certain distance measure (a distance matrix is generated for
all genes to be clustered).
2. The two most similar clusters are selected (initially each cluster
consists of a single gene).
3. The two closest clusters are merged to produce a new cluster,
which contains at least two genes.
4. The distance matrix is updated by calculating the distances
between this new cluster and all other clusters.
5. Steps 2 to 4 are repeated until all genes are in one cluster.
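The agglomerative procedure described above can be sketched in plain Python. This version assumes Euclidean distance and average linkage, neither of which is specified in the slides, and for brevity it recomputes cluster distances on every iteration instead of updating a stored distance matrix. The expression profiles are made-up toy values.

```python
import math

def average_linkage(c1, c2, profiles):
    """Mean pairwise Euclidean distance between members of two clusters."""
    dists = [math.dist(profiles[i], profiles[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)

def agglomerate(profiles):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(profiles))]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: compute all cluster distances and pick the closest pair.
        d, a, b = min(
            (average_linkage(clusters[i], clusters[j], profiles), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Steps 3-4: merge the pair; distances are recomputed next iteration.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Toy expression profiles (rows = genes, columns = conditions).
profiles = [[0.0, 0.1, 0.2], [0.1, 0.1, 0.3], [2.0, 2.1, 1.9], [2.1, 2.0, 2.0]]
merges = agglomerate(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

The recorded merge distances are the branch heights of the resulting tree: genes 0 and 1 are joined first, then genes 2 and 3, and the two pairs are joined last at a much larger distance.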
Gene expression analysis:
hierarchical clustering algorithms
The hierarchical clustering process gives rise to a tree structure, which is
presented as a dendrogram. The final clusters are selected by cutting the
tree at a certain level of the dendrogram.
[Figure: dendrogram built from gene expression profiles. The height of the branches is proportional to the distance between the clusters.]
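Cutting the tree at a given height can be sketched without building the dendrogram explicitly: merging stops as soon as the closest pair of clusters is farther apart than the cut height. The sketch below assumes single linkage and Euclidean distance; the function name `cut_clusters` and the data points are hypothetical.

```python
import math

def cut_clusters(profiles, cut_height):
    """Merge the closest clusters (single linkage) until the next merge
    would exceed cut_height; the remaining clusters are the final ones."""
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(math.dist(profiles[i], profiles[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut_height:
            break  # every remaining branch joins above the cut
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)

points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]]  # made-up profiles
print(cut_clusters(points, cut_height=2.0))
```

A low cut yields many small clusters and a high cut yields a few large ones, so the choice of level determines the granularity of the final grouping.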
Gene expression analysis:
self-organizing maps (SOM)
In a SOM, the genes are assigned to a series of clusters on the basis of the
similarity of their expression vectors to the reference vectors (cluster centers).
Before starting the analysis, the user has to predefine a topology of nodes
(typically a two-dimensional rectangular or hexagonal grid, where each
node in the grid represents a cluster). Initially, a random reference vector
is generated and assigned to each node. After that, an iterative process
involves the following steps:
1. A gene expression pattern is chosen at random, and the
node that maps closest to it is selected,
2. The reference vector of the selected node (in gene
expression space) is then adjusted so that it becomes more
similar to the chosen gene expression vector,
3. The reference vectors of the other nodes that are close to
the selected node on the two-dimensional grid are also
adjusted
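The SOM training loop can be sketched as below. The linear decay of the learning rate and neighborhood radius, the Gaussian neighborhood function and all parameter values are common choices assumed here for illustration; the slides do not specify them. The function names `train_som` and `assign` are hypothetical.

```python
import math
import random

def train_som(data, rows, cols, n_iter=2000, lr0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols SOM; returns {(row, col): reference_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial reference vector for every node of the grid.
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for t in range(n_iter):
        frac = 1.0 - t / n_iter  # decays linearly from 1 toward 0
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-9)
        # Step 1: pick a random expression pattern and its closest node.
        x = rng.choice(data)
        winner = min(nodes, key=lambda n: math.dist(nodes[n], x))
        # Steps 2-3: pull the winner and its grid neighbours toward x.
        for (r, c), w in nodes.items():
            grid_d2 = (r - winner[0]) ** 2 + (c - winner[1]) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return nodes

def assign(nodes, x):
    """Return the grid position of the node closest to pattern x."""
    return min(nodes, key=lambda n: math.dist(nodes[n], x))

# Made-up expression patterns: two low-expression and two high-expression genes.
data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [1.0, 0.9, 1.0], [0.9, 1.0, 0.9]]
som = train_som(data, rows=1, cols=2)
for pattern in data:
    print(pattern, "->", assign(som, pattern))
```

After training, each gene is assigned to the node whose reference vector is closest to its expression pattern, so the nodes of the grid act as clusters.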
Gene expression analysis:
self-organizing maps (SOM)
[Figure: microarray data are converted into gene expression patterns, which are fed to the SOM.]
Genes are grouped into clusters. Each cluster is represented by the average pattern of the genes in the cluster.
An Evolving Neural Network for the Interpretation of Gene Expression
Patterns
[Figure: microarray data are converted into gene expression patterns, which are fed to the evolving neural network (NBIA).]
Genes are grouped into clusters. Each cluster is represented by the average pattern of the genes in the cluster.
*Márquez M.C., **González, P.P. and ***Lagúnez, J.
*Posgrado en Ciencias de la Computación, UNAM; **DEIS, Università di Bologna; ***Instituto de Química, UNAM
OMICS: A Journal of Integrative Biology
Jun 2005, Vol. 9, No. 2: 209-217
Gene regulation (I)
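BS: binding site
TF: transcription factor
[Figure: transcription factors (TF) bind to binding sites (BS) in the promoter region of a gene on the DNA, and the gene produces a protein. Locating the binding sites is a pattern recognition (motif finding) task.]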
Gene regulation (II)
[Figure: analysis pipeline: microarray, gene expression patterns, cluster algorithm, clusters, genes in a cluster, motif finding, found motif (pattern).]
Genes in cluster (Class 6): YER001W, YPL163C, YMR215W, YOR248W, YOR247W, YDR225W, YDR224C, YBL003C, YBL002W
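The motif finding step can be illustrated with a deliberately naive sketch: look for k-mers that occur in the promoter of every gene in a cluster. Real motif finders use probabilistic models and tolerate mismatches, so this exact-match count is only a stand-in for the idea. The sequences below are made up for the example and are not the promoters of the genes listed above.

```python
from collections import Counter

def shared_kmers(sequences, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    counts = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)  # each sequence contributes at most once
    return counts

# Made-up promoter fragments for illustration only (not real yeast data).
promoters = [
    "TTGACGCGTAAC",
    "CCACGCGTTTGA",
    "GGTACGCGTCCA",
]
counts = shared_kmers(promoters, k=6)
motifs = sorted(kmer for kmer, n in counts.items() if n == len(promoters))
print(motifs)
```

In this toy input a single 6-mer is shared by all three fragments, which is the kind of candidate pattern a motif finder would report for a cluster of co-expressed genes.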
Gene regulation (III)
Approaches to modeling gene regulation:
• Qualitative causal models
• Quantitative models
• Bayesian networks
• Learning matching