From Biological Systems to Computer Sciences and Back Pedro Pablo González e-mail: [email protected] ALMA MATER STUDIORUM UNIVERSITA’ DI BOLOGNA DEIS, SEDE DI CESENA, ITALIA Course and Lab Second part: Bioinformatics and Systems Biology

From Biological Systems to Computer Sciences and Back

Page 1: From Biological Systems to Computer Sciences and Back

From Biological Systems to

Computer Sciences and Back

Pedro Pablo González

e-mail: [email protected]

ALMA MATER STUDIORUM UNIVERSITA’ DI BOLOGNA

DEIS, SEDE DI CESENA, ITALIA

Course and Lab

Second part: Bioinformatics and Systems Biology

Page 2: From Biological Systems to Computer Sciences and Back

Bioinformatics and Systems Biology

The many results from the fields of computer

science, information science and engineering have

met with biology, leading to new, emerging

disciplines such as:

• Bioinformatics, and

• Systems biology

Page 3: From Biological Systems to Computer Sciences and Back

Bioinformatics: an overview

[Diagram: Cellular and Molecular Biology, Computer sciences, Information sciences and Engineering converge on Bioinformatics, which supports Data visualization, Modeling and Simulation, Inference, Connection and Prediction.]

Page 4: From Biological Systems to Computer Sciences and Back

Bioinformatics: an overview

By applying techniques such as

• Machine learning,

• Self-organizing maps,

• Statistical algorithms,

• Clustering algorithms, and

• Multi-agent systems

to modern biology, we can:

• Model and simulate some functions of the cell (e.g. protein-protein interaction, protein structure prediction, gene expression and gene regulation),

• Make inferences from molecular biology databases,

• Make connections among biological data, and

• Derive useful predictions

Page 5: From Biological Systems to Computer Sciences and Back

Bioinformatics: an overview

Bioinformatics involves the application of novel techniques and computational tools in:

• Bio-information processing systems

• Data integration

• Protein interaction pathways and responses

• Protein structure and modeling

• Gene networks

• Microarrays and gene expression patterns

• Sequence analysis

• Gene regulation

• Motif finding

Page 6: From Biological Systems to Computer Sciences and Back

Proteome and Proteomics

The term Proteome refers to the complete set of proteins produced from all the genes of a genome. The Proteome is dynamic: its state can change continually.

Proteomics is the study of the structure and function of proteins, as well as of protein-protein interactions within a cell. The term Proteomics also covers the study of the quantity and distribution of proteins within the cell or organism.

Page 7: From Biological Systems to Computer Sciences and Back

Proteins

[Figure: a polypeptide (protein) drawn as a chain of amino acids: Pro Ser Leu Lys Arg Asp Ala Val]

Amino acids and nucleotides are simple organic molecules that form the two most important constituents of living organisms: the polypeptides (proteins) and the polynucleotides (genes), respectively.

Proteins are built from a standard set of 20 amino acids. Proteins are considered the most basic building blocks of life. The function of a protein is determined by its particular physical structure and chemical properties.

Page 8: From Biological Systems to Computer Sciences and Back

Types of proteins

Proteins can be classified by biological function into the

following categories:

• Enzymes

• Structural proteins

• Transport proteins

• Antibodies

• Hormones

• Proteins that execute mechanical work

• Gene regulatory proteins

Page 9: From Biological Systems to Computer Sciences and Back

Protein-protein interactions

• In multicellular organisms, decisions about survival,

secretion of some molecule type, growth, differentiation,

senescence or death, are made on the basis of signals.

• Each cell in an organism receives specific combinations

of chemical signals generated by other cells.

• The extracellular stimuli include growth factors, hormones, cytokines, neuropeptides, ions and cell-cell adhesion signals.

• It is essential for the cell to integrate information from multiple sources so that it responds to the developmental program appropriately under a wide range of conditions, thereby ensuring the adaptability and consequent survival of the organism.

Page 10: From Biological Systems to Computer Sciences and Back

Protein-protein interactions

In biological signal transmission, signaling proteins interact with one another and with other molecules, including substrates, products and regulators (e.g. second messengers). The interaction occurs in varied ways, and signal transmission includes several general mechanisms such as:

• Regulation by protein-protein interaction

• Protein phosphorylation

• Protein dephosphorylation

• Production of second messengers

• Cell surface signal recognition systems

Page 11: From Biological Systems to Computer Sciences and Back

Protein-protein interaction

[Figure: signaling pathway diagram spanning the extracellular milieu, membrane, cytoplasm and nucleus. Labels: first messenger, receptor, transducer protein, effector enzyme, second messenger, protein kinases, proteins, transcription factors, DNA. Arrows are labeled Activation and Phosphorylation.]

A specific set of cell surface receptors allows cells to perceive changes in the extracellular environment. Once the signals bind to the receptors, different processes are activated, generating complex intracellular signaling networks. A protein processing network mediates the transmission of extracellular signals to their intracellular targets.

Page 12: From Biological Systems to Computer Sciences and Back

Protein-protein interaction

[Figure: protein interaction network downstream of EGF and EGFR. Node labels, as far as they can be recovered: Src, Shc, Grb2, Gab2, Dbl, p120, Sos, Ras, Rap, Rap GAP, Cdc42 GAP, cdc42, Rac, MLK, Ras GRP, Ras GRF, KSR, p115, p190, Pyk2, Rho, SHP1, SHP2, PLCg, PI3K, PAK, Raf, JAK, PKC, PDK, AKT, MEK, MEKK, RSK, MKP1, ERK1, ERK2, JNKK, JNK, FosB, JunD, cJun, cFos, SRF, JunB, CREB, cAbl, IP1, AP1, EIK1, STAT, ATF2.]

The MAPK cascade is a central signal transduction pathway that is activated by growth factors, and is known to be involved in diverse cellular functions, such as cell proliferation, differentiation, senescence and apoptosis.

Page 13: From Biological Systems to Computer Sciences and Back

Protein-protein interaction: modeling,

simulation and visualization

Models of signaling pathways have been designed from different points of view, often reflecting the perspective that each research group has of a given pathway. These points of view involve the types of information processing present at the cellular level, such as:

• Sequential

• Parallel

• Distributed

• Concurrent

but also the emergent cognitive capabilities exhibited by certain

signaling pathways, such as:

• Memory

• Learning

• Pattern recognition

• Handling fuzzy data

Page 14: From Biological Systems to Computer Sciences and Back

Protein-protein interaction: modeling,

simulation and visualization

Models of multiple protein interactions, such as those found in signal transduction, allow the general network to be visualized. They can also predict the effects at the cellular level of changes or perturbations to the system. The changes of interest include protein mutations and lesions to sections of the signaling pathway.

Page 15: From Biological Systems to Computer Sciences and Back

Summary of different computational approaches

for the modeling of protein-protein interaction

Computational approach: Boolean networks

Idea behind the approach: The cell can be modeled as a network of two-state components that interact with one another. The state of each component in the network depends on a particular Boolean function.

Cognitive capabilities and types of information processing: Boolean logic computation / parallel processing

Computational approach: Expert systems

Idea behind the approach: The interactions (activation, phosphorylation, etc.) between signaling network components are modeled using production rules.

Cognitive capabilities and types of information processing: Knowledge-based inference and deduction / sequential and parallel processing
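The Boolean network idea above can be illustrated with a minimal sketch. The component names (`ligand`, `receptor`, `kinase`, `phosphatase`) and their update rules are hypothetical and chosen only for illustration; they are not taken from any model in these slides. All components are updated synchronously, which is one common convention among several.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]

# Hypothetical update rules: each component's next state is a Boolean
# function of the current state of the network.
RULES: Dict[str, Callable[[State], bool]] = {
    "ligand": lambda s: s["ligand"],                 # external input, held constant
    "receptor": lambda s: s["ligand"],               # active while the ligand is present
    "kinase": lambda s: s["receptor"] and not s["phosphatase"],
    "phosphatase": lambda s: s["kinase"],            # negative feedback on the kinase
}

def step(state: State) -> State:
    """Synchronous update: every rule reads the previous state."""
    return {name: rule(state) for name, rule in RULES.items()}

def simulate(state: State, n_steps: int) -> List[State]:
    """Return the list of states visited, starting with the initial one."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory

initial = {"ligand": True, "receptor": False, "kinase": False, "phosphatase": False}
for t, s in enumerate(simulate(initial, 6)):
    print(t, s)
```

With these toy rules the kinase switches on and off periodically because of the negative feedback, which shows how even a two-state description can produce dynamic behavior.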

Page 16: From Biological Systems to Computer Sciences and Back

Summary of different computational approaches

for the modeling of protein-protein interaction

Computational approach: Cellular automata

Idea behind the approach: The interaction between cells or molecules is modeled as a matrix, where the state of an element of the matrix depends on the states of the neighboring elements.

Cognitive capabilities and types of information processing: None / parallel processing

Computational approach: Petri nets

Idea behind the approach: The cell is seen as a connected graph with two types of nodes. One type represents elements, such as signaling molecules and proteins, whereas the other type represents transitions, such as activations.

Cognitive capabilities and types of information processing: None / sequential and concurrent processing
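As a rough illustration of the Petri net view, the sketch below uses places for molecules and transitions for reactions. The place and transition names (`bind`, `activate`, `kinase_inactive`, and so on) are hypothetical, and the token counts are toy values.

```python
from typing import Dict, Tuple

Marking = Dict[str, int]

# transition name -> (tokens consumed per input place, tokens produced per output place)
# All names are hypothetical and only illustrate the two node types.
TRANSITIONS: Dict[str, Tuple[Marking, Marking]] = {
    "bind": ({"ligand": 1, "receptor": 1}, {"complex": 1}),
    "activate": ({"complex": 1, "kinase_inactive": 1},
                 {"complex": 1, "kinase_active": 1}),
}

def enabled(marking: Marking, name: str) -> bool:
    """A transition is enabled when every input place holds enough tokens."""
    inputs, _ = TRANSITIONS[name]
    return all(marking.get(place, 0) >= n for place, n in inputs.items())

def fire(marking: Marking, name: str) -> Marking:
    """Consume input tokens and produce output tokens, returning a new marking."""
    if not enabled(marking, name):
        raise ValueError(f"transition {name!r} is not enabled")
    inputs, outputs = TRANSITIONS[name]
    new_marking = dict(marking)
    for place, n in inputs.items():
        new_marking[place] = new_marking.get(place, 0) - n
    for place, n in outputs.items():
        new_marking[place] = new_marking.get(place, 0) + n
    return new_marking

m0 = {"ligand": 1, "receptor": 1, "kinase_inactive": 2}
m1 = fire(m0, "bind")       # ligand and receptor form a complex
m2 = fire(m1, "activate")   # the complex activates one kinase molecule
print(m2)
```

Because `activate` returns the `complex` token it consumed, the complex behaves like a catalyst here; that is a modeling choice of this sketch, not something stated in the slides.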

Page 17: From Biological Systems to Computer Sciences and Back

Summary of different computational approaches

for the modeling of protein-protein interaction

Computational approach: Artificial neural networks (ANN)

Idea behind the approach: The proteins in signaling networks are seen as artificial neurons in an ANN. Like an artificial neuron, a protein receives weighted inputs, produces an output, and has an activation value.

Cognitive capabilities and types of information processing: Memory, learning, pattern recognition, etc. / distributed, parallel, emergent processing

Computational approach: Multi-agent systems (MAS)

Idea behind the approach: The cell is seen as a collection of agents working in parallel. The agents communicate with one another through messages.

Cognitive capabilities and types of information processing: Memory, learning, pattern recognition, handling fuzzy data, adaptive action selection / distributed, parallel, emergent processing
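The artificial-neuron analogy from the ANN row can be written down in a few lines. This is only a sketch: the logistic activation function, the weights and the bias are assumptions made for illustration, since the slides do not specify them.

```python
import math
from typing import Sequence

def protein_activation(inputs: Sequence[float],
                       weights: Sequence[float],
                       bias: float = 0.0) -> float:
    """Weighted sum of upstream signals passed through a logistic function.

    Positive weights play the role of activating interactions and
    negative weights the role of inhibiting ones (an assumption of this sketch).
    """
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# Two upstream signals: one activator (weight 2.0), one inhibitor (weight -1.5).
print(protein_activation([1.0, 0.0], [2.0, -1.5]))  # activator only
print(protein_activation([1.0, 1.0], [2.0, -1.5]))  # activator plus inhibitor
```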

Page 18: From Biological Systems to Computer Sciences and Back

Summary of different computational approaches

for the modeling of protein-protein interaction

Computational approach: Continuous models

Idea behind the approach: The behavior of protein populations depends on physicochemical variables such as concentration and affinity. The dynamics of the model are represented by the changes of these variables over continuous time, implemented through differential equations.

Cognitive capabilities and types of information processing: None / distributed, parallel, emergent processing
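A continuous model of this kind can be sketched with a single differential equation. The example assumes a hypothetical protein that switches between an inactive and an active (phosphorylated) form with first-order rate constants `k_phos` and `k_dephos`, so that d(active)/dt = k_phos * (total - active) - k_dephos * active. The equation, the parameter values and the simple forward Euler integration are all illustrative assumptions.

```python
from typing import List

def simulate_phosphorylation(total: float, k_phos: float, k_dephos: float,
                             dt: float = 0.01, t_end: float = 10.0) -> List[float]:
    """Integrate d(active)/dt = k_phos*(total - active) - k_dephos*active
    with forward Euler steps, starting from active = 0."""
    active = 0.0
    trajectory = [active]
    for _ in range(int(round(t_end / dt))):
        d_active = k_phos * (total - active) - k_dephos * active
        active += dt * d_active
        trajectory.append(active)
    return trajectory

traj = simulate_phosphorylation(total=1.0, k_phos=2.0, k_dephos=1.0)
# Analytical steady state for this equation: total * k_phos / (k_phos + k_dephos)
print(traj[-1], 1.0 * 2.0 / (2.0 + 1.0))
```

Forward Euler is used only to keep the sketch dependency-free; a stiff or larger system would normally call for a dedicated ODE solver.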

Page 19: From Biological Systems to Computer Sciences and Back

Cellulat: an agent-based intracellular signaling model

González, P.P., Cárdenas, M., Camacho, D., Franyuti, A., Rosas, O. and Lagúnez J.

Instituto de Química, Universidad Nacional Autónoma de México, Ciudad Universitaria, 04510

Mexico, D.F., Mexico

Page 20: From Biological Systems to Computer Sciences and Back

Protein structure prediction

Protein structure determines protein function

The primary structure of a protein is its linear sequence of amino acids from the beginning to the end of the molecule

The secondary structure of a protein refers to the local folding of the amino acid chain into regular patterns such as alpha helices and beta sheets

The tertiary structure of a protein is its three-dimensional shape

Page 21: From Biological Systems to Computer Sciences and Back

Protein structure prediction

Protein structure prediction is the prediction of protein tertiary

structure from primary structure. The three-dimensional

structure of a protein is determined by its sequence of amino

acids. For each particular sequence of amino acids there is a

unique three-dimensional native conformation.


Page 22: From Biological Systems to Computer Sciences and Back

Protein structure prediction

While the primary structure of a protein can be determined directly from the DNA sequence of the gene that encodes it, predicting the three-dimensional structure of a protein from its primary structure is one of the great open problems of molecular cell biology.

Knowledge of the three-dimensional structure of proteins contributes greatly to the functional characterization of genes and the proteins they encode, and to the consequent development of pharmaceutical products that affect that function.

Page 23: From Biological Systems to Computer Sciences and Back

Protein structure prediction

Computational models for protein structure

prediction

• Bayesian models

• Hidden Markov models

• Neural networks (for secondary protein

structure prediction)

• Genetic algorithms

• Genetic programming

• Multi-agent systems

Page 24: From Biological Systems to Computer Sciences and Back

Genomics

The term “genome” refers to the complete set of chromosomes and genes of an organism. The DNA of higher organisms is organized into chromosomes. A normal human cell contains 23 chromosome pairs.

[Diagram: a genome consists of chromosomes, a chromosome consists of genes, and a gene consists of DNA base pairs.]

Page 25: From Biological Systems to Computer Sciences and Back

Genes

Genes are made of DNA. They are segments of DNA that encode proteins through the intermediate action of mRNA. A gene encoding a particular protein corresponds to a sequence of nucleotides along a region of a DNA molecule.

[Figure: a gene and its surrounding genomic region on a DNA molecule; a polynucleotide (DNA or RNA) drawn as a chain of nucleotides: C C A T C A A A G C G T]

Page 26: From Biological Systems to Computer Sciences and Back

Gene expression

All manifestations of life are orchestrated by the regulated expression of genes, through the production of specific proteins.

Gene expression involves a complex series of processes. Among these are:

1. The transcription of the corresponding genes from DNA into mRNA transcripts, and

2. The translation of the mRNA molecules into proteins

Some of the main strategies for regulating gene expression include the control of:

1. The available amount of mRNA,

2. The rate at which mRNA molecules are translated into proteins, and

3. The activity of the proteins produced
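The two processes named on this slide, transcription and translation, can be sketched in a few lines. The codon table below is deliberately partial (it lists only the standard-code codons used by the example), the input sequence is a made-up toy, and the functions ignore real-world details such as introns, promoters and untranslated regions.

```python
from typing import List

# Partial codon table from the standard genetic code; only the codons
# needed for the toy example are listed.
CODON_TABLE = {
    "AUG": "Met",   # also the start codon
    "GCU": "Ala",
    "AAA": "Lys",
    "GGC": "Gly",
    "UAA": "STOP",
}

def transcribe(coding_strand: str) -> str:
    """The mRNA has the sequence of the DNA coding strand with U instead of T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> List[str]:
    """Read codons from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    protein: List[str] = []
    if start == -1:
        return protein
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in the partial table")
        if CODON_TABLE[codon] == "STOP":
            break
        protein.append(CODON_TABLE[codon])
    return protein

toy_gene = "ATGGCTAAAGGCTAA"          # hypothetical coding strand
mrna = transcribe(toy_gene)
print(mrna)
print(translate(mrna))
```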

Page 27: From Biological Systems to Computer Sciences and Back

Gene expression

[Figure: DNA containing a gene and its surrounding genomic region, with untranscribed sequences. Transcription produces an mRNA transcript, which includes untranslated regions. Translation of the mRNA produces the protein.]

Page 28: From Biological Systems to Computer Sciences and Back

Measuring gene expression level

Microarray technology

Page 29: From Biological Systems to Computer Sciences and Back

Measuring gene expression level

Microarray technology

Gene expression data can come from techniques such as microarrays, expressed sequence tags (EST) and serial analysis of gene expression (SAGE) (Liang, 2002). Of these techniques, microarrays are now the most developed (Moreau et al., 2002; Murphy, 2002). A microarray contains ordered sets of oligonucleotides that permit the determination of the expression level of genes. A typical microarray data set includes the expression levels of genes under many experimental or physiological conditions, or in diverse tissues.

Page 30: From Biological Systems to Computer Sciences and Back

Gene expression analysis

Several computational tools have been developed for the

analysis of gene expression data including techniques such

as:

• Clustering algorithms (Eisen et al., 1998; Moreau et al., 2002)

• Statistical methods

• Self-organizing maps (Tamayo et al., 1999)

• Machine learning (Kim et al., 2000)

Page 31: From Biological Systems to Computer Sciences and Back

Gene expression analysis:

clustering algorithms

Clustering means grouping together objects that are similar to each other. Many clustering techniques have been applied to the analysis of gene expression patterns. They are useful when correctly applied to a set of gene expression profiles.

Clustering techniques can be classified as:

• Hierarchical clustering

• Non-hierarchical clustering

• Divisive clustering

• Agglomerative clustering

• Supervised clustering

• Unsupervised clustering

Page 32: From Biological Systems to Computer Sciences and Back

Gene expression analysis:

hierarchical clustering algorithms

When hierarchical clustering is used, each gene expression profile is initially assigned to its own cluster. After that, the following iterative process is executed:

1. A distance between every pair of clusters is calculated according to a certain distance measure (a distance matrix is generated for all genes to be clustered).

2. The two most similar clusters are selected (initially each cluster consists of a single gene).

3. The two closest clusters are merged to produce a new cluster, which contains at least two genes.

4. The distance matrix is updated by calculating the distances between this new cluster and all other clusters.

5. Steps 2 to 4 are repeated until all genes are in one cluster.
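The agglomerative procedure described in the steps above can be sketched as follows. The choices of Euclidean distance and average linkage are assumptions (the slides leave the distance measure open), the expression profiles are toy values, and for brevity the sketch recomputes cluster distances in every round instead of maintaining an explicit distance matrix.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

Cluster = Tuple[int, ...]  # indices of the genes in the cluster

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1: Cluster, c2: Cluster,
                     profiles: Sequence[Sequence[float]]) -> float:
    """Average linkage: mean distance over all pairs of members."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles: Sequence[Sequence[float]]) -> List[Tuple[Cluster, Cluster, float]]:
    """Return the merge history as (cluster_a, cluster_b, distance) triples."""
    clusters: List[Cluster] = [(i,) for i in range(len(profiles))]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        # Steps 1 and 2: evaluate all pairs and select the closest one.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], profiles))
        dist = cluster_distance(a, b, profiles)
        # Step 3: merge the pair; step 4 happens implicitly in the next round.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

toy_profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]  # hypothetical data
for a, b, dist in agglomerate(toy_profiles):
    print(a, b, round(dist, 2))
```

The merge history, together with the merge distances, is the information a dendrogram displays.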

Page 33: From Biological Systems to Computer Sciences and Back

Gene expression analysis:

hierarchical clustering algorithms

The hierarchical clustering process gives rise to a tree structure, which is presented as a dendrogram. The final clusters are selected by cutting the tree at a certain level of the dendrogram.

[Figure: dendrogram over gene expression profiles. The height of the branches is proportional to the distance between the clusters.]
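Cutting the tree at a given level can be sketched as replaying only those merges whose height lies below the cut. The merge history used here is hypothetical toy data in the form (cluster_a, cluster_b, height), and the sketch assumes that merge heights never decrease along the history, as is the case for common linkage rules such as average linkage.

```python
from typing import List, Sequence, Tuple

Merge = Tuple[Tuple[int, ...], Tuple[int, ...], float]

def cut_tree(n_genes: int, merges: Sequence[Merge], height: float) -> List[List[int]]:
    """Return the clusters obtained by cutting the dendrogram at `height`."""
    parent = list(range(n_genes))  # union-find forest over gene indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for cluster_a, cluster_b, merge_height in merges:
        if merge_height <= height:
            # Joining one representative of each side joins the two clusters.
            parent[find(cluster_a[0])] = find(cluster_b[0])

    groups = {}
    for gene in range(n_genes):
        groups.setdefault(find(gene), []).append(gene)
    return sorted(groups.values())

# Hypothetical merge history for four genes.
toy_merges = [((0,), (1,), 1.0), ((2,), (3,), 1.0), ((0, 1), (2, 3), 7.0)]
print(cut_tree(4, toy_merges, height=2.0))
```

A lower cut yields more and smaller clusters, a higher cut fewer and larger ones.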

Page 34: From Biological Systems to Computer Sciences and Back

Gene expression analysis:

self-organizing maps (SOM)

In a SOM, genes are assigned to a series of clusters on the basis of the similarity of their expression vectors to cluster center (reference) vectors. Before starting the analysis, the user has to predefine a topology of nodes (typically a two-dimensional rectangular or hexagonal grid, where each node in the grid represents a cluster). Initially, random reference vectors are generated and assigned to each node. After that, an iterative process involves the following steps:

1. A gene expression pattern is chosen at random, and the node that maps closest to it is selected,

2. The reference vector of the selected node (in gene expression space) is then adjusted so that it is more similar to the chosen gene expression vector,

3. The reference vectors of the other nodes that are close to the selected node on the two-dimensional grid are also adjusted
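The SOM training loop described in the steps above can be sketched with the standard library only. The rectangular grid, the linearly decaying learning rate and neighborhood radius, the Gaussian neighborhood function and the default parameter values are all assumptions of this sketch; it also assumes expression values scaled to roughly the range 0 to 1, because the reference vectors are initialized in that range.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Node = Tuple[int, int]  # (row, column) position on the grid

def closest_node(nodes: Dict[Node, List[float]], pattern: Sequence[float]) -> Node:
    """Grid position of the node whose reference vector is nearest to the pattern."""
    return min(nodes, key=lambda pos: math.dist(nodes[pos], pattern))

def train_som(patterns: Sequence[Sequence[float]], rows: int, cols: int,
              n_iter: int = 2000, lr0: float = 0.5, radius0: float = 1.5,
              seed: int = 0) -> Dict[Node, List[float]]:
    rng = random.Random(seed)
    dim = len(patterns[0])
    # Random initial reference vector for every node of the grid.
    nodes = {(r, c): [rng.random() for _ in range(dim)]
             for r in range(rows) for c in range(cols)}
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        lr = lr0 * frac                       # learning rate decays to zero
        radius = max(radius0 * frac, 0.5)     # neighborhood shrinks over time
        pattern = rng.choice(patterns)        # step 1: pick a pattern at random
        winner = closest_node(nodes, pattern)
        for pos, ref in nodes.items():
            grid_dist = math.dist(pos, winner)
            if grid_dist <= radius:           # steps 2 and 3: winner and grid neighbors
                influence = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                for k in range(dim):
                    ref[k] += lr * influence * (pattern[k] - ref[k])
    return nodes

toy_patterns = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [1.0, 1.0, 1.0], [0.9, 1.0, 0.9]]
som = train_som(toy_patterns, rows=1, cols=2)
for p in toy_patterns:
    print(p, "->", closest_node(som, p))
```

After training, each gene is assigned to the cluster of its closest node, as in the last loop of the sketch.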

Page 35: From Biological Systems to Computer Sciences and Back

Gene expression analysis:

self-organizing maps (SOM)

[Figure: Microarray… → gene expression patterns → SOM → clusters.]

Genes are grouped into clusters. Each cluster is represented by the average pattern of the genes in the cluster.

Page 36: From Biological Systems to Computer Sciences and Back

An Evolving Neural Network for the Interpretation of Gene Expression Patterns

*Márquez M.C., **González, P.P. and ***Lagúnez, J.

*Posgrado en Ciencias de la Computación, UNAM, **DEIS, Università di Bologna, ***Instituto de Química, UNAM

OMICS: A Journal of Integrative Biology, Jun 2005, Vol. 9, No. 2: 209-217

[Figure: Microarray → gene expression patterns → evolving neural network (NBIA) → clusters.]

Genes are grouped into clusters. Each cluster is represented by the average pattern of the genes in the cluster.

Page 37: From Biological Systems to Computer Sciences and Back

Gene regulation (I)

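[Figure: a gene on DNA with its promoter region. Transcription factors (TF) bind to binding sites (BS) in the promoter region; the gene encodes a protein. Locating the binding sites is a pattern recognition (motif finding) task.]

BS: binding site

TF: transcription factor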

Page 38: From Biological Systems to Computer Sciences and Back

Gene regulation (II)

[Figure: Microarray → gene expression patterns → cluster algorithm → clusters → genes in cluster → motif finding → found motif (pattern).]

Genes in cluster (Class 6): YER001W, YPL163C, YMR215W, YOR248W, YOR247W, YDR225W, YDR224C, YBL003C, YBL002W
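The motif-finding step at the end of this pipeline can be illustrated with a deliberately naive sketch: look for the fixed-length subsequence (k-mer) that occurs in the largest number of promoter sequences of a cluster. The sequences below are invented toy strings, not the promoters of the genes listed above, and exact k-mer counting is only a stand-in; practical motif finders allow mismatches and use statistical models.

```python
from collections import Counter
from typing import Sequence, Tuple

def most_shared_kmer(sequences: Sequence[str], k: int) -> Tuple[str, int]:
    """Return the k-mer found in the most sequences and that number of sequences.

    Each k-mer is counted at most once per sequence; ties are broken
    alphabetically so that the result is deterministic.
    """
    counts: Counter = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    best_count = max(counts.values())
    best_kmer = min(kmer for kmer, n in counts.items() if n == best_count)
    return best_kmer, best_count

# Hypothetical promoter fragments for the genes of one cluster.
toy_promoters = ["TTACGCGTAA", "GGACGCGTCC", "CAACGCGTTG"]
print(most_shared_kmer(toy_promoters, k=6))
```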

Page 39: From Biological Systems to Computer Sciences and Back

Gene regulation (III)

Approaches to modeling gene regulation:

• Qualitative causal models

• Quantitative models

• Bayesian networks

• Learning matching