65
1. Lecture WS 2003/04 Bioinformatics III 1 Bioinformatics III (“Systems biology”) se will address two areas: alysis and comparison of whole genome sequences ystems biology“ – integrated view of cellular networks

Bioinformatics III (“Systems biology”)

  • Upload
    ion

  • View
    46

  • Download
    0

Embed Size (px)

DESCRIPTION

Bioinformatics III (“Systems biology”). Course will address two areas: analysis and comparison of whole genome sequences „systems biology“ – integrated view of cellular networks. Whole Genomes - Content. genome assembly gene finding genome alignment - PowerPoint PPT Presentation

Citation preview

Page 1: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 1

Bioinformatics III (“Systems biology”)

Course will address two areas:

- analysis and comparison of whole genome sequences

- „systems biology“ – integrated view of cellular networks

Page 2: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 2

Whole Genomes - Content

genome assembly

gene finding

genome alignment

whole genome comparison (prokaryotes, human mouse)

genome rearrangements

transcriptional regulation

functional genomics

phylogeny

single nucleotide polymorphisms (SNPs)

some topics were already covered in Bioinformatics 1lecture by Prof. Lenhof

Page 3: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 3

Cellular Networks - Content

network topologies: random networks, scale free networks

robustness of networks

expression analysis

metabolic networks, metabolic flow analysis

linear systems, non-linear dynamics

molecular systems biology: protein-protein interaction networksmolecular machines ...

Page 4: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 4

Literature

whole genome sequencese.g. David Mount, BioinformaticsChapters 6, 8, 10

system biologymostly taken from original literature

Web-resources- Institute of Systems Biology, Seattle, WA

http://www.systemsbiology.org/

- The systems biology institutehttp://www.systems-biology.org/

€ 68

Page 5: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 5

assignments

12 weekly assignments planned

Homeworks are handed out in the Tuesday lectures and are available onour webserver http://gepard.bioinformatik.uni-saarland.de on the same day. Solutions need to be returned until Tuesday of the following week 14.00in room 1.05 Geb. 17.1, first floor, or handed in prior (!) to the lecture starting at 14.15. In case of illness please send E-mail to:[email protected] and provide a medical certificate.

Page 6: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 6

Schein = successful written exam

The successful participation in the lecture course („Schein“) will be certified upon successful completion of the written exam on Feb. 18, 2004.

Participation at the exam is open to those students who have received 50% of credit points for the 12 assignments.

Unless published otherwise on the course website until Feb. 4, the exam will be based on all material covered in the lectures and in the assignments. In case of illness please send E-mail to:[email protected] and provide a medical certificate.

A „second and final chance“ exam may be offered at the beginning of April 2004 to those who failed the first exam and those who missed the first exam due to illness (medical certificate required).

Page 7: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 7

tutors

Prof. Dr. Volkhard HelmsSprechstunde: Tue 10-12. Geb. 17.1, room 1.06.Generally, I am also available after the lectures.

Dr. Tihamer Geyer – assignments for network partGeb. 17.1, room 1.09.

guest lecturers+tutors

Page 8: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 8

Bacteria Archaea EukaryaEuryarchaeota

Animals Fungi

Plants

Ciliates

Stramenophiles Trichomonads

MicrosporidiaDiplomonads

Slime molds

Entamoebae

Thermoplasma

Methanosarcina Methanobacterium MethanococcusThermococcus

Thermoproteus Pyrodictium

Crenarchaeota

Green nonsulfur bacteria

Deinococci

Aquifex

Thermotogales

Spirochetes

Flavobacteria

Cyanobacteria

Purple bacteria Gram-positive

Chlamydiae

Halophiles

Tree of Life

Page 9: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 9

Genomes

A genome is the entire genomic material of any of these biological organism.We will review genome organization, known sequences, genome language, sequencing details etc. in the next lecture.

Now that we have genome information from multiple organisms I see the following issues:1 what biological questions do we ask?2 what bioinformatics tools do we need to find the answers?3 what are the answers?

Page 10: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 10

Why mouse?

Genetecists have anxiously awaited the recently published draft version of the Mouse genome? Why?

Mouse as a close relative to humans is a unique lens through which we can view ourselves.

As the leading mammalian system for genetic research over the past century it has provided a model for human physiology and disease.

Comparative genomics makes it possible to discern biological features that would otherwise escape our notice.

Nature 420, 520 (2002)

19 mouse chromosomes.

Page 11: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 11

How do we compare genomes?

Conservation of synteny between human and mouse. 558,000 highly conserved, reciprocally unique landmarks were detected within the mouse and human genomes, which can be joined into conserved syntenic segments and blocks. A typical 510-kb segment of mouse chromosome 12 that shares common ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines connect the reciprocal unique matches in the two genomes. In general, the landmarks in the mouse genome are more closely spaced, reflecting the 14% smaller overall genome size.

Nature 420, 520 (2002)

Page 12: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 12

Genome rearrangements

Segments and blocks >300 kb in size with conserved synteny in human are superimposed on the mouse genome. Each colour corresponds to a particular human chromosome. The 342 segments are separated from each other by thin, white lines within the 217 blocks of consistent colour. Genome rearrangments have functional implications (will be discussed later).

Nature 420, 520 (2002)

Page 13: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 13

• dynamic programming: Needleman-Wunsch, Smith Waterman

• sequence alignments

• substition matrices

• significance of alignments

• BLAST, algorithmn – parameters – output http://www.ncbi.nih.gov

• This part of lecture taken from

• O’Reilly book on “BLAST” by Korf, Yandell, Bedell

• see also Bioinformatik I lecture by Prof. Lenhof

• weeks 3 and 5

Review: Pairwise sequence alignment

Page 14: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 14

Sequence alignment• When 2 or more sequences are present one would like

- to detect quantitatively their similarities

- discover equivalences of single sequence motifs

- observe regularities of conservation and variability

- deduce historical relationships

- important goal: annotation of structural and functional properties

assumption: sequence, structure, and function are inter-related.

Page 15: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 15

Search in databases• Identify similarities between

• a new test sequence, of

• unknown and uncharacterized structure and function

• and

sequences in (public) sequence databaseswith known structure and function.

•N.B. The similar regions can encompass the entire sequence or parts of it!

• Local alignment global alignment

Page 16: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

J.LeunissenBioinformatics III

Sequence Alignment

The purpose of a sequence alignment is to arrange all those residues of a deliberate number of sequences beneath eachother that are derived from the same residue position in an ancestral gene or protein.

gap = Insertion oder Deletion

Page 17: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 17

Needleman-Wunsch Algorithm- general algorithm for sequence comparison

- maximises a similarity score

- maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions

- finds the best GLOBAL alignment of any two sequences

- NW involves an iterative matrix method of calculation

all possible pairs of residues (bases or amino acids) – one from each sequence – are represented in a two-dimensional array

all possible alignments (comparisons) are represented by pathways through this array.

Three main steps 1 initialization 2 fill (induction) 3 trace-back

Page 18: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 18

Needleman-Wunsch Algorithm: Initializationtask: align words “COELACANTH” and “PELICAN” of length m=10 and n=7.

Construct (m+1) (n+1) matrix.

Assign values – m gap and – n gap to elements m and n of first row and first column. Here, gap = -1.

Arrows of these fields point back to origin.

C O E L A C A N T H0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

P -1

E -2

L -3

I -4

C -5

A -6

N -7

Page 19: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 19

Needleman-Wunsch Algorithm: FillFill all matrix fields with scores and pointers using a simple operation that requires the scores from the diagonal, vertical, and horizontal neighboring cells. Compute

- match score: value of upper left diagonal cell + score for a match (+1 or -1)

- horizontal gap score: value of cell to the left + gap score (-1)

- vertical gap score: value of cell to the top + gap score (-1)

assign maximum of these 3 scores to cell. point arrow in direction of maximum score.

max(-1, -2, -2) = -1

max(-2, -2, -3) = -2

(make arbitrary, consistent choice – e.g. always choose the diagonal over a gap.

C O E L A C A N T H0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

P -1 -1 -2

Page 20: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 20

Needleman-Wunsch Algorithm: Trace-backtrace-back lets you recover the alignment from the matrix.

start at the bottom-right corner and follow the arrows until you get to the beginning.

COELACANTH

-PELICAN--

C O E L A C A N T H0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

P -1 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

E -2 -2 -2 -1 -2 -3 -4 -5 -6 -7 -8

L -3 -3 -3 -2 0 -1 -2 -3 -4 -5 -6

I -4 -4 -4 -3 -1 -1 -2 -3 -4 -5 -6

C -5 -3 -4 -4 -2 -2 0 -1 -2 -3 -4

A -6 -4 -4 -5 -3 -1 -1 1 0 -1 -2

N -7 -5 -5 -5 -4 -2 -2 0 2 1 0

Page 21: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 21

Smith-Waterman-Algorithm Smith-Waterman is a local alignment algorithm. SW is a very simple modification of Needleman-Wunsch. Only 3 changes:

- edges of the matrix are initialized to 0 instead of increasing gap penalties.

- maximum score is never less than 0. No pointer is recorded unless the score is greater than 0.

- trace-back starts from highest score in matrix and ends at a score of 0.

ELACAN

ELICANC O E L A C A N T H

0 0 0 0 0 0 0 0 0 0 0

P 0 0 0 0 0 0 0 0 0 0 0

E 0 0 0 1 0 0 0 0 0 0 0

L 0 0 0 0 2 1 0 0 0 0 0

I 0 0 0 0 1 1 0 0 0 0 0

C 0 1 0 0 0 0 2 0 0 0 0

A 0 0 0 0 0 1 0 3 2 1 0

N 0 0 0 0 0 0 0 1 4 3 2

Page 22: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 22

Differences Needleman-Wunsch Smith-Waterman

1 Global alignments 1 Local alignments

2 requires alignment score for a pair2 Residue alignment score may be

of residues to be 0 positive or negative

3 no gap penalty required 3 requires a gap penalty to work

efficiently

more suited for alignment of

eukaryotic sequences with exons and

introns

Page 23: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 23

Algorithmic complexity Dynamic programming methods such as Needleman-Wunsch and Smith-Waterman have O(mn) complexity in both time and memory.

Variation: just use 2 rows at a time and don’t allocate the whole matrix.

The alignment algorithm becomes O(n) in memory.

Page 24: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 24

Scoring - or Substition Matrices– serve to better score the quality of sequence alignments.

–for protein/protein comparison:a 20 x 20 matrix for the probabilities that certain amino acids are exchange others by random mutations

–the exchange of amino acids of similar character (Ile, Leu) is more likely (receives higher score) than for exchanging amino acids of dissimilar character (e.g. Ile Asp)

–scoring matrices are assumed to be symmetrical (exchange Ile Asp has the same probability as Asp Ile). Therefore they are triangular matrices.

Page 25: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 25

Substitution matrices

• Not all amino acids are similar– some can be replaced more easily than others

– some mutations occur more frequently than others

– some mutations are more long-lived than others

• Mutations prefer certain exchanges– some amino acids have similar 3-letter codons

– those residues are more replaced by random DNA mutation

• Selection prefers certain exchanges– some amino acids have similar properties and structure

(E.g. Trp cannot be inserted in the protein interior.)

Page 26: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 26

PAM250 Matrix

Page 27: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 27

Example ScoreThe Score of an alignment is the sum of all invidual scores of the amino acid (base) pairs of the alignment.

• Sequence 1: TCCPSIVARSN• Sequence 2: SCCPSISARNT• 1 12 12 6 2 5 -1 2 6 1 0 => Alignment Score = 46

Page 28: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 28

Dayhoff Matrix (1)– derived by M.O. Dayhoff who collected statistical data for probabilities of amino acid exchanges

– data set for closely related protein sequences (> 85% identity).

– advantage: these can be aligned to high certainty.

– derive 20 x 20 matrix for probabilities of amino acid mutations from the observed frequency of exchanges

– This matrix is called PAM 1. An evolutionary distance of 1 PAM (point accepted mutation) means that 1 point mutations occur per 100 residues.

Or: both sequences are 99% identical.

Page 29: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 29

• Log odds Matrix: contains logarithms of PAM matrix entries.

• Score of mutation i j•

observed mutation rate i j = log( )

• expected mutation rate according to amino acid frequency

• The probability of two independent mutational events is the product of the individual probabilities.

• When using a log odds Matrix (i.e. using the logarithm of all values) one obtains the total alignment score as sum of the scores for every residue pair.

Dayhoff Matrix (2)

Page 30: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 30

• Derive Matrices for larger evolutionary distances by multiplying the PAM1 matrix with itself.

• PAM250:

– 2,5 mutations per residue

– corresponds to 20% matches between two sequences,

– i.e. mutations are observed at 80% of all residue positions.

– This is the default matrix of most sequence analysis packages.

Dayhoff Matrix (3)

Page 31: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 31

BLOSUM Matrix• limitation of Dayhoff-Matrix:

the matrices based on the Dayhoff model of evolutionary rates are of limited value because the substitution rates were derived from sequence alignments of sequences that are more than 85% identical.

• A different path was taken by S. Henikoff and J.G. Henikoff who used local multiple alignments of distantly related sequences.

• Advantages:

- larger data sets

- multiple alignments are more robust

Page 32: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 32

BLOSUM Matrix (2)• The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database.

• The BLOCKS database uses the concept of blocks (ungapped amino acid signatures) that are characteristic for protein families.

• Derive probabilities of exchange for all amino acid pairs from the observed mutations inside the blocks. Convert into log odds BLOSUM matrix.

• Different matrices are obtained by varying the lower requirement for the level of sequence identity.

• e.g. the BLOSUM80 matrix is derived from blocks with > 80% identity.

Page 33: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Which matrix to use?• Close relationship (low PAM, high Blosum)

Distant relationship (High PAM, low Blosum)

• reasonable default parameters: PAM250, BLOSUM62

Page 34: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Gap penalties• Besides substitution matrices we need a method to score gaps

• Which relevance do insertions or deletions have relative to substitutions?

• distinguish introduction of gaps:

• aaagaaa• aaa-aaa

• from extension of gaps:

• aaaggggaaa• aaa----aaa

• different programs (CLUSTAL-W, BLAST, FASTA) recommend different default parameters which should be used as a first guess.

Page 35: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 35

Significance of Alignments (1)

• When is an alignment statistically significant?

In other words:

• How different is the obtained score of an alignment from scores that would result from alignments of the test sequence with random sequences?

• Or:

• What is the probability that an alignment of this score occured randomly?

Page 36: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 36

Significance of Alignments (2)• size of database = 20 x 106 letters

• Peptide #hits

• A 1 x 106 (if equally distributed)

• AP 50000

• IAP 2500

• LIAP 125

• WLIAP 6

• KWLIAP 0,3

• KWLIAPY 0,015

Page 37: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 37

BLAST – Basic Local Alignment Search Tool

• finds the highest-scored local optimal alignment of a test sequence with all sequences of a database.

• Very fast algorithm. Ca. 50 times faster than dynamical programming.

• because BLAST uses pre-indexed database, BLAST can be used to search very large databases.

• is sufficiently sensitive and selective for most purposes.

• Is robust – default parameters usually work fine.

Page 38: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 38

BLAST Algorithm, Step 1

• For given word of length w (usually 3 for proteins) and for a given scoring matrix

• construct list of all words (w-mers) which get score > T if compared to w-mer of input sequence.

P D G 13

P Q A 12 P Q N 12etc.

belowcut-off (T=13)

test sequence L N K C K T P Q G Q R L V N QP Q G 18P E G 15 P R G 14P K G 14 P N G 13

related words

word

P M G 13

Page 39: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 39

BLAST Algorithm, Step 2

• each related word points to positions in data base (hit list).

P D G 13

P Q G 18P E G 15 P R G 14P K G 14 P N G 13

P M G 13 PMG Database

Page 40: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 40

BLAST Algorithm, Step 3

• Program tries to extend suitable segments (seeds) in both directions by adding pairs of residues.

• Residues are added until score sinks below cut-off.

Page 41: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 41

different BLAST algorithms• BLASTN – compares nucleotide sequence against nucleotide database

• BLASTP – compares protein sequence against protein database

• BLASTX – compares nucleotide sequences translated in all 6 open reading frames against protein sequence database

• TBLASTN

• TBLASTX

Page 42: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 42

• Small probability shows that hit is likely not random

BLAST Output (1)

Page 43: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 43

Significance of BLAST alignmentP-value (probability)

– probability that alignment score could result from alignment of random sequences

– the closer P equals 0, the higher is certainty that a hit is a true hit (homologous sequence)

E-value (expectation value)– E = P * number of sequences in database

– E is the number of alignments of a particular score that can be expected to occur randomly in a sequence database of this size

– if e.g. E=10, one expects 10 random hits with the same score. Such an alignment is not significant. Use appropriate threshold in BLAST.

Page 44: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 44

Rough guide•P-value (probability) – A. M. Lesk

– P 10-100 sequences are identical

– 10-100 < P < 10-50 sequences are almost identical, e.g. alleles or SNPs

– 10-50 < P < 10-10 closely related sequences,

homology is certain

– 10-10 < P < 10-1 sequences are usually distantly related

– P > 10-1 similarity probably not significant

•E-value (expectation value)• E 0,02 sequences probably homologous

• 0,02 < E < 1 Homology possible

• E 1 good agreement most likely random.

Page 45: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 45

Rough guide

Level of sequence identity with optimal alignment

> 45% proteins have very similar structure and most likely the same function

> 25% proteins probably possess similar fold

18 – 25% Twilight-Zone - assuming homology is tempting

below alignment has little significance

Page 46: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 46

Twilight-Zone (1)• myoglobin from whale and Leghemoglobin of lupins are 15% identical with optimal alignment

• both have very similar tertiary structure. • both contain heme group and bind oxygen they are remotely related, though homologous proteins

Left: Whale Mb

Right:Leg Hb

www.rcsb.org

Page 47: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 47

Twilight-Zone (2)

• the N- and C-terminal halfs of thiosulfate-sulfate-transferase have 11% sequence identity. • Because they belong to the same protein assumption that they resulted from gene duplication and divergent evolution.

• Indeed both 3D structures show large similarity.

2ORARhodanese

www.rcsb.org

Page 48: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 48

Twilight-Zone (3)• serine proteases chymotrypsin and subtilisin• have 12% identity with optimal alignment

• both have same function, same catalytic triad of 3 amino acids (Ser – His – Asp)

• However, the two folds are completely different and the proteins are not related. Example for convergent evolution.

Left: 1AB9-BovineChymoTrypsin

Right: 1GCIBacillus LentusSubtilisin

www.rcsb.org

Page 49: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 49

Summary

Pairwise alignment of sequences is routine but not trivial.

Dynamic programming guarantees finding the alignment with optimal score (Smith-Waterman, Needleman-Wunsch).

Much faster but reliable tools are: FASTA, (PSI) BLAST

Deeper functional insight into sequences and relationships from multiple sequence alignments (see lecture on phylogenies).

Page 50: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 50

Growth of Proteomic Data vs. Sequence Data

0.00001

0.0001

0.001

0.01

0.1

1

10

100

1000

1988

1990

1992

1994

1996

1998

2000

2002

2004

2006

2008

2010

2012

2014

2016

Years

Peta

Byt

es

Proteomic data

GenBank

Page 51: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 51

Systems biology

Systems biology is an emergent field that aims at system-level understanding of biological systems.

Cybernetics, for example, aims at describing animals and machines from the control and communication theory. Unfortunately, molecular biology had just started at that time, so that only phenomenological analysis has been possible.

With the progress of genome sequence project and range of other molecular biology project that accumulate in-depth knowledge of molecular nature of biological system, we are now at the stage to seriously look into possibility of system-level understanding solidly grounded on molecular-level understanding.

http://www.systems-biology.org/000/

Page 52: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 52

Systems biology

What does it mean to understand at "system level"? Unlike molecular biology which focusses on molecules, such as the sequences of nucleotide acids and proteins, systems biology focusses on systems that are composed of molecular components.

Although systems are composed of matters, the essence of systems lies in the dynamics and cannot be described merely by enumerating components of the system.

At the same time, it is misleading to believe that only the system structure, such as network topologies, is important without paying sufficient attention to diversities and functionalities of components.

Both the structure of the system and its components play indispensable roles forming a symbiotic state of the system as a whole.

http://www.systems-biology.org/000/

Page 53: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 53

Systems biology

Key milestones are:(1) understanding of structure of the system, such as gene regulatory and biochemical networks, as well as physical structures, (2) understanding of dynamics of the system, both quantitative and qualitative analysis as well as construction of theory/models with powerful prediction capability, (3) understanding of control methods of the system, and (4) understanding of design methods of the system.

There are numbers of exciting and profound issues that are actively investigated, such as robustness of biological systems, network structures and dynamics, and applications to drug discovery. Systems biology is in its infancy, but this is the area that has to be explored and the area that we believe to be the main stream in biological sciences in this century.

http://www.systems-biology.org/000/

Page 54: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 54

Systems Biology

Nat. Biotech. Nov. 2000, 1147

Page 55: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 55

The relationship between the genotype and the phenotype is complex, highlynon-linear and cannot be predicted from simply cataloging and assigning genefunctions to genes found in a genome.

http://gcrg.ucsd.edu/presentations/hougen/l2.pdf

From Genomics to Genetic Circuits

Page 56: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 56

Genetic Circuits Engineering

Page 57: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 57

Analysis of Genetic Circuits

Page 58: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 58

Reconstructing Metabolic Networks

Page 59: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 59

Translating Biochemistry into Linear Algebra

Page 60: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 60

DOE initiative: Genomes to Lifea coordinated effort

slides borrowedfrom talk ofMarvin FrazierLife Sciences DivisionU.S. Dept of Energy

Page 61: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 61

Facility IProduction and Characterization of Proteins

Estimating Microbial Genome Capability• Computational Analysis

– Genome analysis of genes, proteins, and operons– Metabolic pathways analysis from reference data– Protein machines estimate from PM reference data

• Knowledge Captured– Initial annotation of genome– Initial perceptions of pathways and processes– Recognized machines, function, and homology– Novel proteins/machines (including

prioritization)– Production conditions and experience

Page 62: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 62

• Analysis and Modeling– Mass spectrometry expression analysis– Metabolic and regulatory pathway/ network

analysis and modeling

• Knowledge Captured– Expression data and conditions– Novel pathways and processes– Functional inferences about novel

proteins/machines– Genome super annotation: regulation, function,

and processes (deep knowledge about cellular subsystems)

Facility II Whole Proteome Analysis

Modeling Proteome Expression, Regulation, and Pathways

Page 63: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 63

Facility III Characterization and Imaging of Molecular Machines

Exploring Molecular Machine Geometry and Dynamics

• Computational Analysis, Modeling and Simulation– Image analysis/cryoelectron microscopy– Protein interaction analysis/mass spec– Machine geometry and docking modeling– Machine biophysical dynamic simulation

• Knowledge Captured– Machine composition, organization, geometry,

assembly and disassembly– Component docking and dynamic simulations

of machines

Page 64: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 64

Facility IVAnalysis and Modeling of Cellular Systems

Simulating Cell and Community Dynamics

• Analysis, Modeling and Simulation– Couple knowledge of pathways, networks, and

machines to generate an understanding of cellular and multi-cellular systems

– Metabolism, regulation, and machine simulation

– Cell and multicell modeling and flux visualization

• Knowledge Captured– Cell and community measurement data sets– Protein machine assembly time-course data sets– Dynamic models and simulations of cell processes

Page 65: Bioinformatics III (“Systems biology”)

1. Lecture WS 2003/04

Bioinformatics III 65

GTL Computing Roadmap

Biological Complexity

ComparativeGenomics

Constraint-BasedFlexible Docking

Com

putin

g an

d In

form

atio

n In

fras

truc

ture

Cap

abili

ties

Constrained rigid

docking

Genome-scale protein threading

Community metabolic regulatory, signaling simulations

Molecular machine classical simulation

Protein machineInteractions

Cell, pathway, and network

simulation

Molecule-basedcell simulation

Current U.S. Computing