114
Bioinformatics Bioinformatics and sequence analysis and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

  • View
    212

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

BioinformaticsBioinformaticsand sequence analysisand sequence analysis

Michael NilgesUnité de Bio-Informatique Structurale

Institut Pasteur, Paris

Mars 2002

Page 2: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

2

OverviewOverview

• Bioinformatics: a brief overview

• Organising knowledge: databanks and databases

• Protein sequence analysis– Sequence alignment– Multiple alignment and sequence pofiles– Phylogenetic trees

Page 3: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

I. Bioinformatics - a brief overviewI. Bioinformatics - a brief overview

Page 4: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

4

What is it?What is it?

Bioinformatics:

Deduction of knowledge by computer analysis

of biological data.

or: see 20000 pages on this issue on the WWW

Page 5: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

5

The dataThe data

• information stored in the genetic code (DNA)• protein sequences• 3D structures• experimental results from various sources • patient statistics• scientific literature

Page 6: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

6

Algorithmic developmentsAlgorithmic developments

• Important part of research in bioinformatics:

methods for

data storage

data retrieval

data analysis

Page 7: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

7

Interdisciplinary researchInterdisciplinary research

• rapidly developing branch of biology

• highly interdisciplinary:

using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics.

• many practical applications in biology and medicine.

Page 8: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

8

Computation in biology...Computation in biology...

• similar to other sciences:

computational physics, computational chemistry

derivation of physics laws from astronomical data• already in the '20s biologists wanted to derive

knowledge by induction• reasons for recent development:

development of computers and networks

availability of data (sequences, 3D structures)

amount of data

Page 9: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

9

Why?Why?

• An avalanche of data:• Sequences• Function related• Structures

• requires computational approaches

Page 10: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

10

GenomicsGenomics

• New way to perform experiments: • accumulation of data: • sequences• structures, • function-related

• not hypothesis-driven

• Hypothesis formed later and tested in silico

Page 11: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

11

Bioinformatics key areasBioinformatics key areas

organisation of knowledge (sequences, structures, functional data)

e.g. homology searches

Page 12: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

12

Structural BioinformaticsStructural Bioinformatics

• Prediction of structure from sequence– secondary structure– homology modelling, threading– ab initio 3D prediction

• Analysis of 3D structure– structure comparison/ alignment– prediction of function from structure– molecular mechanics/ molecular dynamics– prediction of molecular interactions, docking

• Structure databases (RCSB)

Page 13: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

13

Structural BioinformaticsStructural Bioinformatics

Page 14: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

II. DatabasesII. Databases

Page 15: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

15

Organizing knowledgeOrganizing knowledgein databanks and databasesin databanks and databases

• Introduction

• Sequence databanks and databases– EMBL, SwissProt, TREMBL

• SRS: Sequence Retrieval system

• 3D structure database: the RCSB - PDB

• Domain databases

Page 16: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

16

Biological databanks and databasesBiological databanks and databases

• Very fast growth of biological data• Diversity of biological data:

– primary sequences– 3D structures– functional data

• Database entry usually required for publication– Sequences– Structures

• Database entry may replace primary publication– genomic approaches

Page 17: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

17

DNA sequence data basesDNA sequence data bases

• Three databanks exchange data on a daily basis• Data can be submitted and accessed at either location

• Genebank– www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html

• EMBL– www.ebi.ac.uk/embl/index.html

• DNA DataBank of Japan (DDBJ)– www.nig.ac.jp/home.html

Page 18: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

18

EMBL database growthEMBL database growth

Page 19: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

19

Distribution of entriesDistribution of entries

Page 20: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

20

EMBL database documentationEMBL database documentation

• Information on• user manual• release notes• feature table definition... see

• http://www.ebi.ac.uk/embl/Documentation

Page 21: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

21

EMBL entry for insulin receptorEMBL entry for insulin receptor

Page 22: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

22

EMBL entry 2: featuresEMBL entry 2: features

Page 23: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

23

EMBL entry 3: sequenceEMBL entry 3: sequence

Page 24: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

24

SwissProt protein sequence data baseSwissProt protein sequence data baseTREMBL translated EMBLTREMBL translated EMBL

• hosted jointly by EBI (European Bioinformatics Institute, an EMBL outstation in Hinxton, UK) and SIB (Swiss Institute for Bioinformatics in Lausanne and Geneva)

• SwissProt is curated (Amos Bairoch)– quality checks– annotations– links to other databases

• TREMBL: automatic translation of EMBL– automatic annotations

Page 25: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

25

ExPASy - www.expasy.orgExPASy - www.expasy.orgExExpert pert PProtein rotein AAnalysis nalysis SySystemstem

Page 26: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 27: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 28: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 29: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

29

FASTA formatFASTA format

one line header, starting with >some programs require several characters without space after >

sequence, in free format (no numbers)

Page 30: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

SWISS-PROT entry for insulin receptorSWISS-PROT entry for insulin receptor(NiceProt view)(NiceProt view)

Page 31: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

31

Features of insulin receptorFeatures of insulin receptor

Page 32: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

32

Niceprot Feature AlignerNiceprot Feature Aligner

Page 33: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

33

Clustalw-alignment of two domainsClustalw-alignment of two domains

Page 34: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

34

Links to other sites (Blast, ...)Links to other sites (Blast, ...)

Page 35: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

35

The RCSB-PDBThe RCSB-PDBwww.rcsb.org/pdbwww.rcsb.org/pdb

• Data bases for 3D structures of biological macromolecules (proteins, nucleic acides)

• RCSB (Research Collaboratory for Structural Bioinformatics) maintains and develops the PDB (Protein Data Bank)

• others:– MMDB (EBI): msd.ebi.ac.uk– NCBI: www.ncbi.nlm.nih.gov/Structure/

Page 36: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

www.rcsb.org/pdbwww.rcsb.org/pdb

Page 37: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

37

Results of a simple queryResults of a simple query

Page 38: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 39: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

39

View structuresView structures

Page 40: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 41: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

41

Domain databasesDomain databases

• Pfam (A/B) www.sanger.ac.uk/Pfam

• Smart smart.embl-heidelberg.de

• Prodom prodes.toulouse.inra.fr/prodom/doc/prodom.html

• Dart www.ncbi.nlm.nih.gov/Structure/lexington/lexington.cgi?cmd=rps

• Interpro www.ebi.ac.uk/interpro/

Page 42: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

42

InterProInterPro

• InterPro release 4.0 (Nov 2001) was built from

• Pfam 6.6, PRINTS 31.0, PROSITE 16.37, ProDom 2001.2,

SMART 3.1, TIGRFAMs 1.2,

SWISS-PROT + TrEMBL data.

• 4691 entries: 1068 domains, 3532 families, 74 repeats and 15 post-translational modification sites.

Page 43: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

43

Results of InterPro search for spectrinResults of InterPro search for spectrin

Page 44: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

44

Spectrin repeatSpectrin repeat

Page 45: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

45

SMART databaseSMART database

Page 46: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

46

Domain architecture of spectrin beta chainDomain architecture of spectrin beta chain

Page 47: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

47

Pfam home pagePfam home page

Page 48: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

48

Compilations of links to databasesCompilations of links to databases

at Institut Pasteur

www.pasteur.fr/recherche/banques

at Infobiogen (Evry)

www.infobiogen.fr/services/deambulum/fr

European bioinformatics institute (ebi)

www.ebi.ac.uk/Databases/index.html

at the swiss institute for bioinformatics (SIB)

www.expasy.org

www.expasy.org/alinks.html#Proteins

Page 49: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

49

SRS sequence retrieval systemSRS sequence retrieval system

• unified way to access and link information in different databases

• powerful queries• launch applications (e.g. blast, clustalw...)• temporary and permanent projects

• can be reached from the pasteur databank page• srs.pasteur.fr/cgi-bin/srs6/wgetz

Page 50: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

50

SRS 6 start pageSRS 6 start page

Page 51: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

51

SRS access to databasesSRS access to databases

Page 52: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

52

SRS quick searchSRS quick search

Page 53: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

53

SRS queriesSRS queries

• queries by simple words• extension of words by wildcards• linked by logical operators (and, or, &, ...)• standard query form has 4 entry fields• display list can be customized

Page 54: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

54

Standard SRS queryStandard SRS query

Page 55: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

55

Query resultQuery result

Page 56: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

56

Linking information with SRSLinking information with SRS

Page 57: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

57

Results of linkResults of link

Page 58: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

III. Sequence alignmentIII. Sequence alignment

Page 59: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

59

Sequence alignmentSequence alignment

• Alignment scoring and substitution matrices• Aligning two sequences

– Dotplots– The dynamic programming algorithm– Significance of the results

• Heuristic methods– FASTA– BLAST– Interpreting the output

Page 60: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

60

Sequence formatsSequence formats

• Examples:• Staden: simple text file, lines <= 80 characters• FASTA: simple text file, lines <= 80 characters, one line

header marked by ">"• GCG: structured format with header and formatted

sequence

• Sequence format descriptions e.g. on http://www.infobiogen.fr/doc/tutoriel/formats.html

Page 61: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

61

GCG sequence formatGCG sequence format

Page 62: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

62

GCG database formatGCG database format

• comments: up to"..."• signal line with idetifier "Check ...."• sequence

Page 63: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

63

Format conversionsFormat conversions

• in GCG: specific command to convert from different formats (e.g., fromstaden)

• readseq: • general conversion program• available on www at pasteur

Page 64: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

64

Protein sequence alignmentProtein sequence alignment(DNA alignment is analogous)(DNA alignment is analogous)

• Local sequence comparison:

• assumption of evolution by point mutations– amino acid replacement (by base replacement)– amino acid insertion– amino acid deletion

• scores:– positive for identical or similar– negative for different– negative for insertion in one of the two sequences

Page 65: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

65

Comparing two sequences: DotPlotComparing two sequences: DotPlot

• Simple comparison without alignment

• Similarities between sequences show up in 2D diagram

Page 66: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

66

Dotplot for a small protein against itselfDotplot for a small protein against itself

identity (i=j)

similarity of sequencewith other parts of itself

Page 67: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

67

Dotplot for two remotely homologous proteinsDotplot for two remotely homologous proteins

Page 68: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Dotplot for protein with internal repeatsDotplot for protein with internal repeats

Page 69: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

69

Spectrin domain structureSpectrin domain structure

Page 70: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

70

3 alignments of globin sequences:3 alignments of globin sequences:right or wrong?right or wrong?

Page 71: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

71

Alignment scoringAlignment scoring

• the 1st alignment: highly significant• the 2nd: plausible• the 3rd: spurious

• distinguish by alignment score

• similarities increase score• mismatches decrease score• gaps decrease score

substitution matrix

gap penalties

Page 72: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

72

Substitution matricesSubstitution matrices

• Substitution matrix weights replacement of one residue by another:

– similar -> high score (positive)– different -> low score (negative)

• simplest is identity matrix (e.g. for nucleic acids)

A C G T

A 1 0 0 0

C 0 1 0 0

G 0 0 1 0

T 0 0 0 1

Page 73: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

73

Derivation of substitution matricesDerivation of substitution matricesPAM matricesPAM matrices

• PAM matrix series (PAM1 ... PAM250): – derived from alignment of very similar sequences– PAM1 = mutation events that change 1% of AA– PAM2, PAM3, ... extrapolated by matrix multiplication– e.g.: PAM2 = PAM1*PAM1; PAM3 = PAM2 * PAM1 etc

• Problems with PAM matrices:– incorrect modelling of long time substitutions, since– conservative mutations dominated by single nucleotide change– e.g.: L <–> I, L <–> V, Y <–> F– long time: any AA change

Page 74: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

positive and negative valuesidentity score depends on residue

Page 75: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

75

BLOSUM matricesBLOSUM matrices

• BLOSUM series (BLOSUM50, BLOSUM62, ...)• derived from alignments of distantly related sequence

• BLOCKS database: – ungapped multiple alignments of protein families

at a given identity

• BLOSUM50 better for gapped alignments• BLOSUM62 better for ungapped alignments

Page 76: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Blosum62 substitution matrixBlosum62 substitution matrix

Page 77: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

77

Gap penaltiesGap penalties

• significance of alignment:• depends critically on gap penalty

• need to adjust to given sequence

• gap penalties influenced by knowledge of structure etc

• simple rules when nothing is known (linear or affine)

Page 78: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

78

Gap penaltiesGap penalties

• linear gap penalty: one constant d for each insertion g• (g) = - g * d with g length of gap

• affine gap penalty:– (large) penalty d for opening of gap– (smaller) penalty e for extension of existing gap

• (g) = - d - (g-1) * e, with g = length of gap

• example d = 10, e = 0.2

Page 79: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

79

Alignment of two sequencesAlignment of two sequences

Page 80: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

80

Alignment algorithmsAlignment algorithms

• maximize score:

match as many positively scoring pairs as possible• minimize cost:

reduce number of mismatches and number of gaps• possibilities to align 2 sequences of length n:

2n

n

⎝ ⎜ ⎜

⎠ ⎟ ⎟ =

(2n)!(n!)

≈22n

πn

Page 81: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

81

Dynamic programming algorithmDynamic programming algorithm

• dynamic programming =

build up optimal alignment

using previous solutions

for optimal alignments of subsequences

Page 82: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

82

Dynamic programming algorithmDynamic programming algorithm

• define a matrix Fij:

Fij is the optimal alignment of

subsequence A1...i and B1...j

• iterative build up: F(0,0) = 0• define each element i,j from

(i-1,j): gap in sequence A

(i, j-1): gap in sequence B

(i-1, j-1): alignment of Ai to Bj

Page 83: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

83

Dynamic programmingDynamic programming

Page 84: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

84

Scores from substitution matrixScores from substitution matrix

Page 85: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

85

(1) Initialize boundaries(1) Initialize boundaries

Page 86: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

86

(2) Fill matrix with minimum score sums..(2) Fill matrix with minimum score sums..

Page 87: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

87

from top left cornerfrom top left corner

Page 88: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

88

Filled matrix: score in right bottom cornerFilled matrix: score in right bottom corner

Page 89: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

89

(3) Backtracing gives alignment(3) Backtracing gives alignment

Page 90: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

90

Alternative optimum alignmentAlternative optimum alignment

Page 91: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

91

Alignment algorithmsAlignment algorithms

• global alignment (ends aligned)– Needleman & Wunsch, 1970

• local alignment (subsequences aligned)– Smith & Waterman, 1981

• searching for repetitions

• searching for overlap

Page 92: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

92

Example output of GCG program bestfitExample output of GCG program bestfit

• alignment score depends on score matrix

• percent similarity - percent identity

• affine gap penalty favours grouping of gaps

Page 93: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 94: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

94

Database searches: FASTA and BLASTDatabase searches: FASTA and BLAST

• Full Smith-Waterman search expensive (O(mn))

• database contains > 100 million residues

• heuristic programs concentrate on important regions

• evaluate few cell in the dynamic programming matrix

Page 95: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

95

FASTAFASTA

• multi-step approach to find high-scoring alignments

• (1) exact short word matches• (2) maximal scoring ungapped extensions• (3) identify gapped alignments

Page 96: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

96

• lookup table to find all identically matching words

• length ktup• ktup = 1,2 for proteins• ktup = 4-6 for DNA

Page 97: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

97

• Scoring the words with the substitution matrix

Page 98: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

98

• extend exact word matches to find maximal scoring ungapped regions

Page 99: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

99

• join ungapped regions in one gapped region

• highest scoring candidate matches are realigned in a narrow band around match

Page 100: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

100

BLASTBLAST

• multi-step approach to find high-scoring alignments

• (1) list words of fixed length (3AA) expected to give score larger than threshold

• (2) for every word, search database and extend ungapped alignment in both directions

• (3) new versions of BLAST allow gaps

Page 101: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

101

BLAST program suiteBLAST program suite

• various versions:

• blastn: nucleotide sequences• blastp: protein sequences• tblastn: protein query - translated database• blastx: nucleotide query - protein database• tblastx: nucleotide query - translated database

Page 102: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

http://www.ncbi.nlm.nih.gov/BLAST

Page 103: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

103

Multiple sequence alignmentMultiple sequence alignmentand sequence profilesand sequence profiles

• Scoring a multiple sequence alignment

• An alignment algorithm: CLUSTALW

• Sequence profiles and profile searches

Page 104: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

104

Multiple sequence alignmentMultiple sequence alignment

• compare set of sequences• align homologous residues in columns

• homologous residues:– evolutionary: diverge from common ancestral residue– structurally: occupy similar position in space

• generally impossible to get single "correct" alignment• focus on key residues and align them in columns

Page 105: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

105

Example: part of haemoglobin alignmentExample: part of haemoglobin alignment

Page 106: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002
Page 107: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

107

Scoring multiple sequence alignmentScoring multiple sequence alignment

• take into account:• (1) some positions more conserved than others• (2) sequences are not independent but related in a

phylogenetic tree

• approximation: assume columns of alignment are statistically independent

• total score of alignment is sum of column scores• each column score is a sum of all sequence pairs

Page 108: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

108

Multiple sequence alignment algorithmsMultiple sequence alignment algorithms

• multidimensional dynamic programming– very expensive, only possible for few sequences

• progressive alignment methods– construct a series of pair-wise alignments

Page 109: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

109

CLUSTALW and CLUSTALXCLUSTALW and CLUSTALX

• align all sequence pairs by dynamic programming• convert alignment into evolutionary distances• construct a "guide tree" • align nodes of the tree in order of decreasing similarity

– sequence-sequence– sequence-profile– profile-profile alignment

Page 110: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

110

Guide treeGuide tree

• Guide tree is a "quick and dirty" phylogenetic tree

• Clustal alignment starts at the right (the leaves)

• progresses to the left

• aligned sequences: • sequence profile

Page 111: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

111

CLUSTALCLUSTAL

• other important features:

• sequences are weighted to compensate for bias• substitution matrix depending on expected similarity:• similar sequences with "hard" matrices (BLOSUM80)• distant sequences with "soft" matrices (BLOSUM50)• position specific gap open penalties

Page 112: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

112

Sequence profilesSequence profiles

• multiple sequence alignment -> sequence profile

• evolutionary relationship

• "sequence-specific substitution matrix"

• very sensitive database searches

Page 113: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

113

Sequence profileSequence profile

Page 114: Bioinformatics and sequence analysis Michael Nilges Unité de Bio-Informatique Structurale Institut Pasteur, Paris Mars 2002

Bioinformatics lecture March 5, 2002

114

Profile searchesProfile searches

• "by hand": – database search (Smith-Waterman)– multiple sequence alignment– calculation of profile– profile database search

• possible at http://eta.embl-heidelberg.de:8000

• less sensitive but much easier: psi-blast at NCBI