Computing in Molecular Biology

Sequence Comparison & Alignment


Hugues Sicotte

National Center for Biotechnology Information

[email protected]


C O M P A R A T I V E A N A L Y S I S

Finches of the Galápagos Islands observed by Charles Darwin on the voyage of HMS Beagle

Sequence alignment is similar to other types of comparative analysis

Involves scoring similarities and differences among a group of related entities


Homology

“Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.”

David B. Wake



Alignment algorithms model evolutionary processes

Derivation from a common ancestor through incremental change due to DNA replication errors, mutations, damage, or unequal crossing-over.

[Figure: the ancestral sequence GATTACCA diverges through substitution (GATGACCA, GATTATCA, GATCATCA), insertion (GATTGATCA) and deletion of a T (GATACCA)]





Only extant sequences are known; ancestral sequences are postulated.

Extant sequences: GATCATCA, GATTGATCA, GATTACCA, GATACCA


The term homology implies a common ancestry, which may be inferred from observations of sequence similarity



Derivation from a common ancestor through incremental change. Mutations that do not kill the host may carry over to the population. Rarely are mutations kept/rejected by natural selection.


Comparative Analysis of Genes

MSH2_Human TGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATK
SPE1_DROME VGTAVLMAHIGAFVPCSLATISMVDSILGRVGASDNIIKGLSTFMVEMIETSGIIRTATD
MSH2_Yeast VGVISLMAQIGCFVPCEEAEIAIVDAILCRVGAGDSQLKGVSTFMVEILETASILKNASK
MUTS_ECOLI TALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILRNATE
                *** **  **         * *  ****        **** *  **  *   *

[Figure: species tree of Bacteria, Yeast, Worm, Fly, Mouse and Human, with branch points labeled 3000 Myr, 1000 Myr and 500 Myr]

Human Colon Cancer MSH2 gene is homologous to DNA repair proteins

Align Extant Sequences


Why Align sequences?

- Finding similar sequences helps determine the properties and function of a new sequence. (Must be verified experimentally.)

- Conserved positions in homologous sequences hint at functionally important sites in proteins (active or catalytic sites, DNA-binding domains, disulfide bridges, structural bends, hydrophobic pockets, protein-binding domains, …).

- Conserved nucleotides can hint at regulatory elements, either pre-transcriptional or post-transcriptional.

Sound alignment methods reflect evolution.

DNA Evolution:

- Mutation: Errors in DNA replication or DNA repair.

- Substitutions: replacement of one base by another.

- Deletions/insertions: by DNA mispairing during replication or unequal crossing over.

- Gene conversion or unequal crossing over: large segments of DNA can be inserted/deleted.

- Mutations that do not kill the host are propagated. Sometimes positive mutations are selected for.

Reference: Wen-Hsiung Li, Molecular Evolution, Sinauer Associates, 1997.


Synonymous versus non-synonymous mutations

[Bar chart: substitution rate per nucleotide site per billion years (axis from 0 to 5) for coding regions (non-degenerate, twofold degenerate and fourfold degenerate sites), non-coding regions (5' flank, 5' UTR, introns, 3' UTR, 3' flank) and pseudogenes]

Different regions evolve at different rates, consistent with evolutionary constraints.


Alignment definition and types:

Alignment:

All bases are aligned with another base or with a gap (symbol “-” or sometimes “.”).

Each base is used at most once.

Align BILLGATESLIKESCHEESE and GRATEDCHEESE

Global Alignment:

G-ATESLIKESCHEESE
GRATED-----CHEESE

Local Alignments:

Do not need to align all the bases in all sequences.

G-ATES & CHEESE
GRATED & CHEESE



GATTATACCA
GATTA---CA

Insertions and deletions (‘indels’) are represented by gaps in alignments. The example above has a gap of length 3.


S E Q U E N C E A L I G N M E N T

An alignment provides a mapping of residues in one sequence onto those of another

Conserved residues are often of structural or functional importance

Alignment of trypsin sequences from mouse and crayfish

Figure 7.1

*
Mouse    IVGGYNCEENSVPYQVSLNS-----GYHFCGGSLINEQWVVSAGHCYK-------SRIQV
Crayfish IVGGTDAVLGEFPYQLSFQETFLGFSFHFCGASIYNENYAITAGHCVYGDDYENPSGLQI

*
Mouse    RLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTA
Crayfish VAGELDMSVNEGSEQTITVSKIILHENFDYDLLDNDISLLKLSGSLTFNNNVAPIALPAQ

Mouse    PPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPG-KITSNMFCVGFLE
Crayfish GHTATGNVIVTGWG-TTSEGGNTPDVLQKVTVPLVSDAECRDDYGADEIFDSMICAGVPE

*
Mouse    GGKDSCQGDSGGPVVCNG----QLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAAN
Crayfish GGKDSCQGDSGGPLAASDTGSTYLAGIVSWGYGCARPGYPGVYTEVSYHVDWIKANAV--


Conserved positions are often of functional importance.

Alignment of trypsin proteins of mouse (Swiss-Prot P07146) and crayfish (Swiss-Prot P00765). Identical residues are highlighted red and underlined.

Indicated above the alignment are three disulfide bonds (-S-S-) whose participating cysteine residues are conserved, amino acids whose side chains are involved in the charge relay system (asterisk) and the active site residue which governs substrate specificity (diamond). The other conserved positions have no known role. These conserved residues could be coincidentally conserved or have some unknown structural role.





Figure 7.2

CLUSTAL W (1.7) multiple sequence alignment

Human-Zcr MATGQKLMRAVRVFEFGGPEVLKLRSDIAVPIPKDHQVLIKVHACGVNPVETYIRSGTYS
Ecoli-QOR ------MATRIEFHKHGGPEVLQA-VEFTPADPAENEIQVENKAIGINFIDTYIRSGLYP
          : :...:.******: ::: . * :::: :: :* *:* ::****** *.

Human-Zcr RKPLLPYTPGSDVAGVIEAVGDNASAFKKGDRVFTSSTISGGYAEYALAADHTVYKLPEK
Ecoli-QOR -PPSLPSGLGTEAAGIVSKVGSGVKHIKAGDRVVYAQSALGAYSSVHNIIADKAAILPAA
          * ** *::.**::. **.... :* ****. :.: *.*:. ... **

Human-Zcr LDFKQGAAIGIPYFTAYRALIHSACVKAGESVLVHGASGGVGLAACQIARAYGLKILGTA
Ecoli-QOR ISFEQAAASFLKGLTVYYLLRKTYEIKPDEQFLFHAAAGGVGLIACQWAKALGAKLIGTV
          :.*:*.** : :*.* * :: :*..*..*.*.*:***** *** *:* * *::**.

Human-Zcr GTEEGQKIVLQNGAHEVFNHREVNYIDKIKKYVGEKGIDIIIEMLANVNLSKDLSLLSHG
Ecoli-QOR GTAQKAQSALKAGAWQVINYREEDLVERLKEITGGKKVRVVYDSVGRDTWERSLDCLQRR
          ** : : .*: ** :*:*:** : ::::*: .* * : :: : :.. . .:.*. *.:

Human-Zcr GRVIVVG-SRGTIEINPRDTMAKES----SIIGVTLFSSTKEEFQQYAAALQAGMEIGWL
Ecoli-QOR GLMVSFGNSSGAVTGVNLGILNQKGSLYVTRPSLQGYITTREELTEASNELFSLIASGVI
          * :: .* * *:: . : ::. : .: : :*:**: : : * : : * :

Human-Zcr KPVIGSQ--YPLEKVAEAHENIIHGSGATGKMILLL
Ecoli-QOR KVDVAEQQKYPLKDAQRAHE-ILESRATQGSSLLIP
          * :..* ***:.. .*** *:.. .: *. :*:

Stars indicate identical residues and dots indicate conservative substitutions

Human zeta crystallin vs E.coli quinone oxidoreductase


Score and Statistics

G-ATESLIKESCHEESE   AND/OR   G-ATES & CHEESE
GRATED-----CHEESE            GRATED & CHEESE

Percent identity: can be misleading.

Score: A simple quality measure is the “score”. The score assigns points for each aligned base (or gap) of the alignment.

identical bases: “match” score

mismatching bases: “mismatch” score

gaps: “gap opening” penalty for starting a gap, plus a “gap extension” penalty for each gap symbol.

Example: match = +1, mismatch = -1, gap opening = -5, gap extension = -1

For the global alignment above (10 matches, 1 mismatch, one gap of length 1 and one gap of length 5):

Score = 10*(+1) + 1*(-1) + (-5 + 1*(-1)) + (-5 + 5*(-1)) = -7
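The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the two alignment rows have equal length and that “-” is the gap symbol; the function name `alignment_score` and its parameter names are chosen here for illustration and do not come from any particular program.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1,
                    gap_open=-5, gap_extend=-1):
    """Score two aligned rows of equal length; '-' marks a gap symbol."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap symbols
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            if not in_gap:
                score += gap_open   # charged once per gap
            score += gap_extend     # charged for every gap symbol
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(alignment_score("G-ATESLIKESCHEESE", "GRATED-----CHEESE"))  # -7
```

For simplicity the sketch treats a run of gap symbols as one gap even if the gap switches from one row to the other; a production scorer might charge a new opening penalty in that case.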


S C O R I N G S Y S T E M S

Which alignment is “better”?

Alignment 1: 3 mismatches, 1 gap

GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC

Alignment 2: 0 mismatches, 5 gaps

GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC

High penalty for “opening” a gap (e.g. G = 5)

Lower penalty for “extending” a gap (e.g. L = 1)

Alignment 1: Penalty = 1G + 6L = 11

Alignment 2: Penalty = 5G + 6L = 31


L O C A L S I M I L A R I T Y

Figure 7.3

F12:  F2  E  F1  E  K  Catalytic

PLAT: F1  E  K  K  Catalytic

Mix-and-match protein modules confound alignment algorithms

Protein modules in coagulation factor XII (F12) and tissue plasminogen activator (PLAT)

F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity



Figure 7.3

Shared modules can occur in reverse order or as repeated modules.


D O T P L O T S

Figure 7.4

Dot-plot Fitch : Biochem. Genet. (1969)3,99-108

    C G T A C C G T
A   0 0 0 1 0 0 0 0
C   1 0 0 0 1 1 0 0
G   0 1 0 0 0 0 1 0
T   0 0 1 0 0 0 0 1

Horizontal axis is coordinates for one sequence

Vertical axis is coordinates for the other


D O T P L O T S

Figure 7.4b

Dot-plot Fitch : Biochem. Genet. (1969)3,99-108

One can also score a sliding window rather than one position at a time. For example, a window of 3 nucleotides, where we score 1 for identical triplets and 0 for all other combinations, yields:

      CGT GTA TAC ACC CCG CGT
ACG    0   0   0   0   0   0
CGT    1   0   0   0   0   1
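The dot-plot matrices for ACGT against CGTACCGT can be reproduced with a short Python sketch. The function name `dot_matrix` and its `window` parameter are illustrative choices, not part of any published tool; a window of 1 gives the single-position plot and a window of 3 gives the triplet plot.

```python
def dot_matrix(vertical, horizontal, window=1):
    """Return a 0/1 matrix: 1 where the two windows are identical."""
    rows = len(vertical) - window + 1
    cols = len(horizontal) - window + 1
    return [
        [1 if vertical[i:i + window] == horizontal[j:j + window] else 0
         for j in range(cols)]
        for i in range(rows)
    ]


# Single-position dot plot, then the window-of-3 version.
for row in dot_matrix("ACGT", "CGTACCGT"):
    print(*row)
print()
for row in dot_matrix("ACGT", "CGTACCGT", window=3):
    print(*row)
```

Real dot-plot programs usually score the window with a similarity threshold instead of requiring exact identity; exact matching is used here only because it matches the example above.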




D O T P L O T S

[Figure 7.4: dot plot of Tissue Plasminogen Activator (PLAT) against Coagulation Factor XII (F12)]




D O T P L O T S

[Figure 7.4: the same dot plot of Tissue Plasminogen Activator (PLAT) against Coagulation Factor XII (F12), with the protein modules (F1, F2, E, K, Catalytic) labeled along each axis]

Plot dots for high similarity within a short window

Adjacent dots merge to form diagonal segments



Repeated domains show a characteristic pattern


P A T H G R A P H S

Figure 7.5

PLAU 90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137
PLAT 23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE 72

EGF similarity domains of urokinase plasminogen activator (PLAU) and tissue plasminogen activator (PLAT)

Dot plots suggest paths through the alignment space

Path graphs are more explicit representations

Each path is a unique alignment


Routing a phone call from Washington DC to San Francisco

P A T H G R A P H S

Best-path problems are common in computer science

A best-path algorithm used for sequence alignment is called ‘dynamic programming’


D Y N A M I C P R O G R A M M I N G

Dynamic Programming Example

Construct an optimal alignment of these two sequences:

G A T A C T A
G A T T A C C A

Using these scoring rules:

Match: +1
Mismatch: -1
Gap: -1



G A T A C T A
G A T T A C C A

Arrange the sequence residues along a two-dimensional lattice

Vertices of the lattice fall between letters




The goal is to find the optimal path from one corner of the lattice (the start of both sequences) to the opposite corner (the end of both sequences)




Each path corresponds to a unique alignment

Which one is optimal?




The score for a path is the sum of its incremental edge scores

A aligned with A: Match = +1

A aligned with T: Mismatch = -1

T aligned with NULL, or NULL aligned with T: Gap = -1




Incrementally extend the path

Remember the best sub-path leading to each point on the lattice

[Figure: the lattice is filled in one vertex at a time, each vertex holding the best score of any path that reaches it]


Trace-back to get optimal path and alignment





Print out the alignment

GA-TACTA
GATTACCA
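The lattice-filling and trace-back steps can be sketched as follows. This is a minimal illustration of global dynamic programming with the example's scoring rules (match +1, mismatch -1, gap -1); the function name `needleman_wunsch` and the tie-breaking order in the trace-back are choices made here, so when several alignments share the optimal score the one printed may differ from the one shown on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best score, aligned row for a, aligned row for b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score of any path reaching lattice vertex (i, j)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # diagonal edge
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the last vertex to the first.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("GATACTA", "GATTACCA")
print(best)
print(top)
print(bottom)
```

The sketch keeps the whole matrix in memory, which is fine for short sequences; it uses the same simple per-symbol gap penalty as the example rather than separate opening and extension penalties.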


Two different types of Alignment

Global Alignment methods:

Needleman & Wunsch (J. Mol. Biol. (1970) 48, 443-453): Problem of finding the best path. Revelation: any partial sub-path that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path “scores”, each entry being the score of the best path leading to that point. Trace the optimal path from one end of the two sequences to the other.

Local Alignment methods:

Smith & Waterman (J. Mol. Biol. (1981) 147, 195-197): Use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest scoring points in the path graph.

FASTP (Lipman & Pearson (1985), Science 227, 1435-1441)

BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): does not report all paths, but only attempts to find paths where there are high-scoring words. This speeds up the alignments considerably.


G L O B A L & L O C A L S I M I L A R I T Y

Implementations of dynamic programming for global and local similarities

Optimal global alignment

Needleman & Wunsch (1970)

Sequences align essentially from end to end

Optimal local alignment

Smith & Waterman (1981)

Sequences align only in small, isolated regions
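A local alignment can be sketched by changing two things in the global recipe: scores are never allowed to drop below zero, and the trace-back starts at the highest-scoring vertex and stops when the score reaches zero. The code below is a simplified illustration that reports only the single best local region; the function name `smith_waterman` and the scoring values (match +1, mismatch -1, gap -1) are assumptions made for this example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best local score, aligned piece of a, aligned piece of b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                           # start a new path
                              score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best vertex until the score falls to zero.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


best, piece_a, piece_b = smith_waterman("BILLGATESLIKESCHEESE", "GRATEDCHEESE")
print(best, piece_a, piece_b)
```

Reporting further non-overlapping regions, as described above, would require masking the cells already used and repeating the search; that step is left out to keep the sketch short.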


Score and Statistics

Some amino acid mutations do not affect structure/function very much. Amino acids with similar physico-chemical and steric properties can often replace each other.

Use a scoring system that does not heavily penalize mutations to a similar amino acid.

PAM matrices: Point Accepted Mutation. Defined in terms of a divergence unit of 1 PAM, that is, 1 accepted point mutation per 100 residues. For distant sequences use PAM250, while for closer sequences (like DNA) use PAM100. Some sites accumulate mutations and others do not, so using the PAM100 matrix does not mean that the sequences compared were 100% mutated.

BLOSUM: BLOCKS substitution matrices. Started with the BLOCKS database of multiple alignments involving only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involved alignments of more distant sequences. BLOSUM matrices (BLOSUM62) are recommended for most protein alignments.



BLOSUM62

Figure 7.8

A 4

R -1 5

N -2 0 6

D -2 -2 1 6

C 0 -3 -3 -3 9

Q -1 1 0 0 -3 5

E -1 0 0 2 -4 2 5

G 0 -2 0 -1 -3 -2 -2 6

H -2 0 1 -1 -3 0 0 -2 8

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4

A R N D C Q E G H I L K M F P S T W Y V

Some amino acid substitutions are more common than others

Substitution scores come from an odds ratio based on measured substitution rates
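As a rough sketch of what “odds ratio” means here: the score for aligning residues a and b compares how often the pair is observed in trusted alignments with how often it would occur by chance given the individual residue frequencies. The Python below assumes scores in half-bit units (two times the base-2 logarithm, rounded), which is the convention usually quoted for BLOSUM62; the frequencies in the usage lines are made-up numbers for illustration, not measured values.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen twice as often as chance
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen half as often as chance
```

A pair seen more often than chance gets a positive score and a pair seen less often gets a negative score, which is the pattern visible in the matrix above.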





Identities get positive scores, but some are better than others





Some non-identities have positive scores, but most are negative


D A T A B A S E S E A R C H I N G

Compare one query sequence against an entire database

A typical search has four basic elements

> fasta myquery swissprot -ktup 2

search program: fasta
query sequence: myquery
sequence database: swissprot
optional parameters: -ktup 2



With exponential database growth, searches keep taking more time


searching . . . . . .



The “hit list” gives titles and scores for matched sequences

> fasta myquery swissprot -ktup 2
The best scores are: initn init1 opt z-sc E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)- 996 996 996 1262.1 0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL) 412 382 395 507.6 1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI 238 133 316 407.4 5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT- 153 98 190 253.1 2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7 163 163 184 244.8 6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN 164 164 170 227.2 5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI 130 91 157 210.3 5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1 125 125 148 199.7 0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4 42 42 140 191.3 0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9 128 73 139 188.7 0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT- 76 76 133 181.0 0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1 27 27 119 165.2 0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO 66 66 118 163.0 0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO 65 65 116 160.5 0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT- 52 52 117 160.3 0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO 66 66 115 159.3 0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO 66 66 112 155.5 0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN 73 73 112 155.4 0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN 76 76 110 153.8 0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP 58 58 104 138.5 0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE 47 47 103 137.8 0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T 63 63 98 131.3 1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA 58 58 99 129.4 1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA 70 48 91 122.9 3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE 75 50 92 121.9 4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU 36 36 85 121.3 4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC 36 36 84 120.0 5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA 45 45 90 118.9 6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA 48 48 92 117.4 7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED 59 59 89 117.0 8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC 48 48 97 117.0 8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO 38 38 83 116.8 8.3


E-value

“Hits” can be sorted according to their E-value or their score.

The E-value is better known as the EXPECT value and is a function of score, database size and query sequence length.

E-value: number of alignments with a score >= S that you expect to find if the database were a collection of random letters.

e.g. For a score of 1, one only requires 1 match, and there should be an enormous number of alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find an insignificant number of alignments that could be due to chance.

E-values of less than 1e-6 (1 × 10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that.

Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence, …).
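To make the dependence on score, database size and query length concrete, here is a small sketch using one common form of the expectation from Karlin-Altschul statistics for ungapped local alignments, E = K * m * n * exp(-lambda * S). The slides do not give a formula, so this choice is an assumption, and the values of `lam` and `k` used below are illustrative placeholders; real values depend on the scoring matrix and gap penalties.

```python
import math


def expect_value(score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


# Illustrative parameter values only (assumed, not taken from the slides).
LAM, K = 0.318, 0.134
for s in (20, 40, 80):
    print(s, expect_value(s, query_len=150, db_len=30_000_000, lam=LAM, k=K))
```

The printed values fall off exponentially with the score, and doubling the database size doubles the E-value, which is why the same alignment looks less significant in a larger database.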


E-value

Another type of statistic is the P-value, which, given a score S for an alignment, is the probability that an alignment of the query against a database of random sequences has a score >= S. For gapless alignments the P-value can be computed from theory.

Sometimes the alignment algorithm or a biologically complex database does not allow the P-value to be computed from the statistical theory of a uniform database. In this case, one uses an alternative statistic, the Z-value (e.g. FASTA suite), which shuffles the query sequence and thus creates many compositionally identical query sequences. Each random sequence is then re-queried against the database. When done enough times, this provides a distribution of scores which is approximately normally distributed (if lucky) around some mean.

Z-value = (score - mean of the shuffled scores) / standard deviation

A Z-value of 3 or greater is good.

[Figure: probability distribution of scores, marking the alignment score S, the standard deviation, and the deviation of S from the mean]
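The shuffling idea can be sketched in Python as below. This is a simplified illustration, not how any particular program computes its z-score: it scores shuffled queries against a single subject sequence rather than a whole database, and the scoring function `shared_words` (a count of shared 3-letter words) is a hypothetical stand-in for a real alignment score.

```python
import random
import statistics


def z_value(score, null_scores):
    """Distance of `score` from the null mean, in standard deviations."""
    return (score - statistics.mean(null_scores)) / statistics.stdev(null_scores)


def shuffled_scores(query, subject, score_fn, n=200, seed=0):
    """Score n shuffled copies of the query (same composition) against subject."""
    rng = random.Random(seed)
    letters = list(query)
    scores = []
    for _ in range(n):
        rng.shuffle(letters)
        scores.append(score_fn("".join(letters), subject))
    return scores


def shared_words(a, b, k=3):
    """Hypothetical toy score: number of k-letter words of `a` found in `b`."""
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return sum(a[i:i + k] in words_b for i in range(len(a) - k + 1))


query = "MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV"
subject = "MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV"
null = shuffled_scores(query, subject, shared_words)
print(z_value(shared_words(query, subject), null))
```

Shuffling preserves the amino acid composition of the query, so the null scores reflect what composition alone can produce.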



Detailed alignments are shown farther down in the output


>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

10 20 30 40 50
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
: X: .:.:: :.:: ::..:::::: : : : :..:: :.:..:::
gi|170 MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF
10 20 30 40 50 60

60 70 80 90 100 110
gi|170 QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
....: :.:: : ... ....::: .::::: :::::..::: .:: .:: .: :X.:
gi|170 TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK
70 80 90 100 110 120

120 130 140
gi|170 HDKEDFPASWRSEEEMAAEAAALRVYFQ
..
gi|170 NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE
130 140 150 160 170 180

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

10 20 30 40
gi|170 MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
:.. :. .v^: :.. ..:::: ::.::::::. ::X :


H A S H I N G M E T H O D S

[Figure: dynamic programming matrix with the query sequence along one axis and the database sequence along the other]

The simplest database search is one large dynamic programming problem.

For a query of N letters against a database of M letters, it requires M×N comparisons.



Hashing is a common method for accelerating database searches

Query sequence: MLIIKRDELVISWASHERE

All overlapping words of size 3: MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE

Compile “dictionary” of words from the query sequence. Put each word in a look-up table that points to the original position in the sequence. Thus given one word, you can know if it is in the query in a single operation.


Index lookup

Each word is assigned a unique integer.

E.g. for a word of 3 letters made up of an alphabet of 20 letters.

1. Assign a code to each letter Code(l) (0 to 19)

2. For a word of 3 letters L1 L2 L3 the code is

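$\mathrm{index} = \mathrm{Code}(L_1)\times 20^{2} + \mathrm{Code}(L_2)\times 20^{1} + \mathrm{Code}(L_3)$

3. Have an array with a list of the positions that have that word.

[Figure: array indexed by word code (0, 1, 2, 3, ...); each entry lists the positions in the query sequence where that word occurs]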
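The three steps can be sketched directly in Python. The letter ordering in `ALPHABET` (alphabetical one-letter amino acid codes) is an arbitrary assumption; any fixed assignment of the codes 0 to 19 works, and the function names are illustrative.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"          # 20 amino acids, codes 0..19
CODE = {letter: i for i, letter in enumerate(ALPHABET)}


def word_index(word):
    """Unique integer for a word: Code(L1)*20^2 + Code(L2)*20 + Code(L3)."""
    index = 0
    for letter in word:
        index = index * len(ALPHABET) + CODE[letter]
    return index


def build_lookup(query, k=3):
    """Array with one slot per possible word, listing query positions."""
    table = [[] for _ in range(len(ALPHABET) ** k)]
    for pos in range(len(query) - k + 1):
        table[word_index(query[pos:pos + k])].append(pos)
    return table


table = build_lookup("MLIIKRDELVISWASHERE")
print(word_index("MLI"), table[word_index("MLI")])
print(table[word_index("KRD")])
```

The array has 20^3 = 8000 slots for 3-letter words, most of them empty; that is the price paid for answering “is this word in the query?” with a single array access.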



Building the dictionary for the query sequence requires (N-2) operations.


The database contains (M-2) words, and it takes only one operation to see if the word was in the query.



Scan the database, looking up words in the dictionary

Use word hits to determine where to search for alignments

This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of M×N.
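A sketch of the scanning step is shown below, using a Python dictionary as the word table. Each hit is a pair (query position, database position); hits that share the same difference of positions lie on the same diagonal of the matrix, which is where a search for an alignment would then be focused. The function names and the short database string are made up for illustration.

```python
from collections import defaultdict


def build_dictionary(query, k=3):
    """Map each k-letter word of the query to its start positions."""
    words = defaultdict(list)
    for pos in range(len(query) - k + 1):
        words[query[pos:pos + k]].append(pos)
    return words


def word_hits(query, database, k=3):
    """Scan the database once; return (query_pos, db_pos) for shared words."""
    words = build_dictionary(query, k)
    hits = []
    for db_pos in range(len(database) - k + 1):
        for q_pos in words.get(database[db_pos:db_pos + k], ()):
            hits.append((q_pos, db_pos))
    return hits


def hits_by_diagonal(hits):
    """Group hits by diagonal (db_pos - query_pos)."""
    diagonals = defaultdict(list)
    for q_pos, db_pos in hits:
        diagonals[db_pos - q_pos].append((q_pos, db_pos))
    return dict(diagonals)


hits = word_hits("MLIIKRDELVISWASHERE", "GGKRDELVGG")
print(hits)
print(hits_by_diagonal(hits))
```

In this toy case four consecutive words hit on one diagonal, which signals a shared ungapped segment worth extending or aligning in detail.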



FASTA searches in a band

BLAST extends from word hits


Multiple Alignment

FHIT_HUMAN MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...

APH1_SCHPO MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...

HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIV PGHVLI...

Y866_METJA MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...

A true multiple alignment method will align all the sequences together at the same time.

FHIT_HUMAN -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...

APH1_SCHPO -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...

HNT2_YEAST MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...

Y866_METJA -----------MCIF CKIINGEIP-AKVVY EDEHVLAFLDINPRN KGHTLV...



Unfortunately, there is no formal computationally tractable method for more than 3 sequences.

There are many approximate methods, such as Progressive multiple alignment methods.


Progressive Multiple Alignment

Pairwise alignments: compute distance matrix. Align all pairs of sequences.

             FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO   395
HNT2_YEAST   316         380
Y866_METJA   290         300         340
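One way to turn pairwise scores like these into an alignment order is to join the most similar pair first and then repeatedly join the most similar remaining groups. The sketch below uses average similarity between groups, which is a simplification chosen for illustration and not necessarily the clustering used by any given program; the scores are the pairwise values from the matrix above.

```python
from itertools import combinations

PAIR_SCORES = {
    frozenset(("FHIT_HUMAN", "APH1_SCHPO")): 395,
    frozenset(("FHIT_HUMAN", "HNT2_YEAST")): 316,
    frozenset(("APH1_SCHPO", "HNT2_YEAST")): 380,
    frozenset(("FHIT_HUMAN", "Y866_METJA")): 290,
    frozenset(("APH1_SCHPO", "Y866_METJA")): 300,
    frozenset(("HNT2_YEAST", "Y866_METJA")): 340,
}


def leaves(tree):
    """List the sequence names under a tree node (a name or a pair)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def similarity(a, b, scores):
    """Average pairwise score between the leaves of two nodes."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(scores[frozenset(p)] for p in pairs) / len(pairs)


def build_guide_tree(names, scores):
    """Greedily join the two most similar nodes until one tree remains."""
    nodes = list(names)
    while len(nodes) > 1:
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda p: similarity(nodes[p[0]], nodes[p[1]], scores))
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]


names = ["FHIT_HUMAN", "APH1_SCHPO", "HNT2_YEAST", "Y866_METJA"]
print(build_guide_tree(names, PAIR_SCORES))
```

With these scores FHIT_HUMAN and APH1_SCHPO are joined first, then HNT2_YEAST, then Y866_METJA, which is the order in which the sequences would be added to the growing alignment.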


Progressive Multiple Alignment

Guide Tree

[Figure: guide tree built from the pairwise scores, with leaves APH1_SCHPO, HNT2_YEAST, Y866_METJA and FHIT_HUMAN]


Multiple Alignment

Align two closest sequences

FHIT_HUMAN MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...

APH1_SCHPO MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...

HNT2_YEAST MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...


FHIT_HUMAN MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...

APH1_SCHPO MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...


Y866_METJA MCIF CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...

This alignment creates a consensus sequence that is next used to align subsequent sequences.

From the point of view of this pairwise alignment, the gap can be inserted anywhere in the green region (between the 1st M and base 13 (S)).


Multiple Alignment

Align next closest sequence to the consensus.

FHIT_HUMAN MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...

APH1_SCHPO MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...


Y866_METJA MCIFCKIINGEIP-AKVVYEDEHVLAFLDINPRNKGHTLV...

FHIT_HUMAN -----------MSF RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...

APH1_SCHPO -----------MPK QLYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...

HNT2_YEAST MILSKTKKPKSMNK PIYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...

Y866_METJA MCIF CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...

Once inserted, gap positions cannot move because they are part of the consensus.






Multiple Alignment

FHIT_HUMAN -----------MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...

APH1_SCHPO -----------MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...

HNT2_YEAST MILSKTKKPKSMNKP IYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...

Y866_METJA -----------MCIF CKIINGEIPAKVVY EDEHVLAFLDINPRN KGHTLV...

Align next closest sequence to new consensus.

Hopefully, the result should be similar to what a true multiple alignment method would have yielded. We saw that the order of alignment determines the existence of gaps.

Because of the order of alignments, the gap position cannot be changed to align these two P, which would have resulted in a higher score.


CLUSTALW

Clustalw is a progressive multiple alignment tool.

- Adaptive gap opening and extension scores, making it relatively insensitive to small changes in gap parameters.

- Choice of DNA or protein gap penalties for the alignments.

- Available on the web or on PC/Mac/unix.

http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html

The uppercase “O” in options is relevant.


BLAST and BLAST2SEQUENCES

BLAST is a database search engine that uses hashing to accelerate the search.

blastn (for nucleotides)

blastp (for proteins)

blastx (translates a nucleotide query in all 6 reading frames and compares it to a protein database)

tblastn (compares a protein against a nucleotide database translated in all 6 reading frames)

tblastx (compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames)
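“All 6 reading frames” means three codon offsets on the given strand and three on its reverse complement. The sketch below illustrates the idea with the standard genetic code; it is not the code used by BLAST itself, and the function names are chosen for illustration. Stop codons are written as “*” and codons containing other letters as “X”.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of `dna`."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Frames +1, +2, +3 on the given strand, then the same on the reverse."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:])
            for strand in strands for offset in range(3)]


for frame in six_frame_translations("ATGGCCTAA"):
    print(frame)
```

A translated search then compares each of these protein strings with protein sequences, so a coding region can be found whichever strand and frame it lies in.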

http://www.ncbi.nlm.nih.gov/BLAST/

A pairwise alignment implementation of these programs is available at:

http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html


Query-Anchored Alignments (master Slave)

Clustalw:

Is a multiple alignment program. Every sequence is aligned to every other one.

Blast:

NOT a multiple alignment program, but may display query-anchored multiple pairwise alignments that look like a multiple alignment, although all the sequences are only aligned to the first sequence!

Gap in subject sequence

This column is NOT aligned together. It is displayed there for convenience.

Gaps in the query mean NOTHING can be aligned to it. Gaps may optionally be shown (flat view), or the entire column omitted.


BLAST and BLAST2SEQUENCES

Exercises: Use Entrez to find the protein sequences with LOCUS names

FHIT_HUMAN

HNT2_YEAST

Use clustalw to align these two sequences and, WITHOUT LOSING THAT RESULT SCREEN, use pairwise blast to align these two sequences as well.

EXERCISE: Try to reproduce the example of the clustalW alignment (the order of input sequences is not important).


References

Textbook: "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins", edited by Andy D. Baxevanis and B.F. Ouellette.

Readings: chapters 7, 8, 9

http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html
