Intro Sequence comparisons Visualization Alignments Scoring Algorithms

Last time

• Introduction
• What is Bioinformatics?
• Databases in Bioinformatics

Today: Sequence comparisons

• Visualisation
• Different objectives
• Pairwise alignments

Sequence comparisons: Goals

• What are the similarities?
• Local similarities: domains and motifs
• What is variable?
• Identify positions: basis for evolutionary studies
• Understand structural similarities
• Determine ancestry

Homology

• Definition: Homology = common ancestry
• Principle: Similarity ⇒ homology
• Quote: "These sequences are somewhat homologous". Bad! Similarity ≠ homology: homology is all-or-nothing, while similarity comes in degrees.
• Correct: "These sequences are somewhat similar".

Important questions

• When are two sequences significantly similar?

• How do we evaluate similarity?

Data

• DNA: genes, genomes, non-coding DNA, etc.
• Codons
• RNA
• Peptides

Idea of dotplots

[Figure: dotplot of QVASKINTNE (horizontal) against SVATKIYMNE (vertical), with a dot at each pair of identical residues]

Put a dot where residues are identical, then filter out randomness.
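The dotplot idea can be sketched in a few lines of Python. This is only an illustration: the two ten-residue fragments are read off the slide's axes, and the diagonal window filter (`window`, `min_matches`) is one simple, assumed way of "filtering out randomness", not necessarily the one used for the lecture's figures.

```python
def dot_matrix(horizontal, vertical):
    """One row per residue of `vertical`; True where the two residues are identical."""
    return [[h == v for h in horizontal] for v in vertical]


def filter_windows(dots, window=3, min_matches=2):
    """Keep only dots that lie in a diagonal window with at least `min_matches` dots.

    Assumed filter for illustration: isolated dots are treated as random matches.
    """
    rows, cols = len(dots), len(dots[0])
    kept = [[False] * cols for _ in range(rows)]
    for i in range(rows - window + 1):
        for j in range(cols - window + 1):
            hits = sum(dots[i + k][j + k] for k in range(window))
            if hits >= min_matches:
                for k in range(window):
                    if dots[i + k][j + k]:
                        kept[i + k][j + k] = True
    return kept


def render(dots, horizontal, vertical):
    """Plain-text rendering with '*' for a dot."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dots):
        lines.append(v + " " + " ".join("*" if d else " " for d in row))
    return "\n".join(lines)


H, V = "QVASKINTNE", "SVATKIYMNE"  # fragments from the slide's axes
raw = dot_matrix(H, V)
print(render(raw, H, V))
print()
print(render(filter_windows(raw), H, V))
```

With these two fragments the unfiltered plot also shows a few off-diagonal dots; after filtering, only dots on the main diagonal remain.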

Dotplots in practice

[Figure: dotplot of PttMAP20 (horizontal) vs. OsMAP20 (vertical)]

What happened here?

s1: A B C D
s2: A C B D

Genomic dotplot

Many inversions around origin and termini of replication.

Visualizing with alignment

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRR
             S+ +PK  + ++ +P   F+LHT +RA+KRA FNY VA+KI  NE  +R
PttMAP20  43 SKVAPKPFAKENTKPQE-FKLHTGQRALKRAMFNYSVATKIYMNEQQKR

OsMAP20  118 FEEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKE
               E++ K+IEE E++ MRKEMV +AQLMP FD+PF PQRS+RPLTVP+E
PttMAP20  91 QIERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPRE

OsMAP20  167 PSF
             PSF
PttMAP20 140 PSF

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ 91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153

Alignments

• Def: A pairwise alignment is a pairing of symbols between two sequences.
• Global alignment: Involves whole sequences.
• Local alignment: Involves parts of sequences.
• Semiglobal or ends-free alignment: Ignore "overhang" in similar sequences with different lengths.

Global vs local

Local alignment:

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ 91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153

Global alignment:

OsMAP20    1 MEK--TRKATSPKSSMTSSTGPKSPVRNGGSPPHKKSTSEFRGRKNESQI 48
             |||  |:.|.......:|.:.|.|....|.:....|..
PttMAP20   1 MEKAHTKSALKKLVKASSQSAPWSNAARGMAKDDLKDP------------ 38

OsMAP20   49 FRKGGQDSITLDESKRRSPTSQTSPKRSSPKHEQPLSYFRLHTEERAIKR 98
                      ..|:||       .:||..:.::.:| ..|:|||.:||:||
PttMAP20  39 ---------LYDKSK-------VAPKPFAKENTKP-QEFKLHTGQRALKR 71

OsMAP20   99 AGFNYQVASKINTNEIIRRFEEKLSKVIEEREIKMMRKEMVHKAQLMPAF 148
             |.|||.||:||..||..:|..|::.|:|||.|::.||||||.:|||||.|
PttMAP20  72 AMFNYSVATKIYMNEQQKRQIERIQKIIEEEEVRTMRKEMVPRAQLMPYF 121

OsMAP20  149 DKPFHPQRSTRPLTVPKEPSF--LRLKC--CIGGEFHRHFCYNA------ 188
             |:||.||||:||||||:||||  :..||  ||..:...::..:|
PttMAP20 122 DRPFFPQRSSRPLTVPREPSFHMVNSKCWSCIPEDELYYYFEHAHPHDHA 171

OsMAP20  189 -KAIK 192
              |.:|
PttMAP20 172 WKPVK 176

More terminology

• Insertion
• Deletion
• Indel: insertion or deletion, when we don't know which
• Gap: indel in an alignment
• Indel character: usually "-"

1  MEK--TRKATSPKSSMTSSTGPKSPVRNGGSPPHKKSTSEFRGRKNESQI 48
   |||  |:.|.......:|.:.|.|....|.:....|..
1  MEKAHTKSALKKLVKASSQSAPWSNAARGMAKDDLKDP------------ 38

49 FRKGGQDSITLDESKRRSPTSQTSPKRSSPKHEQPLSYFRLHTEERAIKR 98
            ..|:||       .:||..:.::.:| ..|:|||.:||:||
39 ---------LYDKSK-------VAPKPFAKENTKP-QEFKLHTGQRALKR 71

Choosing alignment?

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRR
             S+ +PK  + ++ +P   F+LHT +RA+KRA FNY VA+KI  NE  +R
PttMAP20  43 SKVAPKPFAKENTKPQE-FKLHTGQRALKRAMFNYSVATKIYMNEQQKR

OsMAP20  118 FEEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKE
               E++ K+IEE E++ MRKEMV +AQLMP FD+PF PQRS+RPLTVP+E
PttMAP20  91 QIERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPRE

OsMAP20  167 PSF
             PSF
PttMAP20 140 PSF

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ 91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153

Principle: Identity

• Def: The identity in an alignment is the fraction of identical paired symbols.
• Early selection criterion: Choose the alignment with the highest identity.

Here: 62/112 ≈ 55% identity

OsMAP20   69 SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF 118
             |:.:||..:.::.:| ..|:|||.:||:|||.|||.||:||..||..:|.
PttMAP20  43 SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ 91

OsMAP20  119 EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS 168
             .|::.|:|||.|::.||||||.:|||||.||:||.||||:||||||:|||
PttMAP20  92 IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS 141

OsMAP20  169 F--LRLKC--CI 176
             |  :..||  ||
PttMAP20 142 FHMVNSKCWSCI 153
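The identity figure on this slide can be recomputed directly from the two aligned rows. The sketch below assumes that gap columns count towards the alignment length but never as identical pairs, which is the reading that matches the slide's 62 out of 112. The function name `count_identical` is introduced here for illustration.

```python
def count_identical(aligned_a, aligned_b, gap="-"):
    """Return (number of identical pairs, alignment length) for two aligned rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    return matches, len(aligned_a)


# The three blocks of the local alignment shown on the slide, joined per sequence.
os_map20 = (
    "SQTSPKRSSPKHEQPLSYFRLHTEERAIKRAGFNYQVASKINTNEIIRRF"
    "EEKLSKVIEEREIKMMRKEMVHKAQLMPAFDKPFHPQRSTRPLTVPKEPS"
    "F--LRLKC--CI"
)
ptt_map20 = (
    "SKVAPKPFAKENTKP-QEFKLHTGQRALKRAMFNYSVATKIYMNEQQKRQ"
    "IERIQKIIEEEEVRTMRKEMVPRAQLMPYFDRPFFPQRSSRPLTVPREPS"
    "FHMVNSKCWSCI"
)

matches, length = count_identical(os_map20, ptt_map20)
print(f"{matches}/{length} = {matches / length:.0%} identity")
```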

Scoring an alignment

• Identity loses information on similarity
• Better: assign a score to every pair of symbols, $s(x, y) = c$. Example for DNA:

s    A   T   G   C
A    2  -1   1  -1
T   -1   2  -1   1
G    1  -1   2  -1
C   -1   1  -1   2

• Indel scores: $s(x, -) = s(-, x) \stackrel{?}{=} -1$
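The DNA table can be written as a small scoring function. One reading of the table, which the slide itself does not state, is that identical bases score 2, substitutions within the purines (A, G) or within the pyrimidines (C, T) score 1, and all other substitutions score -1. The sketch below encodes that assumed rule together with the tentative indel score of -1; the test that accompanies it checks the rule against the table entry by entry.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
GAP = "-"


def s(x, y):
    """Score for one aligned pair of DNA symbols, following the slide's example table."""
    if x == GAP or y == GAP:
        return -1  # tentative indel score from the slide
    if x == y:
        return 2   # identical bases
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return 1 if same_class else -1  # assumed rule: same chemical class scores 1


# Print the table in the same row and column order as the slide.
bases = "ATGC"
print("s  " + "  ".join(f"{b:>2}" for b in bases))
for row in bases:
    print(f"{row}  " + "  ".join(f"{s(row, col):>2}" for col in bases))
```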

Scoring an alignment

• Alignment $\bar{x}, \bar{y}$ from sequences $x$ and $y$. E.g.: $x$ = AAGTT, $y$ = AATT, alignment is

$\bar{x}$: AAGTT
$\bar{y}$: AA-TT

• Alignment score is

\[ S(\bar{x}, \bar{y}) = \sum_{i=1}^{|\bar{x}|} s(\bar{x}_i, \bar{y}_i) \]

• Here:

\[ S(\bar{x}, \bar{y}) = s(A, A) + s(A, A) + s(G, -) + s(T, T) + s(T, T) \]
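To put a number on this example, the sum can be evaluated with the example DNA table and the tentative indel score of -1 from the earlier "Scoring an alignment" slide. The names `DNA_SCORES`, `pair_score` and `alignment_score` are introduced here for illustration only.

```python
# Example DNA score table from the earlier slide, row symbol first.
DNA_SCORES = {
    "A": {"A": 2, "T": -1, "G": 1, "C": -1},
    "T": {"A": -1, "T": 2, "G": -1, "C": 1},
    "G": {"A": 1, "T": -1, "G": 2, "C": -1},
    "C": {"A": -1, "T": 1, "G": -1, "C": 2},
}
GAP_SCORE = -1  # tentative indel score s(x, -) = s(-, x) from the slide


def pair_score(a, b):
    """s(a, b) for one alignment column."""
    if a == "-" or b == "-":
        return GAP_SCORE
    return DNA_SCORES[a][b]


def alignment_score(aligned_x, aligned_y):
    """S = sum of s over all columns of the alignment."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned rows must have the same length")
    return sum(pair_score(a, b) for a, b in zip(aligned_x, aligned_y))


print(alignment_score("AAGTT", "AA-TT"))  # 2 + 2 - 1 + 2 + 2
```

Under these scores the alignment above gets 2 + 2 - 1 + 2 + 2 = 7.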

How do we choose an alignment?

• Want to choose the best global alignment
• Many alignments
• Given $x = x_1 x_2 \cdots x_m$ and $y = y_1 y_2 \cdots y_n$, find $\bar{x}, \bar{y}$ that maximize the score $S(\bar{x}, \bar{y})$.
• Idea: Find the best way of ending the alignment
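To get a feeling for "many alignments", one can count them. The sketch below assumes an alignment is counted as a sequence of columns, each being a pair, a gap in $y$, or a gap in $x$, so two alignments that differ only in the order of an adjacent insertion and deletion count as different. Under that assumption every alignment ends in one of three ways, which gives a three-term recursion (these counts are known as Delannoy numbers).

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m, n):
    """Number of global alignments (as column sequences) of sequences of length m and n."""
    if m == 0 or n == 0:
        return 1  # only gap columns remain: exactly one way
    return (
        count_alignments(m - 1, n - 1)  # last column pairs two symbols
        + count_alignments(m - 1, n)    # last column: symbol of x against a gap
        + count_alignments(m, n - 1)    # last column: gap against symbol of y
    )


for length in (1, 2, 3, 10):
    print(length, count_alignments(length, length))
```

Already for two sequences of length 10 there are more than eight million alignments, so trying them all is not practical for real sequences.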

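How to end alignment: alternatives

One of:

\[ \begin{array}{ll} x_1 \cdots x_{m-1} & x_m \\ y_1 \cdots y_{n-1} & y_n \end{array} \qquad M_{m-1,n-1} + s(x_m, y_n) \]

or

\[ \begin{array}{ll} x_1 \cdots x_{m-1} & x_m \\ y_1 \cdots y_n & - \end{array} \qquad M_{m-1,n} + s(x_m, -) \]

or

\[ \begin{array}{ll} x_1 \cdots x_m & - \\ y_1 \cdots y_{n-1} & y_n \end{array} \qquad M_{m,n-1} + s(-, y_n) \]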

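A recursion for the maximum alignment score

Note: for global alignment

\[ M_{0,0} = 0 \]

\[ M_{m,n} = \max \begin{cases} M_{m-1,n-1} + s(x_m, y_n) & m > 0,\ n > 0 \\ M_{m-1,n} + s(x_m, -) & m > 0,\ n \ge 0 \\ M_{m,n-1} + s(-, y_n) & m \ge 0,\ n > 0 \end{cases} \]

We get:

\[ M_{m,n} = \max_{\bar{x}, \bar{y}} S(\bar{x}, \bar{y}) \]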
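The recursion can be transcribed almost literally into a memoized function. This is a sketch for short sequences only (it recurses, so long inputs would hit Python's recursion limit). The scoring function `s` is an assumption that reproduces the example DNA table and the tentative indel score of -1 from the scoring slides.

```python
from functools import lru_cache


def s(a, b):
    """Assumed example scores: DNA table from the scoring slide, indel score -1."""
    if a == "-" or b == "-":
        return -1
    if a == b:
        return 2
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else -1


def max_global_score(x, y, score=s):
    """Maximum global alignment score, computed exactly as in the recursion."""

    @lru_cache(maxsize=None)
    def M(m, n):
        if m == 0 and n == 0:
            return 0
        options = []
        if m > 0 and n > 0:
            options.append(M(m - 1, n - 1) + score(x[m - 1], y[n - 1]))
        if m > 0:
            options.append(M(m - 1, n) + score(x[m - 1], "-"))
        if n > 0:
            options.append(M(m, n - 1) + score("-", y[n - 1]))
        return max(options)

    return M(len(x), len(y))


print(max_global_score("AAGTT", "AATT"))
```

Python strings are 0-indexed, so $x_m$ in the formula is `x[m - 1]` in the code.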

Computing $M_{m,n}$

• Keep $M_{i,j}$ in a table
• Table + Recursion = Dynamic Programming
• Needleman-Wunsch algorithm
• $mn$ elements in table ⇒ Time complexity is ∼ $mn$.
• When filling the table, note which alternative gave the maximum.
• Backtracking for retrieving the alignment.
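The table-filling and backtracking steps can be sketched as follows. This is a minimal illustration rather than a reference implementation: the scoring function `s` is assumed to be the example DNA table with indel score -1, ties are broken in favour of the diagonal move, and only one optimal alignment is returned.

```python
def s(a, b):
    """Assumed example scores: DNA table from the scoring slide, indel score -1."""
    if a == "-" or b == "-":
        return -1
    if a == b:
        return 2
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else -1


def needleman_wunsch(x, y, score=s):
    """Return (best score, aligned x, aligned y) for a global alignment."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]        # M[i][j]: best score for x[:i], y[:j]
    back = [[None] * (n + 1) for _ in range(m + 1)]  # which alternative gave the maximum
    for i in range(1, m + 1):
        M[i][0], back[i][0] = M[i - 1][0] + score(x[i - 1], "-"), "up"
    for j in range(1, n + 1):
        M[0][j], back[0][j] = M[0][j - 1] + score("-", y[j - 1]), "left"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [
                (M[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (M[i - 1][j] + score(x[i - 1], "-"), "up"),
                (M[i][j - 1] + score("-", y[j - 1]), "left"),
            ]
            M[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Backtracking: follow the noted alternatives from the last cell to the first.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(ax)), "".join(reversed(ay))


best, row_x, row_y = needleman_wunsch("AAGTT", "AATT")
print(best)
print(row_x)
print(row_y)
```

The two nested loops visit each table cell once and do constant work per cell, which is where the ∼ $mn$ running time comes from.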

DP and backtracking

From Eddy, Nature Biotech, 2004

DP for local alignments

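• Smith-Waterman algorithm
• Allow "restarting" from zero.

\[ M_{0,0} = 0 \]

\[ M_{m,n} = \max \begin{cases} M_{m-1,n-1} + s(x_m, y_n) & m > 0,\ n > 0 \\ M_{m-1,n} + s(x_m, -) & m > 0,\ n \ge 0 \\ M_{m,n-1} + s(-, y_n) & m \ge 0,\ n > 0 \\ 0 & \leftarrow \text{Here!} \end{cases} \]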
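A minimal sketch of the local variant, under the same assumed example scores (DNA table, indel score -1). Two details go beyond what the slide states and are standard for Smith-Waterman: with negative indel scores the first row and column of the table stay at zero, and the local alignment score is the largest entry anywhere in the table, not the entry in the last cell.

```python
def s(a, b):
    """Assumed example scores: DNA table from the scoring slide, indel score -1."""
    if a == "-" or b == "-":
        return -1
    if a == b:
        return 2
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else -1


def smith_waterman(x, y, score=s):
    """Return (best local score, i, j), where the best local alignment ends at x[:i], y[:j]."""
    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                0,  # "restart": drop everything before this cell
                M[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                M[i - 1][j] + score(x[i - 1], "-"),
                M[i][j - 1] + score("-", y[j - 1]),
            )
            if M[i][j] > best:
                best, best_i, best_j = M[i][j], i, j
    return best, best_i, best_j


print(smith_waterman("AAAA", "TTAATT"))
```

Backtracking for the local alignment would start at the reported end cell and stop at the first cell whose value is zero.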
