28
CS3000: Algorithms & Data Jonathan Ullman Lecture 9: Dynamic Programming: Edit Distance, RNA Folding Oct 5, 2018

CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

CS3000:Algorithms&DataJonathanUllman

Lecture9:• DynamicProgramming:EditDistance,RNAFolding

Oct5,2018

Page 2: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

OfficeHourssched2

IIIsoso.IE oos7To s oo 31 s

Page 3: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

EditDistanceAlignments

Page 4: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DistanceBetweenStrings

• Autocorrectworksbyfindingsimilarstrings

• ocurrance andoccurrence seemsimilar,butonlyifwedefinesimilaritycarefully

ocurranceoccurrence

oc urranceoccurrenceit

Page 5: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

EditDistance/Alignments

• Giventwostrings! ∈ Σ$, & ∈ Σ',theeditdistanceisthenumberofinsertions,deletions,andswapsrequiredtoturn! into&.

• Givenanalignment,thecostisthenumberofpositionswherethetwostringsdon’tagree

o c u r r a n c eo c c u r r e n c e

I

O OyAHou gaps but not

transpositions

cost of the alignment is the of columns where too

symbols dont agree

Page 6: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

AsktheAudience

• Whatistheminimumcostalignmentofthestringssmitten andsitting

editdistance

Edit Dist 3

s m i t t e n

s i t t i n gL 2 3

smitten sitter sittin sitting

Page 7: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

EditDistance/Alignments

• Input: Twostrings! ∈ Σ$, & ∈ Σ'• Output: Theminimumcostalignmentof! and&• EditDistance=costoftheminimumcostalignment

on goesin theAnand

Page 8: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• Considertheoptimal alignmentof!, &• Threechoicesforthefinalcolumn• CaseI:onlyuse! (!$,− )• CaseII:onlyuse& (−, &' )• CaseIII:useonesymbolfromeach(!$, &' )

optimalalignment optimalalignment optimalalignment

Xl Xn 1 Xn X1 Xn Xi Xu 1 Xn

Y Jm Yi yn i ym Yi Ym i ym

To determine which case is best solve alignmenton a smaller X Y

Page 9: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• Considertheoptimal alignmentof!, &• CaseI:onlyuse! (!$, − )• deletion+optimalalignmentof!):$+), &):'

• CaseII:onlyuse& (−, &' )• insertion+optimalalignmentof!):$, &):'+)

• CaseIII:useonesymbolfromeach(!$, &' )• If!$ = &':optimalalignmentof!):$+), &):'+)• If!$ ≠ &':mismatch+opt.alignmentof!):$+), &):'+)

Page 10: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• ./0 1, 2 =costofopt.alignmentof!):3 and&):4• CaseI:onlyuse! (!3, − )• CaseII:onlyuse& (−, &4 )• CaseIII:useonesymbolfromeach(!3, &4 )

f nti ma subproblems

I 1 OPT i l jl t OPT i j 1

OPT i t j t t 1 if Xi y0PT i jope f j M if Xi yj

opt iI mm OPT i bj OPT i j t OPT i i j i if yjmm It OPT fit HOPT i j D OPTCi t j D x yj

Page 11: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• ./0 1, 2 =costofopt.alignmentof!):3 and&):4• CaseI:onlyuse! (!3, − )• CaseII:onlyuse& (−, &4 )• CaseIII:useonesymbolfromeach(!3, &4 )

Recurrence:

OPT 8, 9 = : 1 + min OPT 8 − 1, 9 , OPT 8, 9 − 1 , OPT(8 − 1, 9 − 1)min{1 + DEF 8 − 1, 9 , 1 + OPT 8, 9 − 1 , OPT(8 − 1, 9 − 1)}

BaseCases:OPT 8, 0 = 8,OPT 0, 9 = 9

Page 12: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

Examplex = perty = beast

- b e a s t-pert

Ogl 2 3 4

51I 2 3 4

522RI B 2 3 4

3 3 2 2 3q4

4 4 3 3theft

Page 13: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

FindingtheAlignment

• ./0 1, 2 =costofopt.alignmentof!):3 and&):4• CaseI:onlyuse! (!3, − )• CaseII:onlyuse& (−, &4 )• CaseIII:useonesymbolfromeach(!3, &4 )

Page 14: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

EditDistance(“Bottom-Up”)

// All inputs are global varsFindOPT(n,m):M[0,j]←j, M[i,0]←i

for (i= 1,…,n):for (j = 1,…,m):if (xi = yj):M[i,j] = min{1+M[i-1,j],1+M[i,j-1],M[i-1,j-1]

elseif (xi != yj):M[i,j] = 1+min{M[i-1,j],M[i,j-1],M[i-1,j-1]}

return M[n,m]

Ohm entries x Oli perentry O nm time

01hm space just for the table

Page 15: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

AsktheAudience

• Supposeinserting/deletingcostsJ > L andswappingM ↔ O costsPM,O > L• Writearecurrenceforthemin-costalignment

Page 16: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

EditDistanceSummary

• Computetheeditdistance,ormin-costalignmentbetweentwostringsintime/spaceD QR

• DynamicProgramming:• Decidethefinalpairofsymbolsinthealignment

• Spacecanbeprohibitiveinpractice• ComputeeditdistanceinspaceD min Q,R• CanalsofindalignmentinspaceD Q +R usingacleverdivide-and-conqueralgorithm!

Page 17: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

RNAFolding

Page 18: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DNA

• DNAisastringoffourbases{A,C,G,T}• TwocomplementarystrandsofDNAsticktogetherandformadoublehelix• A—T andC—G arecomplementarypairs

Page 19: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

RNAFolding

• RNAisastringoffourbases{A,C,G,U}• AsingleRNAstrandstickstoitselfandfoldsintocomplexstructures• A—U andC—G arecomplementarypairs

Page 20: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

RNAFolding

• RNAstrandwilltrytominimizeenergy (formthemostbonds)subjecttoconstraints

c

crossing

i

Iotcomplementary

sharp turn

Page 21: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

RNAFolding

• RNAisastringofbasesOS,… , OU ∈ V, W, X, Y• ThestructureisgivenbyasetofbondsZ consistingofpairs 8, 9 with8 < 9• (Complements)Only\ − ] or^ − _ canbepaired• (Matching)Nobase`3 isintwopairsinZ• (NoSharpTurns)If 8, 9 ∈ Z,then8 < 9 − 4• (Non-Crossing)If 8, 9 , b, ℓ ∈ Z thenitcannotbethecasethat8 < b < 9 < ℓf

f IT TET tuft 5 j

Page 22: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

RNAFolding

• Input:RNAsequence`), … , `$ ∈ \, ^, _, ]• Output:AsetofpairsZ ⊆ 1,… , Q × 1,… , Q• Goal:maximizethesizeofZ• (Complements)Only\ − ] or^ − _ canbepaired• (Matching)Nobase`3 isintwopairsinZ• (NoSharpTurns)If 8, 9 ∈ Z,then8 < 9 − 4• (Non-Crossing)If 8, 9 , b, ℓ ∈ Z thenitcannotbethecasethat8 < b < 9 < ℓ

Page 23: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• LetD betheoptimalsetofpairsfor`) ⋯`$• Case1: Q pairswithnothinginD

• Case2:Q pairswithsomeg < Q − 4 inD

Thes O is the opt set of pairs for b sbn

Then 0 is It n t opt set of pans for bee bopt set of pairs for b be

non crossing

Moon an l n

Page 24: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• LetD3,4 betheoptimalsetofpairsfor`3 ⋯ 4̀• Case1: 9 pairswithnothinginD3,4

• Case2:9 pairswithsomeg < 9 − 4 inD3,4

Page 25: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• LetOPT 8, 9 betheopt.number ofpairsfor`3 ⋯ 4̀• Case1: 9 pairswithnothinginD3,4

• Case2t:9 pairswithg < 9 − 4 inD3,4

OPT i j OPT i j l

OPT i j It OPT i t t OPT ttt j MConsider

any ts 1 tej 4 and bigbj are complements

1OPT I t l OPTH11 j l

Page 26: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

DynamicProgramming

• LetOPT 8, 9 betheopt.number ofpairsfor`3 ⋯ 4̀• Case1: 9 pairswithnothinginD3,4• Case2t:9 pairswithg < 9 − 4 inD3,4Recurrence:OPT 8, 9= max OPT 8, 9 − 1 ,max OPT 8, g − 1 + OPT g + 1, 9 − 1

BaseCases:OPT 8, 9 = 0 if8 ≥ 9 − 4

Maximumoverallg suchthat• 8 ≤ g < 9 − 4• `l, 4̀ arecompatiblebases

I

Page 27: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

FillingtheTable

Recurrence:OPT 8, 9 = max OPT 8, 9 − 1 , maxmnoopqrsl OPT 8, g − 1 + OPT g + 1, 9 − 1

6 7 8 j=9

4 0 0 0

3 0 0

2 0

i=1

Sequence:\^^__]\_]

r's

n 5 i entries

O n perentry

Page 28: CS3000: Algorithms & Data Jonathan Ullman · Edit Distance / Alignments •Given two strings !∈Σ$,&∈Σ', the edit distance is the number of insertions, deletions, and swaps required

RNAFoldingSummary

• ComputetheoptimalRNAfolding intimeD QtandspaceD Qu

• DynamicProgramming:• Decideonanoptimalpair`l − `$• RemainingRNAistwonon-overlappingpieces• Addingvariables: onesubproblemforeachinterval

• Non-crossing andmatching arecritical• Thinkabouthowthedynamicprogrammingalgorithmchangesifweremoveeachoftheconditions