15.4 Longest common subsequence - TUNIelomaa/teach/AADS-2016-7.pdf• It is not alongest common subsequence (LCS) of :and ; • The sequence $, %, $, #is also common to both :and ;and

10/12/2016

1

15.4 Longest common subsequence

• Biological applications often need to comparethe DNA of two (or more) different organisms

• A strand of DNA consists of a string of moleculescalled bases, where the possible bases areAdenine, Guanine, Cytosine, and Thymine

• We express a strand of DNA as a string over thealphabet {A,C,G,T}

• E.g., the DNA of two organisms may be– = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA– = GTCGTTCGGAATGCCGTTGCTCTGTAAA

12-Oct-16MAT-72006 AADS, Fall 2016 328

• By comparing two strands of DNA we determinehow “similar” they are, as some measure of howclosely related the two organisms are

• We can de ne similarity in many different ways• E.g., we can say that two DNA strands are

similar if one is a substring of the other– Neither nor is a substring of the other

• Alternatively, we could say that two strands aresimilar if the number of changes needed to turnone into the other is small

12-Oct-16MAT-72006 AADS, Fall 2016 329

10/12/2016

2

• Yet another way to measure the similarity ofand is by nding a third strand in which thebases in appear in each of and• these bases must appear in the same order, but not

necessarily consecutively• The longer the strand we can nd, the more

similar and are• In our example, the longest strand is

– = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA– = GTCGTTCGGAATGCCGTTGCTCTGTAAA– = GTCGTCGGAAGCCGGCCGAA

12-Oct-16MAT-72006 AADS, Fall 2016 330

• Formalize this notion of similarity as the longest-common-subsequence problem

• A subsequence is just the given sequence withzero or more elements left out

• Formally, given a sequence = , , … , ,another sequence = , , … , is asubsequence of if there exists a strictlyincreasing sequence , , … , of indices ofsuch that for all = 1,2, … , , we have =

• For example, = , , , is a subsequenceof = , , , , , , with correspondingindex sequence 2,3,5,7

12-Oct-16MAT-72006 AADS, Fall 2016 331

10/12/2016

3

• A sequence is a common subsequence ofand if is a subsequence of both and

• For example, if = , , , , , , and =, , , , , , the sequence , , is a

common subsequence of both and• It is not a longest common subsequence

(LCS) of and• The sequence , , , is also common to

both and and has length 4• This sequence is an LCS of and , as is

, , , ; and have no commonsubsequence of length 5 or greater

12-Oct-16MAT-72006 AADS, Fall 2016 332

• In longest-common-subsequence problem, we aregiven = , , … , and = , , … , and wishto nd a max-length common subsequence of and

Step 1: Characterizing a longest common subsequence• In a brute-force approach, we would enumerate all

subsequences of and check each of them to seewhether it is also a subsequence of , keeping track ofthe longest subsequence we nd

• Each subsequence of corresponds to a subset of theindices 1,2, … , of

• Because has 2 subsequences, this approachrequires exponential time, making it impractical for longsequences

12-Oct-16MAT-72006 AADS, Fall 2016 333

10/12/2016

4

• The LCS problem has an optimal-substructureproperty, however, as the following theoremshows

• The natural classes of subproblems correspondto pairs of “pre xes” of the two input sequences

• Precisely, given a sequence = , , … , ,we de ne the th pre x of , for = 0,1, … , , as

= , , … ,• For example, if = , , , , , , , then =

, , , and is the empty sequence

12-Oct-16MAT-72006 AADS, Fall 2016 334

Theorem 15.1 (Optimal substructure of LCS)Let = , , … , and = , , … , besequences, and let = , , … , be any LCSof and .1. If = , then = = and is an

LCS of and .2. If , then implies that is an

LCS of and .3. If , then implies that is an

LCS of and .

12-Oct-16MAT-72006 AADS, Fall 2016 335

10/12/2016

5

Proof (1) If , then we could append =to to obtain a common subsequence of and

of length + 1, contradicting the supposition thatis a LCS of and . Thus, we must have =

= . Now, the pre x is a length-( 1)common subsequence of and . We wishto show that it is an LCS. Suppose for the purposeof contradiction that there exists a commonsubsequence of and with lengthgreater than 1. Then, appending = toproduces a common subsequence of and

whose length is greater than , which is acontradiction.

12-Oct-16MAT-72006 AADS, Fall 2016 336

(2) If , then is a common subsequence ofand . If there were a common subsequence

with length greater than , then would alsobe a common subsequence of and ,contradicting the assumption that is an LCS ofand .(3) The proof is symmetric to (2).

• Theorem 15.1 tells us that an LCS of twosequences contains within it an LCS of pre xesof the two sequences

• Thus, the LCS problem has an optimal-substructure property

12-Oct-16MAT-72006 AADS, Fall 2016 337

10/12/2016

6

Step 2: A recursive solution• We examine either one or two subproblems

when nding an LCS of and• If = , we nd an LCS of and• Appending = yields an LCS of and• If , then we (1) nd an LCS of and

and (2) nd an LCS of and• Whichever of these two LCSs is longer is an

LCS of and• These cases exhaust all possibilities, and we

know that one of the optimal subproblemsolutions must appear within an LCS of and

12-Oct-16MAT-72006 AADS, Fall 2016 338

• To nd an LCS of and , we may need to ndthe LCSs of and and of and

• Each subproblem has the subsubproblem ofnding an LCS of and

• Many other subproblems share subsubproblems• As in the matrix-chain multiplication, recursive

solution to the LCS problem involves arecurrence for the value of an optimal solution

• Let us de ne [ , ] to be the length of an LCS ofthe sequences and

• If either = 0 or = 0, one of the sequences haslength 0, and so the LCS has length 0

12-Oct-16MAT-72006 AADS, Fall 2016 339

10/12/2016

7

• The optimal substructure of the LCS problem gives

, =0 if = 0 or = 0

1, 1 + 1 if , > 0 and =max( , 1 , 1, ) if , > 0 and

• Observe that a condition in the problem restricts whichsubproblems we may consider

• When = , we consider nding an LCS of and

• Otherwise, we instead consider the two subproblems ofnding an LCS of and and of and

• In the previous dynamic-programming algorithms — forrod cutting and matrix-chain multiplication — we ruledout no subproblems due to conditions in the problem

12-Oct-16MAT-72006 AADS, Fall 2016 340

Step 3: Computing the length of an LCS• Since the LCS problem has only distinct

subproblems, we can use dynamic programming tocompute the solutions bottom up

• LCS-LENGTH stores the [ , ] values in [0. . , 0. . ],and it computes the entries in row-major order– I.e., the procedure lls in the rst row of from left to

right, then the second row, and so on• The procedure also maintains the table [1. . , 1. . ]• Intuitively, [ , ] points to the table entry

corresponding to the optimal subproblem solutionchosen when computing ,

• , contains the length of an LCS of and

12-Oct-16MAT-72006 AADS, Fall 2016 341

10/12/2016

8

LCS-LENGTH( , )1. .2. .3. let [1. . , 1. . ] and

[0. . , 0. . ] be newtables

4. for 1 to5. , 0 06. for 0 to7. 0, 08. for 1 to9. for 1 to

10. if =11. , 1, 1 + 112. [ , ”13. elseif [ 1, [ , 1]14. , [ 1, ]15. [ , ”16. else , [ , 1]17. , ”18. return and

Running time: ( )

12-Oct-16MAT-72006 AADS, Fall 2016 342

12-Oct-16MAT-72006 AADS, Fall 2016 343

The and tables computed by LCS-LENGTH on= , , , , , , and = , , , , ,

10/12/2016

9

Step 4: Constructing an LCS• The table returned by LCS-LENGTH enables us

to quickly construct an LCS of and• We simply begin at [ , ] and trace through the

table by following the arrows• Whenever we encounter a “ ” in entry , , it

implies that = is an element of the LCS thatLCS-LENGTH found

• With this method, we encounter the elements ofthis LCS in reverse order

• A recursive procedure prints out an LCS ofand in the proper, forward order

12-Oct-16MAT-72006 AADS, Fall 2016 344

• The square in row and column contains the value of, and the appropriate arrow for the value of [ , ]

• The entry 4 in [7,6] — the lower right-hand corner ofthe table — is the length of an LCS , , ,

• For , > 0, entry , depends only on whether =and the values in entries 1, , , 1 , and

1, 1 , which are computed before ,• To reconstruct the elements of an LCS, follow the [ , ]

arrows from the lower right-hand corner• Each “ ” on the shaded sequence corresponds to an

entry (highlighted) for which = is a member of anLCS

12-Oct-16MAT-72006 AADS, Fall 2016 345

10/12/2016

10

Improving the code

• Each , entry depends on only 3 other tableentries: 1, , , 1 , and 1, 1

• Given the value of , , we can determine in(1) time which of these three values was used

to compute , , without inspecting table• We can reconstruct an LCS in ( + ) time• The auxiliary space requirement for computing

an LCS does not asymptotically decrease, sincewe need space for the table anyway

12-Oct-16MAT-72006 AADS, Fall 2016 346

• We can, however, reduce the asymptotic spacerequirements for LCS-LENGTH, since it needsonly two rows of table at a time– the row being computed and the previous row

• This improvement works if we need only thelength of an LCS– if we need to reconstruct the elements of an LCS,

the smaller table does not keep enoughinformation to retrace our steps in ( + ) time

12-Oct-16MAT-72006 AADS, Fall 2016 347

10/12/2016

11

15.5 Optimal binary search trees

• We are designing a program to translate text• Perform lookup operations by building a BST with

words as keys and their equivalents as satellite data• We can ensure an (lg ) search time per occurrence by

using a RBT or any other balanced BST• A frequently used word may appear far from the root

while a rarely used word appears near the root• We want frequent words to be placed nearer the root• How do we organize a BST so as to minimize the

number of nodes visited in all searches, given that weknow how often each word occurs?

12-Oct-16MAT-72006 AADS, Fall 2016 348

• What we need is an optimal binary search tree• Formally, given a sequence = , , … , of

distinct sorted keys ( < < ), wewish to build a BST from these keys

• For each key , we have a probability that asearch will be for

• Some searches may be for values not in , sowe also have + 1 “dummy keys” , , … ,representing values not in

• In particular, represents all values less than, represents all values greater than

12-Oct-16MAT-72006 AADS, Fall 2016 349

10/12/2016

12

• For = 1, 2, … , 1, the dummy keyrepresents all values between and

• For each dummy key , we have a probabilitythat a search will correspond to

12-Oct-16MAT-72006 AADS, Fall 2016 350

• Each key is an internal node, and eachdummy key is a leaf

• Every search is either successful ( nds a key )or unsuccessful ( nds a dummy key ), and sowe have

+ = 1

• Because we have probabilities of searches foreach key and each dummy key, we candetermine the expected cost of a search in agiven BST

12-Oct-16MAT-72006 AADS, Fall 2016 351

10/12/2016

13

• Let us assume that the actual cost of a searchequals the number of nodes examined, i.e., thedepth of the node found by the search in + 1

• Then the expected cost of a search in ,E search cost in =

depth + 1 + (depth + 1)

= 1 + depth + depth ,

where depth denotes a node’s depth in tree12-Oct-16MAT-72006 AADS, Fall 2016 352

Node Depth Probability Contribution1 0.15 0.300 0.10 0.102 0.05 0.151 0.10 0.202 0.20 0.602 0.05 0.152 0.10 0.303 0.05 0.203 0.05 0.203 0.05 0.203 0.10 0.40

Total 2.80

12-Oct-16MAT-72006 AADS, Fall 2016 353

10/12/2016

14

• For a given set of probabilities, we wish to constructa BST whose expected search cost is smallest

• We call such a tree an optimal binary search tree• An optimal BST for the probabilities given has

expected cost 2.75• An optimal BST is not necessarily a tree whose

overall height is smallest• Nor can we necessarily construct an optimal BST

by always putting the key with the greatestprobability at the root

• The lowest expected cost of any BST with at theroot is 2.85

12-Oct-16MAT-72006 AADS, Fall 2016 354

12-Oct-16MAT-72006 AADS, Fall 2016 355

10/12/2016

15

Step 1: The structure of an optimal BST• Consider any subtree of a BST• It must contain keys in a contiguous range

, … , , for some• In addition, a subtree that contains keys , … ,

must also have as its leaves the dummy keys, … ,

• If an optimal BST has a subtree containingkeys , … , , then this subtree must beoptimal as well for the subproblem with keys

, … , and dummy keys , … ,

12-Oct-16MAT-72006 AADS, Fall 2016 356

• Given keys , … , , one of them, say , is theroot of an optimal subtree containing these keys

• The left subtree of the root contains the keys, … , (and dummy keys , … , )

• The right subtree contains the keys , … ,(and dummy keys , … , )

• As long as we– examine all candidate roots , where ,– and determine all optimal BSTs containing

, … , and those containing , … , ,we are guaranteed to nd an optimal BST

12-Oct-16MAT-72006 AADS, Fall 2016 357

10/12/2016

16

• Suppose that in a subtree with keys , … , , weselect as the root

• ’s left subtree contains the keys , … ,• Interpret this sequence as containing no keys• Subtrees, however, also contain dummy keys• Adopt the convention that a subtree containing

keys , … , has no actual keys but doescontain the single dummy key

• Symmetrically, if we select as the root, then’s right subtree contains no actual keys, but it

does contain the dummy key

12-Oct-16MAT-72006 AADS, Fall 2016 358

Step 2: A recursive solution• We pick our subproblem domain as nding an

optimal BST containing the keys , … , , where1, , and 1

• Let us de ne [ , ] as the expected cost ofsearching an optimal BST containing the keys

, … ,• Ultimately, we wish to compute [1, ]• The easy case occurs when = 1• Then we have just the dummy key• The expected search cost is , 1 =

12-Oct-16MAT-72006 AADS, Fall 2016 359

10/12/2016

17

• When > , we need to select a root fromamong , … , and make an optimal BST withkeys , … , as its left subtree and an optimalBST with keys , … , as its right subtree

• What happens to the expected search cost of asubtree when it becomes a subtree of a node?– Depth of each node increases by 1– Expected search cost of this subtree increases by

the sum of all the probabilities in it• For a subtree with keys , … , , let us denote

this sum of probabilities as, = +

12-Oct-16MAT-72006 AADS, Fall 2016 360

• Thus, if is the root of an optimal subtreecontaining keys , … , , we have

, = + , 1 + , 1+( + 1, + + 1, )

• Noting that, = , 1 + + ( + 1, )

we rewrite, = , 1 + + 1, + ( , )

• We choose the root that gives the lowestexpected search cost: , =

if = 1min , = , 1 + + 1, + ( , ) if

12-Oct-16MAT-72006 AADS, Fall 2016 361

10/12/2016

18

• The , values give the expected search costsin optimal BSTs

• To help us keep track of the structure of optimalBSTs, we de ne root , , for , to bethe index for which is the root of an optimalBST containing keys , … ,

• Although we will see how to compute the valuesof root , , we leave the construction of anoptimal binary search tree from these values asan exercise

12-Oct-16MAT-72006 AADS, Fall 2016 362

Step 3: Computing the expected search cost ofan optimal BST• We store , values in a table 1. . + 1,0. .• The rst index needs to run to + 1 because to

have a subtree containing only the dummy key, we need to compute and store + 1,

• The second index needs to start from 0 becauseto have a subtree containing only the dummykey , we need to compute and store 1,0

12-Oct-16MAT-72006 AADS, Fall 2016 363

10/12/2016

19

• We use only the entries , for which 1• We also use a table root , , for recording the

root of the subtree containing keys , … ,• This table uses only the entries• We also store the , values in a table

[1. . + 1,0. . ]• For the base case, we compute , 1 =• For , we compute

, = , 1 + +• Thus, we can compute the ( ) values of

, in (1) time each12-Oct-16MAT-72006 AADS, Fall 2016 364

OPTIMAL-BST( , , )1. let 1. . + 1,0. . , [1. . + 1,0. . ], root 1. . , 1. . be new tables2. for = 1 to + 13. , 1 =4. , 1 =5. for = 1 to6. for = 1 to + 17. = + 18. ,9. , = , 1 + +10. for = to11. = , 1 + + 1, + [ , ]12. if < [ , ]13. , =14. root , =15.return and root

12-Oct-16MAT-72006 AADS, Fall 2016 365

10/12/2016

20

12-Oct-16MAT-72006 AADS, Fall 2016 366

• The OPTIMAL-BST procedure takes ( ) time,just like MATRIX-CHAIN-ORDER

• Its running time is ( ), since its for loops arenested three deep and each loop index takes onat most values

• The loop indices in OPTIMAL-BST do not haveexactly the same bounds as those in MATRIX-CHAIN-ORDER, but they are within 1 in alldirections

• Thus, like MATRIX-CHAIN-ORDER, the OPTIMAL-BST procedure takes ( ) time

12-Oct-16MAT-72006 AADS, Fall 2016 367

10/12/2016

21

16 Greedy Algorithms

• Optimization algorithms typically go through asequence of steps, with a set of choices at each

• For many optimization problems, using dynamicprogramming to determine the best choices isoverkill; simpler, more ef cient algorithms will do

• A greedy algorithm always makes the choicethat looks best at the moment

• That is, it makes a locally optimal choice in thehope that this choice will lead to a globallyoptimal solution

12-Oct-16MAT-72006 AADS, Fall 2016 368

16.1 An activity-selection problem

• Suppose we have a set = , , … , ofproposed activities that wish to use a resource(e.g., a lecture hall), which can serve only oneactivity at a time

• Each activity has a start time and a nishtime , where <

• If selected, activity takes place during the half-open time interval [ , )

12-Oct-16MAT-72006 AADS, Fall 2016 369

10/12/2016

22

• Activities and are compatible if theintervals [ , ) and [ , ) do not overlap

• I.e., and are compatible if or• We wish to select a maximum-size subset of

mutually compatible activities• We assume that the activities are sorted in

monotonically increasing order of nish time:

12-Oct-16MAT-72006 AADS, Fall 2016 370

• Consider, e.g., the following set of activities:

• For this example, the subset , ,consists of mutually compatible activities

• It is not a maximum subset, however, since thesubset , , , is larger

• In fact, it is a largest subset of mutuallycompatible activities; another largest subset is

, , ,12-Oct-16MAT-72006 AADS, Fall 2016 371

1 2 3 4 5 6 7 8 9 10 111 3 0 5 3 5 6 8 8 2 124 5 6 7 9 9 10 11 12 14 16

10/12/2016

23

The optimal substructure of the activity-selection problem• Let be the set of activities that start after

nishes and that nish before starts• We wish to nd a maximum set of mutually

compatible activities in• Suppose that such a maximum set is , which

includes some activity• By including in an optimal solution, we are left

with two subproblems: nding mutuallycompatible activities in the set and ndingmutually compatible activities in the set

12-Oct-16MAT-72006 AADS, Fall 2016 372

• Let = and = , so that– contains the activities in that nish

before starts and– contains the activities in that start after

nishes• Thus, = , and so the

maximum-size set in consists of| = | + + 1 activities

• The usual cut-and-paste argument shows thatthe optimal solution must also includeoptimal solutions for and

12-Oct-16MAT-72006 AADS, Fall 2016 373

10/12/2016

24

• This suggests that we might solve the activity-selection problem by dynamic programming

• If we denote the size of an optimal solution forthe set by [ , ], then we would have therecurrence

, = , + , + 1• Of course, if we did not know that an optimal

solution for the set includes activity , wewould have to examine all activities in to ndwhich one to choose, so that

, =0 if

max , = , + , + 1 if

12-Oct-16MAT-72006 AADS, Fall 2016 374

Making the greedy choice• For the activity-selection problem, we need

consider only the greedy choice• We choose an activity that leaves the resource

available for as many other activities as possible• Now, of the activities we end up choosing, one of

them must be the rst one to nish• Choose the activity in with the earliest nish

time, since that leaves the resource available foras many of the activities that follow it as possible

• Activities are sorted in monotonically increasingorder by nish time; greedy choice is activity

12-Oct-16MAT-72006 AADS, Fall 2016 375

10/12/2016

25

• If we make the greedy choice, we only have to ndactivities that start after nishes

• < and is the earliest nish time of anyactivity no activity can have a nish time

• Thus, all activities that are compatible with activitymust start after nishes

• Let = : be the set of activities thatstart after nishes

• Optimal substructure: if is in the optimal solution,then an optimal solution to the original problemconsists of and all the activities in an optimalsolution to the subproblem

12-Oct-16MAT-72006 AADS, Fall 2016 376

Theorem 16.1 Consider any nonempty subproblem ,and let be an activity in with the earliest nish time.Then is included in some maximum-size subset ofmutually compatible activities of .

Proof Let be a max-size subset of mutually compatibleactivities in , and let be the activity in with theearliest nish time. If = , we are done, since is in amax-size subset of mutually compatible activities of .If , let the set = . The activitiesin are disjoint because the activities in are disjoint,is the rst activity in to nish, and . Since | | =| |, we conclude that is a maximum-size subset ofmutually compatible activities of and includes .

12-Oct-16MAT-72006 AADS, Fall 2016 377

10/12/2016

26

• We can repeatedly choose the activity thatnishes rst, keep only the activities compatible

with this activity, and repeat until no activitiesremain

• Moreover, because we always choose theactivity with the earliest nish time, the nishtimes of the activities we choose must strictlyincrease

• We can consider each activity just once overall,in monotonically increasing order of nish times

12-Oct-16MAT-72006 AADS, Fall 2016 378

A recursive greedy algorithm

RECURSIVE-ACTIVITY-SELECTOR( , , , )1. + 12. while and [ ] < [ ] // nd the rst

// activity in to nish3. + 14. if5. return RECURSIVE-ACTIVITY-

SELECTOR( , , , )6. else return

12-Oct-16MAT-72006 AADS, Fall 2016 379

10/12/2016

27

12-Oct-16MAT-72006 AADS, Fall 2016 380

An iterative greedy algorithm

GREEDY-ACTIVITY-SELECTOR( , )1. .2.3. 14. for 2 to5. if [ [ ]6.7.8. return

12-Oct-16MAT-72006 AADS, Fall 2016 381

10/12/2016

28

• The set returned by the callGREEDY-ACTIVITY-SELECTOR( , )

is precisely the set returned by the callRECURSIVE-ACTIVITY-SELECTOR( , , , )

• Both the recursive version and the iterativealgorithm schedule a set of activities in ( )time, assuming that the activities were alreadysorted initially by their nish times

12-Oct-16MAT-72006 AADS, Fall 2016 382

Documents

15.4 Longest common subsequence - TUNIelomaa/teach/AADS-2016-7.pdf• It is not alongest common subsequence (LCS) of :and ; • The sequence $, %, $, #is also common to both :and ;and