19
Greedy Algorithms for the Shortest Common Superstring Overview by Anton Nesterov Saint Petersburg State University Russia Original paper by •A. Frieze, Carnegie Mellon University, USA •W. Szpankowski, Purdue University, USA

Greedy Algorithms for the Shortest Common Superstring

  • Upload
    kaemon

  • View
    22

  • Download
    1

Embed Size (px)

DESCRIPTION

Greedy Algorithms for the Shortest Common Superstring. Overview by Anton Nesterov Saint Petersburg State University Russia. Original paper by A. Frieze, Carnegie Mellon University, USA W. Szpankowski, Purdue University, USA. Contents of the report. Description of the problem - PowerPoint PPT Presentation

Citation preview

Page 1: Greedy Algorithms for the Shortest Common Superstring

Greedy Algorithms for the Shortest Common Superstring

Overview byAnton Nesterov

Saint Petersburg State UniversityRussia

Original paper by•A. Frieze, Carnegie Mellon University, USA•W. Szpankowski, Purdue University, USA

Page 2: Greedy Algorithms for the Shortest Common Superstring

Contents of the report

• Description of the problem

• Formulation of main results

• Ideas of the proof

Page 3: Greedy Algorithms for the Shortest Common Superstring

Shortest common superstring Problem(SCS)

• Given a collection of strings:

• We want to find a superstring that contains each as a substring

nxxx ,,, 21

ix

Page 4: Greedy Algorithms for the Shortest Common Superstring

Example

• Three words:

fear nature

arena

• Superstring:

fearenature

Page 5: Greedy Algorithms for the Shortest Common Superstring

Algorithms

• It is known, that the problem is NP-hard.

• It is of a great interest to develop a good approximation algorithm.

• Some greedy algorithms will be presented

Page 6: Greedy Algorithms for the Shortest Common Superstring

Definitions

• We define an overlap and a special sum of two strings

},...,,{ 21 M

rxxxx 21

• Alphabet

}1,:max{),( jixyjyx ijri syyyy 21

skkr yyyxxxyx 2121

Page 7: Greedy Algorithms for the Shortest Common Superstring

Example

turenaare yx

turena naare yx

2),( yx

Page 8: Greedy Algorithms for the Shortest Common Superstring

Total optimal overlap

• Let us have a set of n strings

• Assume that we had solved the SCS

• The minimal superstring is S

Sxn

i

ioptn

1

Page 9: Greedy Algorithms for the Shortest Common Superstring

Descriptions of algorithms• Generic greedy algorithm:

;0};,,,,{ 321 grn

nxxxxI repeat

;;, yxzIyx choose };{}),{\( zyxII );,( yxgr

ngrn

1I until• GREEDY

•In step (*) we choose x,y in order to maximize o(x,y);

• RGREEDY •In step (*) x is a string z from the previous iteration. It means that we have only one long string, which grows by addition of strings at the right-hand end.

(*)

Page 10: Greedy Algorithms for the Shortest Common Superstring

Bernoulli model• All strings are of the same length• Symbols are independent on the previous ones.• Let

be the associated entropy for the Bernoulli model where

m

iii ppH

1

log

)( iii xp

Page 11: Greedy Algorithms for the Shortest Common Superstring

Main result

• Theorem. Let us consider the SCS problem under the Bernoulli model.

Then, with high probability:

provided where

,1

loglim,

1log

limHnnHnn

grn

n

optn

n

nP

x i loglog

4

M

jipP

1

2

Page 12: Greedy Algorithms for the Shortest Common Superstring

Another models• Markovian model:

each symbol depends only on the previous one

• Mixing modelWhen we are solving more complicated problems using SCS, two previous models are too unrealistic. Then it needs to use mixing model. The main idea of it is as follows:

Let we have the string: ...100110110010

The farther the symbol the lesser the dependence on it

Page 13: Greedy Algorithms for the Shortest Common Superstring

Compression• SCS can be used to compress strings.

• Instead of storing all strings of total length nℓ we can store the SCS and n pointers indicating the beginning of an original strings plus lengths of all strings.

• Compression ratio will be:

• But when the length of a string grows faster then logn, then compression ratio will be 1

nl

nnH

nlnnH

nlCn

)log1

(loglog1

2

Page 14: Greedy Algorithms for the Shortest Common Superstring

Ideas of the proof• First of all we shall show that it is unlikely that there is a

pair of strings with overlap more than half of their length

• Let E denotes the event that there is no such a pair• If , then

provided

2),(l

xx ji

)1()()()( 2)log(

2

22

PK

l

nPl

k

kn

PK

log4

nKl log

Page 15: Greedy Algorithms for the Shortest Common Superstring

Two halves of a string

• Let’s consider a part of our superstring:

...ancneanaosaasunanssana..• Overlaps of two nearby strings are marked by red.• Each string has two marked overlaps the “tail” and the “nose”.• Knowing that such overlaps practically never takes more than

half of the string, we would divide every string into two parts: the first half and the second one, each has a length of l/2.

• Then, if we want to consider an overlap of two strings a and b, we should operate only with the first half of a and the second part of b.

Page 16: Greedy Algorithms for the Shortest Common Superstring

RGREEDY and trees• We now consider a tree process that

is equal to RGREEDY

• Tree T would be an infinite rooted M-ary tree.

• M (size of an alphabet) edges leading down from each vertex will be labeled with

• Thus, each vertex of depth d is identified with string of length d.

M ,...,, 21

Page 17: Greedy Algorithms for the Shortest Common Superstring

Modeling RGREADY• For the each string we

will mark each vertex with a number of strings that has prefix associated with this vertex.

• “k” means that there are k strings starting from 1M

Page 18: Greedy Algorithms for the Shortest Common Superstring

Example1. We will climb down to our tree, following letters from the second half of the first string, to the depth of l/2.2. We’ll stop at the highest positive integer, let’s call it t.3. So, we can find t strings that have suffixes equal to the prefix of the first string. 4. Let’s consider one of these t stings in the similar way (see 1.). We’ll repeat such procedure n times.

Page 19: Greedy Algorithms for the Shortest Common Superstring

GREEDY

• Let D be the digraph – with edges weights

• We sort the edges A into

so that• Thus the problem is to find a path along the edges,

which has the maximum weight.

)],([ An

),(,ji

ji ab 2

21 ,,...,, nNeee N )()( 1 ii ee