

  • Sofia University St. Kliment Ohridski
    Faculty of Mathematics and Informatics

    Department of Mathematical Logic and Applications

    Universal Levenshtein Automata. Building and Properties

    A thesis submitted for the degree of Master of Computer Science

    by Petar Nikolaev Mitankin

    supervisor: Dr. Stoyan Mihov

    Sofia, 2005

  • Contents

    1 Introduction.

    2 Levenshtein distances. Properties.

    3 Nondeterministic finite Levenshtein automata for fixed word.

    4 Deterministic finite Levenshtein automata for fixed word.

    5 Universal Levenshtein automata.

    6 Building of $A^{\forall,\epsilon}_n$, $A^{\forall,t}_n$ and $A^{\forall,ms}_n$.
        6.1 Summarized pseudo code
        6.2 Detailed pseudo code
        6.3 Complexity
        6.4 Some final results

    7 Minimality of $A^{\forall,\epsilon}_n$, $A^{\forall,t}_n$ and $A^{\forall,ms}_n$.

    8 Some properties of $A^{\forall,\epsilon}_n$.

  • 1 Introduction.

    One possible measure of the proximity of two strings is the so-called Levenshtein distance (also known as edit distance), based on primitive edit operations. Primitive edit operations are replacement of one symbol with another (substitution), deletion of a symbol, insertion of a symbol and others. The distance between two strings w and v is defined as the minimal number of primitive edit operations that transform w into v.

    This master thesis gives a detailed formal review of the so-called universal Levenshtein automaton. The input word for this automaton is a sequence of bit vectors i(w, v), which is computed from two given words w and v. The automaton recognizes i(w, v) iff the distance between w and v is not greater than n.

    The greatest advantage of the universal Levenshtein automaton $A^{\forall,\epsilon}_n$ is obtained when we have to extract from a dictionary all words v that are close enough to a given word w. If the dictionary is represented as a deterministic finite automaton D, we can traverse the two automata $A^{\forall,\epsilon}_n$ and D in parallel to find all these words. A description of this algorithm and of its modified version, called the forward-backward method, which is extremely fast in practice, can be found in [MSFASLD].

    Short review of the contents
    Section 2 - definition of three different Levenshtein distances based on the number of edit operations. Section 3 - definition of the nondeterministic Levenshtein automaton $A^{ND,\epsilon}_n(w)$ and proof that the language of $A^{ND,\epsilon}_n(w)$ consists of all strings x such that the distance between w and x is not greater than n. Section 4 - definition of the deterministic Levenshtein automaton $A^{D,\epsilon}_n(w)$ and proof that the languages of $A^{ND,\epsilon}_n(w)$ and $A^{D,\epsilon}_n(w)$ are equal. The universal Levenshtein automaton $A^{\forall,\epsilon}_n$ is defined in Section 5. Section 6 - the algorithm for building it. Section 7 - proof that $A^{\forall,\epsilon}_n$ is minimal. Section 8 - some properties of $A^{\forall,\epsilon}_n$.

    Remarks
    The aim of this master thesis is to review the deterministic Levenshtein automata and the universal Levenshtein automata presented by their authors Mihov and Schulz in [SMFSCLA] and [MSFASLD]. The main efforts in this master thesis are concentrated on the strict proofs and the details.

    This paper is a draft translation of the original text with additional comments and more figures. The original can be found at [ORIG].

    The term Levenshtein distances is used in the text for $d^{\epsilon}_L$, $d^t_L$ and $d^{ms}_L$, although for the words $w_1 = abcd$, $w_2 = abdc$ and $w_3 = bdac$ the triangle inequality is not satisfied for $d^t_L$: $d^t_L(abcd, abdc) = 1$ and $d^t_L(abdc, bdac) = 2$, but $d^t_L(abcd, bdac) = 4$.


  • 2 Levenshtein distances. Properties.

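    Let $\Sigma$ be a finite set of letters.

    Definition 1 $d^{\epsilon}_L : \Sigma^* \times \Sigma^* \to \mathbb{N}$

    Let $v, w, v', w' \in \Sigma^*$ and $a, b \in \Sigma$.
    1) $v = \epsilon$ or $w = \epsilon$

    \[ d^{\epsilon}_L(v, w) \stackrel{def}{=} \max(|v|, |w|) \]

    2) $|v| \geq 1$ and $|w| \geq 1$
    Let $v = av'$ and $w = bw'$.

    \[ d^{\epsilon}_L(v, w) \stackrel{def}{=} \min(\ \mathrm{if}(a = b,\ d^{\epsilon}_L(v', w'),\ \infty),\ 1 + d^{\epsilon}_L(v', bw'),\ 1 + d^{\epsilon}_L(av', w'),\ 1 + d^{\epsilon}_L(v', w')\ ) \]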

    Notations Here and in what follows the value of the expression if(Condition, ValueIfConditionIsTrue, ValueIfConditionIsFalse) is ValueIfConditionIsTrue if Condition is satisfied and ValueIfConditionIsFalse otherwise. $|x|$ denotes the length of $x$.

    The function $d^{\epsilon}_L$ is called Levenshtein distance. $d^{\epsilon}_L(v, w)$ is called the Levenshtein distance between the words v and w. It is the minimal number of primitive edit operations that transform v into w. The primitive edit operations are deletion of a letter, insertion of a letter and substitution of one letter with another.
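The recursion of Definition 1 can be transcribed almost literally. The following is a minimal sketch, assuming words are Python strings; the function name `d_L` and the use of memoization are choices made for this illustration and are not taken from the thesis.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def d_L(v: str, w: str) -> int:
    """Levenshtein distance, following the two cases of Definition 1."""
    # Case 1: one of the words is empty.
    if not v or not w:
        return max(len(v), len(w))
    # Case 2: v = a v' and w = b w'.
    a, v_rest = v[0], v[1:]
    b, w_rest = w[0], w[1:]
    candidates = [
        1 + d_L(v_rest, w),       # 1 + d(v', bw'): delete a
        1 + d_L(v, w_rest),       # 1 + d(av', w'): insert b
        1 + d_L(v_rest, w_rest),  # 1 + d(v', w'): substitute a by b
    ]
    if a == b:
        # if(a = b, d(v', w'), infinity): the branch exists only when a = b
        candidates.append(d_L(v_rest, w_rest))
    return min(candidates)
```

For example, `d_L("kitten", "sitting")` evaluates the recursion on all pairs of suffixes; memoization keeps the number of distinct calls bounded by (|v| + 1)(|w| + 1).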

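    Definition 2 $\ominus : \Sigma^* \times \mathbb{N} \to \Sigma^*$
    Let $k \in \mathbb{N}$, $x_1, x_2, ..., x_k \in \Sigma$ and $t \in \mathbb{N}$.

    \[ x_1x_2...x_k \ominus t \stackrel{def}{=} \begin{cases} \epsilon & \text{if } t \geq k \\ x_{t+1}x_{t+2}...x_k & \text{otherwise} \end{cases} \]

    Treating the transposition of two letters also as a primitive edit operation, we obtain the following definition of Levenshtein distance extended with transposition:

    Definition 2 $d^t_L : \Sigma^* \times \Sigma^* \to \mathbb{N}$

    Let $v, w, v', w' \in \Sigma^*$ and $a, b, a_1, b_1 \in \Sigma$.
    1) $v = \epsilon$ or $w = \epsilon$

    \[ d^t_L(v, w) \stackrel{def}{=} \max(|v|, |w|) \]

    2) $|v| \geq 1$ and $|w| \geq 1$
    Let $v = av'$ and $w = bw'$.

    \[ d^t_L(v, w) \stackrel{def}{=} \min(\ \mathrm{if}(a = b,\ d^t_L(v', w'),\ \infty),\ 1 + d^t_L(v', bw'),\ 1 + d^t_L(av', w'),\ 1 + d^t_L(v', w'),\ \mathrm{if}(a_1 < v'\ \&\ b_1 < w'\ \&\ a = b_1\ \&\ a_1 = b,\ 1 + d^t_L(v \ominus 2, w \ominus 2),\ \infty)\ ) \]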

    Notations We use c < d to denote that c is a prefix of d if c and d are words.

    The function dtL is called Levenshtein distance extended with transposition.
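The definition with transposition can be sketched in the same style. This is a minimal illustration, assuming words are Python strings; the name `d_t` is hypothetical, and dropping the first two letters of a word is written as the slice `[2:]`. The sketch can be used to recompute the three values quoted in the Remarks of the Introduction for the words abcd, abdc and bdac.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def d_t(v: str, w: str) -> int:
    """Levenshtein distance extended with transposition (recursive definition)."""
    if not v or not w:
        return max(len(v), len(w))
    a, b = v[0], w[0]
    candidates = [
        1 + d_t(v[1:], w),      # deletion
        1 + d_t(v, w[1:]),      # insertion
        1 + d_t(v[1:], w[1:]),  # substitution
    ]
    if a == b:
        candidates.append(d_t(v[1:], w[1:]))
    # Transposition: v = a a1 ..., w = b b1 ... with a = b1 and a1 = b.
    if len(v) >= 2 and len(w) >= 2 and a == w[1] and v[1] == b:
        candidates.append(1 + d_t(v[2:], w[2:]))
    return min(candidates)


def triangle_inequality_holds(x: str, y: str, z: str) -> bool:
    """Check d_t(x, z) <= d_t(x, y) + d_t(y, z) for one triple of words."""
    return d_t(x, z) <= d_t(x, y) + d_t(y, z)
```

Calling `triangle_inequality_holds("abcd", "abdc", "bdac")` evaluates the triple from the Introduction.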

    When merging of two letters into one and splitting of one letter into two other letters are considered as primitive edit operations, we use the following definition of Levenshtein distance extended with merge and split:

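    Definition 3 $d^{ms}_L : \Sigma^* \times \Sigma^* \to \mathbb{N}$

    Let $v, w, v', w' \in \Sigma^*$ and $a, b \in \Sigma$. Here $x \ominus 2$ denotes the word $x$ without its first two letters.
    1) $v = \epsilon$ or $w = \epsilon$

    \[ d^{ms}_L(v, w) \stackrel{def}{=} \max(|v|, |w|) \]

    2) $|v| \geq 1$ and $|w| \geq 1$
    Let $v = av'$ and $w = bw'$.

    \[ d^{ms}_L(v, w) \stackrel{def}{=} \min(\ \mathrm{if}(a = b,\ d^{ms}_L(v', w'),\ \infty),\ 1 + d^{ms}_L(v', bw'),\ 1 + d^{ms}_L(av', w'),\ 1 + d^{ms}_L(v', w'),\ \mathrm{if}(|w| \geq 2,\ 1 + d^{ms}_L(v', w \ominus 2),\ \infty),\ \mathrm{if}(|v| \geq 2,\ 1 + d^{ms}_L(v \ominus 2, w'),\ \infty)\ ) \]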

    The function $d^{ms}_L$ is called Levenshtein distance extended with merge and split.
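The merge and split variant differs from the plain distance only in two extra branches. The sketch below assumes words are Python strings; the name `d_ms` is hypothetical and the slices `[1:]` and `[2:]` stand for removing the first one or two letters.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def d_ms(v: str, w: str) -> int:
    """Levenshtein distance extended with merge and split (recursive definition)."""
    if not v or not w:
        return max(len(v), len(w))
    candidates = [
        1 + d_ms(v[1:], w),      # deletion
        1 + d_ms(v, w[1:]),      # insertion
        1 + d_ms(v[1:], w[1:]),  # substitution
    ]
    if v[0] == w[0]:
        candidates.append(d_ms(v[1:], w[1:]))
    if len(w) >= 2:
        # split: one letter of v corresponds to two letters of w
        candidates.append(1 + d_ms(v[1:], w[2:]))
    if len(v) >= 2:
        # merge: two letters of v correspond to one letter of w
        candidates.append(1 + d_ms(v[2:], w[1:]))
    return min(candidates)
```

A typical use is modelling recognition errors in which one letter is read as two, for instance the pair "m" and "rn", which is at distance 1 under this definition and at distance 2 under the plain Levenshtein distance.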

    Notations We use $\chi$ as a metasymbol. For example $d^{\chi}_L$ denotes $d^{\epsilon}_L$, $d^t_L$ or $d^{ms}_L$ if $\chi \in \{\epsilon, t, ms\}$.

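    Proposition 1 Let $\chi \in \{\epsilon, t, ms\}$ and $v, w \in \Sigma^*$. Then $d^{\chi}_L(v, w) = 0 \Leftrightarrow v = w$.

    Proof
    $\Leftarrow$) Let $v = w = x$. Using induction on $|x|$ we prove that $d^{\chi}_L(x, x) = 0$.
    1) $|x| = 0$
    $d^{\chi}_L(x, x) = d^{\chi}_L(\epsilon, \epsilon) = 0$

    2) Induction hypothesis: $d^{\chi}_L(x, x) = 0$
    Let $a \in \Sigma$. We prove that $d^{\chi}_L(ax, ax) = 0$:

    \[ d^{\chi}_L(ax, ax) = \min(\ \mathrm{if}(a = a,\ d^{\chi}_L(x, x),\ \infty),\ ...\ ) = \min(\ \mathrm{if}(a = a,\ 0,\ \infty),\ ...\ ) = 0 \]

    $\Rightarrow$) With induction on $|v|$ we prove that $d^{\chi}_L(v, w) = 0 \Rightarrow v = w$.
    1) $v = \epsilon$. Let $d^{\chi}_L(v, w) = 0$. Then $d^{\chi}_L(v, w) = \max(|v|, |w|) = 0$. Hence $w = \epsilon$.

    2) Induction hypothesis: $\forall w\ (d^{\chi}_L(v, w) = 0 \Rightarrow v = w)$
    Let $a \in \Sigma$ and $w \in \Sigma^*$. We have to prove that $d^{\chi}_L(av, w) = 0 \Rightarrow av = w$.

    Let $d^{\chi}_L(av, w) = 0$. From the definition of $d^{\chi}_L$ it follows that $|w| \geq 1$. Let $b \in \Sigma$, $w' \in \Sigma^*$ and $w = bw'$. Every argument of the minimum other than the first one is at least 1, so from the definition of $d^{\chi}_L$ it follows that $a = b$ and $d^{\chi}_L(v, w') = 0$. The induction hypothesis implies that $v = w'$. Therefore $av = w$.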

    Proposition 2 Let $\chi \in \{\epsilon, t, ms\}$ and $v, w \in \Sigma^*$. Then $d^{\chi}_L(v, w) = d^{\chi}_L(w, v)$.

    The proof of Proposition 2 is straightforward.

    Remark Given Proposition 1 and Proposition 2, only the triangle inequality for $d^{\chi}_L$ ( $d^{\chi}_L(v, w) \leq d^{\chi}_L(v, x) + d^{\chi}_L(x, w)$ ) remains to be proved in order to show that $d^{\chi}_L$ is a distance. This property is not used anywhere in this paper, so we do not prove it.

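    Definition 4 Let $\chi \in \{\epsilon, t, ms\}$.
    $L^{\chi}_{Lev} : \mathbb{N} \times \Sigma^* \to P(\Sigma^*)$

    \[ L^{\chi}_{Lev}(n, w) \stackrel{def}{=} \{ v \mid d^{\chi}_L(v, w) \leq n \} \]

    We can find the definitions of $L^{\epsilon}_{Lev}$, $L^t_{Lev}$ and $L^{ms}_{Lev}$ in [SMFSCLA].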
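For a small alphabet the set of Definition 4 can be enumerated by brute force, which is useful for checking examples by hand. The sketch below covers only the plain distance (the case without transposition, merge or split); the names `d_L` and `lev_language` are hypothetical, and the alphabet is passed explicitly as a string of letters. It relies on the fact that a word longer than |w| + n cannot be within distance n of w.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def d_L(v: str, w: str) -> int:
    """Plain Levenshtein distance (recursive definition)."""
    if not v or not w:
        return max(len(v), len(w))
    candidates = [1 + d_L(v[1:], w), 1 + d_L(v, w[1:]), 1 + d_L(v[1:], w[1:])]
    if v[0] == w[0]:
        candidates.append(d_L(v[1:], w[1:]))
    return min(candidates)


def lev_language(n: int, w: str, alphabet: str) -> set:
    """All words v over `alphabet` with d_L(v, w) <= n, found by enumeration."""
    result = set()
    # Words longer than |w| + n are at distance greater than n from w.
    for length in range(len(w) + n + 1):
        for letters in product(alphabet, repeat=length):
            v = "".join(letters)
            if d_L(v, w) <= n:
                result.add(v)
    return result
```

The enumeration is exponential in |w| + n, so it is meant only for very short words, for example `lev_language(1, "ab", "ab")`.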

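    Proposition 3 Let $\chi \in \{\epsilon, t, ms\}$. Let $a \in \Sigma$ and $v, w \in \Sigma^*$. Then $d^{\chi}_L(v, w) = k \Rightarrow d^{\chi}_L(av, w) \leq k + 1$.

    Proof Let $d^{\chi}_L(v, w) = k$.
    1) $w = \epsilon$
    $d^{\chi}_L(av, w) = d^{\chi}_L(av, \epsilon) = k + 1$

    2) $|w| \geq 1$
    From the definition of $d^{\chi}_L$ it follows that $d^{\chi}_L(av, w) \leq 1 + d^{\chi}_L(v, w) = k + 1$.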

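    Proposition 4 Let $\chi \in \{\epsilon, t, ms\}$. Let $a, w_1 \in \Sigma$ and $v, w \in \Sigma^*$. Then $d^{\chi}_L(v, w) = k \Rightarrow d^{\chi}_L(av, w_1w) \leq k + 1$.

    Proof Let $d^{\chi}_L(v, w) = k$. From the definition of $d^{\chi}_L$ it follows that $d^{\chi}_L(av, w_1w) \leq 1 + d^{\chi}_L(v, w) = k + 1$.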

    Proposition 5 Let $\chi \in \{\epsilon, t, ms\}$. Let $w_1 \in \Sigma$ and $v, w \in \Sigma^*$. Then $d^{\chi}_L(v, w) = k \Rightarrow d^{\chi}_L(v, w_1w) \leq k + 1$.

    Proof Proposition 5 follows directly from Proposition 3 and Proposition 2.


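    Proposition 6 Let $\chi \in \{\epsilon, t, ms\}$. Let $w_1 \in \Sigma$ and $v, w \in \Sigma^*$. Then $d^{\chi}_L(v, w) = k \Rightarrow d^{\chi}_L(w_1v, w_1w) \leq k$.

    Proof Let $d^{\chi}_L(v, w) = k$. From the definition of $d^{\chi}_L$ it follows that $d^{\chi}_L(w_1v, w_1w) \leq d^{\chi}_L(v, w) = k$.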
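Propositions 3 to 6 can be sanity-checked by exhaustive enumeration over short words. The sketch below is only an illustration and not a substitute for the proofs: the function `dist` with the parameter `chi` (`""` for the plain distance, `"t"` for transposition, `"ms"` for merge and split) and the helpers `words` and `propositions_hold` are all hypothetical names introduced here.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def dist(v: str, w: str, chi: str) -> int:
    """The three distances, selected by chi in {"", "t", "ms"}."""
    if not v or not w:
        return max(len(v), len(w))
    cands = [1 + dist(v[1:], w, chi), 1 + dist(v, w[1:], chi),
             1 + dist(v[1:], w[1:], chi)]
    if v[0] == w[0]:
        cands.append(dist(v[1:], w[1:], chi))
    if chi == "t" and len(v) >= 2 and len(w) >= 2 and v[0] == w[1] and v[1] == w[0]:
        cands.append(1 + dist(v[2:], w[2:], chi))
    if chi == "ms":
        if len(w) >= 2:
            cands.append(1 + dist(v[1:], w[2:], chi))
        if len(v) >= 2:
            cands.append(1 + dist(v[2:], w[1:], chi))
    return min(cands)


def words(alphabet: str, max_len: int):
    """Yield all words over `alphabet` of length at most `max_len`."""
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)


def propositions_hold(alphabet: str = "ab", max_len: int = 3) -> bool:
    """Check Propositions 3, 4, 5 and 6 on all pairs of short words."""
    for chi in ("", "t", "ms"):
        for v in words(alphabet, max_len):
            for w in words(alphabet, max_len):
                k = dist(v, w, chi)
                for a in alphabet:
                    if dist(a + v, w, chi) > k + 1:          # Proposition 3
                        return False
                    if dist(v, a + w, chi) > k + 1:          # Proposition 5
                        return False
                    if dist(a + v, a + w, chi) > k:          # Proposition 6
                        return False
                    for b in alphabet:
                        if dist(a + v, b + w, chi) > k + 1:  # Proposition 4
                            return False
    return True
```

The default arguments keep the search small (15 words over a two-letter alphabet); larger values of `max_len` grow the number of pairs quickly.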

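    Proposition 7 Let $\chi \in \{\epsilon, t, ms\}$. Let $w \in \Sigma^*$, $w = w_1w_2...w_p$, $p \geq 1$ and $n > 0$. Then

    \[ L^{\chi}_{Lev}(n, w) \supseteq \Sigma.L^{\chi}_{Lev}(n-1, w) \cup \Sigma.L^{\chi}_{Lev}(n-1, w_2w_3...w_p) \cup L^{\chi}_{Lev}(n-1, w_2w_3...w_p) \cup w_1.L^{\chi}_{Lev}(n, w_2w_3...w_p). \]

    Proof From Propositions 3, 4, 5 and 6 it follows respectively that
    $L^{\chi}_{Lev}(n, w) \supseteq \Sigma.L^{\chi}_{Lev}(n-1, w)$,
    $L^{\chi}_{Lev}(n, w) \supseteq \Sigma.L^{\chi}_{Lev}(n-1, w_2w_3...w_p)$,
    $L^{\chi}_{Lev}(n, w) \supseteq L^{\chi}_{Lev}(n-1, w_2w_3...w_p)$ and
    $L^{\chi}_{Lev}(n, w) \supseteq w_1.L^{\chi}_{Lev}(n, w_2w_3...w_p)$.

    Therefore

    \[ L^{\chi}_{Lev}(n, w) \supseteq \Sigma.L^{\chi}_{Lev}(n-1, w) \cup \Sigma.L^{\chi}_{Lev}(n-1, w_2w_3...w_p) \cup L^{\chi}_{Lev}(n-1, w_2w_3...w_p) \cup w_1.L^{\chi}_{Lev}(n, w_2w_3...w_p). \]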
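The inclusion of Proposition 7 can be checked on small instances by building both sides explicitly. The sketch below does this for the plain distance only and by brute-force enumeration; `d_L`, `lev_language` and `check_proposition_7` are hypothetical names, and the alphabet is again passed as a string of letters.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def d_L(v: str, w: str) -> int:
    """Plain Levenshtein distance (recursive definition)."""
    if not v or not w:
        return max(len(v), len(w))
    candidates = [1 + d_L(v[1:], w), 1 + d_L(v, w[1:]), 1 + d_L(v[1:], w[1:])]
    if v[0] == w[0]:
        candidates.append(d_L(v[1:], w[1:]))
    return min(candidates)


def lev_language(n: int, w: str, alphabet: str) -> set:
    """All words v over `alphabet` with d_L(v, w) <= n."""
    return {
        "".join(letters)
        for length in range(len(w) + n + 1)
        for letters in product(alphabet, repeat=length)
        if d_L("".join(letters), w) <= n
    }


def check_proposition_7(n: int, w: str, alphabet: str) -> bool:
    """True if the union of the four sets is contained in the language of (n, w)."""
    assert n > 0 and len(w) >= 1
    tail = w[1:]  # w_2 w_3 ... w_p
    union = set()
    union |= {a + v for a in alphabet for v in lev_language(n - 1, w, alphabet)}
    union |= {a + v for a in alphabet for v in lev_language(n - 1, tail, alphabet)}
    union |= lev_language(n - 1, tail, alphabet)
    union |= {w[0] + v for v in lev_language(n, tail, alphabet)}
    return union <= lev_language(n, w, alphabet)
```

For instance, `check_proposition_7(1, "ab", "ab")` compares the two sides for the word ab over the alphabet {a, b} with n = 1.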

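    We show how to extend

    \[ A = \Sigma.L^{\chi}_{Lev}(n-1, w) \cup \Sigma.L^{\chi}_{Lev}(n-1, w_2w_3...w_p) \cup L^{\chi}_{Lev}(n-1, w_2w_3...w_p) \cup w_1.L^{\chi}_{Lev}(n, w_2w_3...w_p) \]

    to $L^{\chi}_{Lev}(n, w)$. First we define $R^{\chi}$ as an extension of $A$ and afterwards we prove that $R^{\chi} = L^{\chi}_{Lev}$.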

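    Definition 5 Let $\chi \in \{\epsilon, t, ms\}$.
    $R^{\chi} : \mathbb{N}^+ \times \Sigma^+ \to P(\Sigma^*)$
    Let $w \in \Sigma^+$, $w = w_1w_2...w_p$, $p \geq 1$ and $n \geq 1$.
    1) $\chi = \epsilon$

    \[ R^{\epsilon}(n, w) \stackrel{def}{=} \Sigma.L^{\epsilon}_{Lev}(n-1, w) \cup \Sigma.L^{\epsilon}_{Lev}(n-1, w_2w_3...w_p) \cup L^{\epsilon}_{Lev}(n-1, w_2w_3...w_p) \cup w_1.L^{\epsilon}_{Lev}(n, w_2w_3...w_p) \]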


    2) $\chi = t$

    $R^t(n, w) \stackrel{def}{=} \Sigma.L^t_{Lev}(n-1, w)$
    $\cup\ \Sigma.L^t_{Lev}(n-1, w_2w_3...w_p)$
    $\cup\ L^t_{Lev}(n-1, w_2w_3...w_p)$
    $\cup\ w_1.L^t_{Lev}(n, w_2w_3...w_p)$

    i