IGAUNA (IMPROVED GLOBAL SEQUENCE
ALIGNMENT USING NON-EXACT ANCHORS)
All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.
APPROVAL
Name: IlIwsoutl H i ~ r i ~ ti
Degree: MASTER OF SCIENCE
Title of thesis: IGAUNA (Improved Global Sequence Alignment Using Non-Exact Anchors)
Datc Approved: l o ~ + I ~ / Z ~ C - F
Simon Fraser University Library
DECLARATION OF PARTIAL COPYRIGHT LICENCE
The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.
The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the "Institutional Repository" link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.
The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.
It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.
Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.
The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.
Simon Fraser University Library Burnaby, BC, Canada
Revised: Spring 2007
Abstract
Contents
Approval
Abstract
Quotation
Contents
List of Tables
List of Figures
1 Introduction
  1.1 Motivations
  1.2 Our Contributions
  1.3 Thesis Organization
2 Background
  2.1 Biological Background
  2.2 Biological Terms
  2.3 Computer Science Background
3 Previous Work On Global Sequence Alignment
  3.1 Scoring Methods
  3.2 Dynamic Programming
  3.3 Anchor-Based/Hit Methods
    3.3.1 FASTA
    3.3.2 BLAST
    3.3.3 CHAOS
    3.3.4 LAGAN
    3.3.5 GLASS
    3.3.6 MUMmer
    3.3.7 ClustalW
    3.3.8 AVID
    3.3.9 MGA
  3.4 Genetic Algorithm
  3.5 Hidden Markov Model
4 General Algorithms For Anchor-Based Methods
  4.1 Building Suffix Tree
    4.1.1 High Level Ukkonen's Algorithm
    4.1.2 Speedup Technique, Part 1
    4.1.3 Speedup Technique, Part 2
  4.2 Finding Maximum-Weight Anchor Set
5 GAUNA
  5.1 Method description
  5.2 Finding Maximal Inexact Matches
  5.3 Selecting Anchors
    5.3.1 Finding Largest Total Weight Non-crossing Anchors
  5.4 Closing The Gaps
  5.5 GAUNA Parameters
7 IGAUNA
  7.1 Measuring An Alignment
  7.2 Improvements to GAUNA
    7.2.1 Exon Weight Adjustment
    7.2.2 Branching
    7.2.3 Parameter Optimization
  7.3 Optimal Alignment
  7.4 IGAUNA Parameters
8 IGAUNA Results and Conclusion
  8.1 Experimental Settings
  8.2 Parameter Settings
    8.2.1 K-value Sets
  8.3 Alignment Results
    8.3.1 Memory Usage
    8.3.2 Speed
    8.3.3 Quality Of Alignments
    8.3.4 IGAUNA Improvements Compared To GAUNA
    8.3.5 Summary of results
  8.4 Conclusion and Future Work
Bibliography
List of Tables
6.1 GAUNA Specificity
6.2 GAUNA Global Alignment Results
8.1 K-value Effect
8.2 Mouse Dog Alignment Results
8.3 Mouse Chicken Alignment Results
8.4 Human Dog Alignment Results
8.5 Human Chicken Alignment Results
8.6 Human Mouse Alignment Results
8.7 Human Rat Alignment Results
List of Figures
Introduction
After the discovery of the structure of DNA in 1953 [5], our knowledge of organisms and their building structures has vastly grown. This has in turn resulted in the creation of new branches of science dedicated to studying the main building blocks of life. Molecular biology, genetics and genomics are such new fields branching from biology, and with the ever-growing involvement of mathematics and computer science in these fields, disciplines such as computational biology have been created. Genetics is the area of biological study concerned with heredity and with the variations between organisms that result from it. Genomics is a recent scientific discipline with the aim of defining and characterizing the complete genetic makeup of an organism.
The inherent mathematical structure of DNA and the algorithmic processes used to express proteins have led to a close collaboration between molecular biology, computer science, and mathematics. As a result, computational biology has been created, which is an interdisciplinary field that applies the techniques of computer science and applied mathematics to problems inspired by biology. As a discipline, computational biology is a relatively new field, but there has been a virtual explosion of work in universities, government research labs and the private sector.
This field is relatively young, and research in this field mainly started after the establishment of the Human Genome Project (HGP) in 1990. In 1994 the U.S. Department of Energy (DOE) established the Microbial Genome Program (MGP) as a companion to HGP, and since then many new discoveries have been made in this field. Although from the computer science/mathematical point of view, a few algorithms with respect to sequence
1.1 Motivations
of living organisms, can help us understand more about diseases and design better drugs. Therefore we need sequence alignment methods that give us high quality alignments to be used to extract biologically valuable information about living organisms.
With the ever-growing biological databases such as NCBI (National Center for Biotechnology Information)¹ and PDB (Protein Data Bank)², there is a huge amount of processed and unprocessed data available to scientists. Sometimes to find a good starting point to
¹ http://www.ncbi.nlm.nih.gov/
² http://www.umass.edu/microbio/rasmol/pdblite.htm
1.2 Our Contributions
on a typical desktop/laptop computer, and their execution time is usually too long for them to be used frequently. The other ones, which require less time to execute, produce less reliable results with lower, and sometimes unacceptable, quality.
We introduce IGAUNA, a new algorithm and program to find global pairwise alignments with very high quality results and in a very efficient manner, and executable on a typical laptop for large sequences.
We also introduce a new way of measuring the quality of sequences and introduce a so-called optimal global alignment between two sequences, which can be used to measure the quality of a given alignment.
1.3 Thesis Organization
Background
2.1 Biological Background
obtained from a theoretical point of view should be exhaustively tested in the lab to make sure it does not have any major side effects.
Having the above points in mind, the reasons why one should be familiar with the biolog-
different dosages and drugs are prescribed for different people according to their genome, i.e. the whole hereditary information of an organism encoded in DNA, which makes their response to therapy dissimilar; identification of drug targets, which are proteins whose functions can be modified selectively and help to cure a disease; and last but not least, identification of missing or defective genes and replacement or supplying of their products
2 .2 Biological Terms
The term exon was coined by the American biochemist Walter Gilbert in 1978⁵. The exons are typically multiples of three nucleotides (every triplet of bases, called a codon, is translated into a certain amino acid [26]). But not all the information inside the DNA is expressed as proteins or RNA; some regions of the DNA sequence are devoted to control mechanisms.
⁵ http://en.wikipedia.org/wiki/Talk:Exon
2.3 Computer Science Background
approaches to solve this problem.
Exact string matching and sequence alignment are relatively old topics of computer science, but their recent extensive applications in bioinformatics have resulted in renewed attention to these problems.
2.4 Computer Science Terms
When building a suffix tree for a string, sometimes a suffix of the string can be part of a longer suffix and therefore, its end position might be on an edge or an internal node of the tree. To ensure that each suffix actually ends at a leaf, a unique character which is not part of the alphabet is added to the end of the string. This character is usually denoted by $ and is called the terminal symbol. Hence, the suffix tree is actually built on S$. Figure 2.1 shows a suffix tree built on the sequence ATTATC.
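The effect of the terminal symbol can be checked directly on a small string. The sketch below is only an illustration: the helper names and the example string ATTAT are assumptions chosen for this note, not taken from the thesis. It lists the suffixes that are a proper prefix of another suffix, which are exactly the ones that could not end at a leaf.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def hidden_suffixes(s):
    """Suffixes of s that are a proper prefix of another suffix of s.

    Such a suffix ends on an edge or at an internal node of the
    suffix tree instead of at a leaf.
    """
    sufs = suffixes(s)
    return [a for a in sufs
            if any(b != a and b.startswith(a) for b in sufs)]


# Hypothetical example string: "AT" and "T" are prefixes of longer suffixes.
print(hidden_suffixes("ATTAT"))
# After appending the terminal symbol, no suffix is hidden any more.
print(hidden_suffixes("ATTAT" + "$"))
```

In ATTATC the final C happens to occur only once, so it already behaves like a terminal symbol; appending $ guarantees this property for any input string.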
r ; ~ ~ ~ ( l o i ~ i l y g o i ~ c r a l ~ ( l c ;~~ i ( l i ( l ;~ l ,w , of' ( :o l i rs~ , iimsl, \vill 1101 get, i l r ( ! ;~so~~i~l) l ( ! s(:orc, :11i(1 thy
will IJC tlclrt,etl. IIowcvcr! ;I fow of tllc inst.i~nc:c:s rr~igl~t. gct ;I roaso~~nl)lc scoro (sl~ow nc:l.ivit,y)
a t d I.h(:sc citll I)c 11sctl Iowil~.tl Curt.hor solvil~g Lh(: p ~ ~ l ) l c ~ ~ i . T~ICSC c~111t1id;~tc:s arc Itc,pL i l11~1
Previous Work On Global Sequence Alignment
3.1 Scoring Methods
A more general and still easy-to-implement method of assigning scores is to use matrices that contain the score of pairwise alignments between every pair of characters of the alphabet being used. Examples are PAM [12] and BLOSUM [15], which are widely used in protein alignment algorithms. We use a similar matrix for pairwise nucleotide scores in IGAUNA.
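As an illustration of matrix-based scoring, the following sketch builds a small nucleotide matrix and scores an already-gapped alignment column by column. The numeric values, the transition/transversion distinction and all names are assumptions made for this example; they are not the matrix used by IGAUNA.

```python
# Hypothetical scores, chosen only for illustration (not IGAUNA's values).
BASES = "ACGT"
MATCH, TRANSITION, TRANSVERSION, GAP = 2, -1, -2, -3
PURINES = {"A", "G"}


def build_matrix():
    """Return a dict mapping every ordered base pair to a score."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[x, y] = MATCH
            elif (x in PURINES) == (y in PURINES):
                matrix[x, y] = TRANSITION      # A<->G or C<->T
            else:
                matrix[x, y] = TRANSVERSION    # purine <-> pyrimidine
    return matrix


SCORES = build_matrix()


def alignment_score(row1, row2):
    """Sum the column scores of two equal-length rows; '-' marks a gap."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        total += GAP if "-" in (x, y) else SCORES[x, y]
    return total


print(alignment_score("ACG-T", "ATGCT"))
```

Keeping the scores in a lookup table means that changing the scoring scheme only requires replacing the table, not the alignment code.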
The choice of scoring function can have a great impact on the quality of the final align-
3.2 Dynamic Programming
where $\sigma(x, y)$ gives the score of aligning character x with character y.
Using this recursive equation, we can dynamically build a table and use the values stored in the previous rows/columns to ultimately calculate $V(m + 1, n + 1)$. Using this table, we can trace back and build the actual alignment.
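The table construction and traceback can be sketched as follows. This is a minimal version of the standard Needleman-Wunsch recurrence with a linear gap penalty; the function name, the 0-based table indexing and the default scores are assumptions for illustration and may differ from the exact formulation used in the thesis.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    m, n = len(a), len(b)

    def sigma(x, y):
        return match if x == y else mismatch

    # V[i][j] = best score of aligning a[:i] with b[:j].
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(V[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]),
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)

    # Trace back from the last cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and V[i][j] == V[i - 1][j - 1] + sigma(a[i - 1], b[j - 1])):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return V[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


print(needleman_wunsch("ACGT", "AGT"))
```

Both the time and the space used by this version are proportional to the product of the two sequence lengths, because the whole table is kept for the traceback.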
Based on the Needleman-Wunsch algorithm, the authors in [33] improve the space complexity of the Needleman-Wunsch algorithm by using a new trick in the DP table: instead of keeping the whole table, they only keep the last row and column and therefore they use linear space in the length of the inputs, and there is no change in the running time. If the actual optimal alignment is desired (instead of just the score of the optimal alignment), the running time will increase, but the magnitude will stay the same (i.e. quadratic).
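The score-only part of this idea is easy to demonstrate. The sketch below keeps just two rows of the table, which is a simplification of the approach cited above rather than a reproduction of it; the function name and default scores are assumptions for illustration.

```python
def nw_score_linear_space(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score using O(len(b)) memory."""
    # prev holds the previous table row; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
        prev = cur
    return prev[-1]


print(nw_score_linear_space("ACGT", "AGT"))
```

Recovering the alignment itself in linear space requires an additional divide-and-conquer step, which is why the running time grows by a constant factor while remaining quadratic.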
With a slight change to the formula described above, the Smith-Waterman algorithm for local alignments can be obtained [36]. It has quadratic running time and space complexity.
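The "slight change" is that no cell is allowed to drop below zero and the answer is the largest value anywhere in the table. A minimal sketch, with an assumed function name and assumed default scores:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, (i, j)) where (i, j) is the end cell."""
    m, n = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                    # start a new local alignment
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)
    return best, best_cell


print(smith_waterman("TTTACGTTT", "GGACGGG"))
```

The full table is kept here, matching the quadratic space bound stated above; a traceback from the best cell back to the first zero cell would yield the local alignment itself.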
3.3 Anchor-Based/Hit Methods
3.3.1 FASTA
3.3.2 BLAST
where k is a computable function of S and p is the probability of finding an HSP with score greater or equal to S [?:%I. This result gives meaningful semantics to S: given the size of the database and a scoring system, the result determines what minimum score we need to look for in order to not get random hits.
The general algorithm of BLAST can be summarized as follows:
Step 1:
Find all subsequences of length W, such that their score against the query Q is at least T (< S). W is typically equal to 3-5 and 11-12 for proteins and DNA sequences respectively.
Step 2:
³ National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/
⁴ The online web application can be found in http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
⁵ http://en.wikipedia.org/wiki/BLAST
In order to compute the complexity of BLAST, we have to know "what is the chance h of an S-scoring sequence not having a T-scoring word of size W?". Experimental results show that given T and W, one can find a and b such that $h = e^{-(aS+b)}$. Based on this, let W be the number of words generated for an input query in Step 1 and N be the number of residues in the database. The complexity of BLAST is $O(aW + bN + cNW/20^{w})$. After the introduction of the original BLAST, many different versions aimed at different types of sequences (i.e. amino-acids, proteins, etc.) and for different platforms were developed. Some of these are BLASTN, PBLAST, BLASTX, PSI-BLAST, GAPPED-BLAST, MEGA-BLAST, FSA-BLAST, WU-BLAST, BLAT, etc.
PSI-BLAST (Position-Specific Iterative BLAST) and GAPPED BLAST are introduced in (121. The idea behind GAPPED-BLAST is as follows: the original BLAST finds a single word of length w that scores at least T against the query. But if we find two words of length w and score T that lie on the same diagonal within distance A from each other, then if T satisfies certain criteria, we just extend T, allowing gaps too, to reach T. Using this method, we end up with more hits, but, since we cover more of the string (by connecting T
3.3.3 CHAOS
3.3.5 GLASS
1. For an initial k, find all matching k-mers (k-long words).
3.3.6 MUMmer
MUMmer is a relatively fast global alignment algorithm presented in [13]⁷. It uses suffix trees to find matches between two strings.
MUMmer uses maximal unique exact matches, called MUMs, as anchors. The uniqueness of a match in the two sequences means that there has to be only one copy of the matching
3.3.7 ClustalW
ments⁸. It is more sensitive than the other commonly-used global alignment methods by using the following method⁹:
⁸ It is available at http://www.ebi.ac.uk/clustalw/#
⁹ http://bimas.dcrt.nih.gov/clustalw/clustalw.html
3.3.8 AVID
Once the recursion is completed, AVID aligns the remaining unaligned regions using the Needleman-Wunsch algorithm [34] if they are sufficiently short, and otherwise leaves these regions unaligned. MAVID [6] is a progressive multiple alignment tool that incorporates AVID.
3.3.9 MGA
3.4 Genetic Algorithm
3.5 Hidden Markov Model
problem, then transition and emission probabilities should be decided by training using Forward/Backward algorithms. Then the Viterbi algorithm can be used to align sequences.
One good feature of HMMs is that they can be used to identify whether a sequence belongs to a particular family of sequences (i.e. proteins) [X ) ]. However, this approach is not as popular as other methods, because the topology of the HMM model is highly dependent on the particular problem and the sequences being studied. We also need a large number of sequences in order to train the HMM and find the transition/emission probabilities.
General Algorithms For Anchor-Based Methods
4.1 Building Suffix Tree
Using a naive approach to build a suffix tree on a string S[1..n] takes $O(n^2)$ time and space. We can do that in an iterative way as follows: make the tree by making a root and an edge and label the edge with the longest suffix of S, i.e. S itself. Then take the next suffix by eliminating the first character of the previous suffix and traverse the tree starting from the root. As long as characters are found that match the current suffix on the tree, follow the edges and branches. When a character that does not match the next character on the tree is encountered, create a new branch and an edge and label the edge with the remaining characters of the current suffix. An alternative method using the same idea is to start from the shortest suffix and add longer suffixes in each iteration. Figure 4.1 shows this process. Each suffix of length m can be added to the tree in O(m) time and therefore the total
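The iterative construction described above can be sketched as follows. The class and function names are assumptions for illustration, and edge labels are stored as plain strings for readability, so the slicing adds copying overhead that an index-based implementation would avoid.

```python
class Node:
    def __init__(self):
        # first character of the edge label -> (edge label, child node)
        self.children = {}


def build_suffix_tree_naive(s, terminal="$"):
    """Insert the suffixes of s + terminal one by one, longest first."""
    s += terminal
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while rest:
            first = rest[0]
            if first not in node.children:
                node.children[first] = (rest, Node())  # new leaf edge
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # whole edge matched, go down
            else:
                # Mismatch inside the edge: split it at position k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                node, rest = mid, rest[k:]
    return root


def leaf_paths(node, prefix=""):
    """Return the root-to-leaf strings below node."""
    if not node.children:
        return [prefix]
    paths = []
    for label, child in node.children.values():
        paths.extend(leaf_paths(child, prefix + label))
    return paths


tree = build_suffix_tree_naive("ATTATC")
print(sorted(leaf_paths(tree)))
```

Because of the terminal symbol, every suffix ends at its own leaf, so the tree for a string of length n has exactly n + 1 leaves.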
4.1.1 High Level Ukkonen's Algorithm
Definition An Implicit Suffix Tree on string S is a tree obtained from the suffix tree for S$ by removing every copy of the terminal symbol $ from the edge labels of the tree, then removing any edge that has no label, and then removing any node that does not have at least two children. We denote the implicit suffix tree of the string S[1..i] by $I_i$.
An implicit suffix tree on S includes all the suffixes of S, but some suffixes might not end at a leaf. Figure 4.2(a) shows an example of an implicit suffix tree.
Ukkonen's algorithm is divided into n phases. In phase i + 1, tree $I_{i+1}$ is constructed from $I_i$. Each phase i + 1 is further divided into i + 1 extensions, one for each of the i + 1 suffixes of S[1..i + 1]. In extension j of phase i + 1, the algorithm first finds the end of the path from the root labeled with substring S[j..i]. It then extends the substring by adding the character S[i + 1] to its end, unless S[i + 1] already appears there. $I_1$ is just the single edge labeled by character S[1].
Construct $I_1$
for i from 1 to n - 1 do
    {performing phase i + 1}
    for j from 1 to i + 1 do
        {performing extension j}
        Find the end of the path from the root labeled S[j..i] in the current tree. If needed, extend that path by adding character S[i + 1] to make sure that S[j..i + 1] is in the tree.
    end for
end for
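The phase/extension structure above can be run as is on an uncompressed trie, which keeps the sketch short at the cost of cubic time. The nested-dictionary representation and the function names are assumptions for illustration; this is not the linear-time algorithm, only its high-level skeleton.

```python
def build_implicit_suffix_trie(s):
    """Run the phases and extensions on a trie of nested dicts (0-based)."""
    root = {}
    for i in range(len(s)):          # phase: bring character s[i] into the trie
        for j in range(i + 1):       # extension: handle the suffix s[j..i]
            node = root
            for ch in s[j:i]:        # walk to the end of the path s[j..i-1]
                node = node[ch]
            node.setdefault(s[i], {})  # extend only if s[i] is not there yet
    return root


def find(root, t):
    """Return the node at the end of path t, or None if t is not in the trie."""
    node = root
    for ch in t:
        if ch not in node:
            return None
        node = node[ch]
    return node


trie = build_implicit_suffix_trie("ATTAT")
print(find(trie, "TTA") is not None)   # every substring is a path from the root
print(bool(find(trie, "AT")))          # suffix "AT" does not end at a leaf
```

The walk in each extension always succeeds because S[j..i] was already inserted by the previous phase; the speedup techniques that follow are all about avoiding these repeated walks.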
4.1.2 Speedup Technique, Part 1
i + 1 works.
Now, we can introduce a technique that will reduce the worst case running time of the algorithm to $O(n^2)$.
Algorithm 2 Single Extension Algorithm (SEA)
character g on the edge and quits, making sure that the γ path from s(v) ends on that edge exactly g characters down its label.
4.1.3 Speedup Technique, Part 2
One problem to proceed further and reduce the running time of Ukkonen's algorithm to $O(n)$ is the fact that if we record all the characters on the edges of the tree, the algorithm will require $\Theta(n^2)$ space and therefore $O(n)$ running time will not be achievable. To overcome this problem, instead of recording characters, we label the edges by a pair of indices identifying the start and end indices of the substring on that edge. This way, only two numbers are written on any edge and since the number of edges is at most 2n - 1, the tree will only use $O(n)$ space.
Observation 1: In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all further extensions until the end of that phase. The reason is that, when rule 3 applies, the path labeled S[j..i] in the current tree must continue with character S[i + 1] and so the path labeled S[j + 1..i] does also, and rule 3 again applies in the next extensions. It is also beneficial to observe that a new suffix link needs to be added to the tree only after an extension in which extension rule 2 applies. Now we can state the next trick.
Technique 2: End any phase i + 1 the first time that extension rule 3 applies. If this happens in extension j, then there is no need to explicitly find the end of any string S[k..i] for k > j. We call the extensions in phase i + 1 that are done after the first execution of rule 3, implicit extensions.
Observation 2: If at some point in Ukkonen's algorithm a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm. The reason is that there is no mechanism in the algorithm to extend a leaf edge beyond its current leaf, i.e. when a leaf is labeled j, extension rule 1 will always apply to extension j in any successive phase.
Let $j_i$ denote the last extension in this sequence. Now we can present the last trick.
Technique 3: In phase i + 1, when a leaf edge is first created and would normally be labeled with substring S[p..i + 1], instead of writing indices (p, i + 1) on the edge, write (p, e), where e is a symbol denoting "the current end". Symbol e is a global index that is set to i + 1 once in each phase. In phase i + 1, since the algorithm knows that rule 1 will apply in extensions 1 through $j_i$, at least, it need do no additional explicit work to implement those $j_i$ extensions. Instead, it only does constant work to increment variable e and then does explicit work for (some) extensions starting with extension $j_i + 1$.
Using techniques 2 and 3, explicit extensions in phase i + 1 (using Algorithm 2) are only required
Algorithm 3 Single Phase Algorithm (SPA)
1: Increment index $e$ to $i + 1$ (by Technique 3, this correctly implements all implicit extensions 1 through $j_i$).
2: Explicitly compute successive extensions (using Algorithm 2) starting at $j_i + 1$ until reaching the first extension $j^*$ where rule 3 applies or until all extensions are done in this phase (by Trick 2, this correctly implements all the additional implicit extensions $j^* + 1$ through $i + 1$).
3: To prepare for the next step, set $j_{i+1}$ to $j^* - 1$.
Theorem 2: Using suffix links and tricks 1, 2 and 3, Ukkonen's algorithm builds implicit suffix trees $I_1$ through $I_n$ in $O(n)$ time.
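The "current end" idea of Technique 3 can be illustrated in isolation. The following is a minimal sketch, not an implementation of Ukkonen's algorithm: the class names `GlobalEnd` and `LeafEdge` and the function `begin_phase` are hypothetical, and the sketch only shows that leaf edges which share one end index are all extended by a single constant-time update.

```python
class GlobalEnd:
    """Shared 'current end' index e; every leaf edge refers to the same object."""

    def __init__(self):
        self.value = 0


class LeafEdge:
    """Leaf edge labeled S[start..e], stored as (start, e) instead of (start, i + 1)."""

    def __init__(self, start, end_ref):
        self.start = start      # 1-based start position p
        self.end_ref = end_ref  # reference to the shared index, not a copy

    def label(self, s):
        # Positions are 1-based in the text; Python slices are 0-based.
        return s[self.start - 1:self.end_ref.value]


def begin_phase(end_ref, i_plus_1):
    """Constant-time work at the start of phase i + 1: set e to i + 1."""
    end_ref.value = i_plus_1


s = "xabxa"
e = GlobalEnd()
begin_phase(e, 3)
leaves = [LeafEdge(1, e), LeafEdge(2, e), LeafEdge(3, e)]
labels_phase3 = [leaf.label(s) for leaf in leaves]

# One update of e implicitly extends every existing leaf edge.
begin_phase(e, 4)
labels_phase4 = [leaf.label(s) for leaf in leaves]
```

Storing a reference rather than a copied integer is what makes the implicit extensions free: no leaf edge is visited when the phase changes.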
4.2 Finding Maximum-Weight Anchor Set
Given a set of matches, we would like to select a set of non-crossing matches, also referred to
Based on this definition, a given match $M = (i_1, i_2, l_1, l_2)$, corresponding to a match between $S_1[i_1..i_2]$ and $S_2[l_1..l_2]$, defines points of a rectangle $r$ in the Cartesian plane with the bottom left corner being the point $(i_1, l_1)$ and the top right corner being $(i_2, l_2)$. Define the weight of $r$ to be its area. The problem of finding a set of maximal matches reduces to finding a set of non-crossing rectangles with maximum area coverage. More formally:
Definition Let
based on their $X_{max}$ and we sweep them from right to left. At each stage, when we are processing rectangle $i$, we want to link it to the maximum weight path in the interval $[X_{max}[i], +\infty] \times [Y_{max}[i], +\infty]$ and store it in a set $D$. We use $Next[i]$ to denote the next rectangle in the best path containing rectangle $i$ (the path with maximum weight).
The operations needed to build and manipulate set $D$ are Update and Best. The operation Best($D$, $y$) returns Head($i$) of a rectangle with minimum $Y_{min}[i] \geq y$ and returns 0 if no such rectangle $i$ exists. Update($D$, $i$) updates the set $D$ as follows: it adds the best path starting at rectangle $i$ to $D$, but it preserves the compatibility among members of $D$. In other words, for any two paths $p$ and $q$ such that $p \leq q$ in set $D$, we would like to remove $p$.
Algorithm 4 Finding Maximum Weight Anchors
Now we can formally describe this process in Algorithm 4.
In order to achieve a running time of $O(n \log n)$, we need to efficiently implement set $D$ so that the Best and Update operations take $O(\log n)$ time. That means we need to be able to search, insert, join and split in $O(\log n)$ time. We can accomplish that using a balanced search tree. Since each path $i$ can be uniquely identified by its Head($i$) (and the rest of the path can be constructed using $Next[i]$ pointers), each element in $D$ can be represented by a number, which is the rectangle number that it corresponds to. Each element of $D$ has a TotalWeight and a $Y_{min}$ associated with it. However, since all elements of $D$ are mutually compatible and because in the for loop we scan the elements based on their X-coordinates from right to left, the order of ascending TotalWeight is the same as the order of descending $Y_{min}$. Therefore, although we have two keys associated with each element (i.e. TotalWeight and $Y_{min}$), if we sort the elements in the tree on one key, they will be sorted on the other key in the opposite order.
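The right-to-left sweep can be sketched without the balanced search tree. The following is a simplified quadratic-time illustration, not the $O(n \log n)$ procedure described above: the function names `area` and `best_chain`, the tuple layout, and the strict "above and to the right" compatibility test are assumptions made for this example.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def area(r: Rect) -> int:
    """Weight of a rectangle, taken here as its area."""
    x_min, y_min, x_max, y_max = r
    return (x_max - x_min) * (y_max - y_min)


def best_chain(rects: List[Rect]) -> Tuple[int, List[Rect]]:
    """Return (total weight, chain) of a maximum-weight non-crossing chain."""
    if not rects:
        return 0, []
    # Sweep from right to left: rectangles with larger x_max are processed first.
    order = sorted(range(len(rects)), key=lambda i: rects[i][2], reverse=True)
    total = {}  # total[i]: weight of the best path starting at rectangle i
    nxt = {}    # nxt[i]: next rectangle on that path (the role of Next[i])
    for pos, i in enumerate(order):
        total[i] = area(rects[i])
        nxt[i] = None
        for j in order[:pos]:
            # j is compatible if it lies strictly above and to the right of i.
            if rects[j][0] > rects[i][2] and rects[j][1] > rects[i][3]:
                if area(rects[i]) + total[j] > total[i]:
                    total[i] = area(rects[i]) + total[j]
                    nxt[i] = j
    head = max(order, key=lambda i: total[i])
    chain = []
    cur = head
    while cur is not None:
        chain.append(rects[cur])
        cur = nxt[cur]
    return total[head], chain
```

Replacing the inner loop by a Best query on a search tree ordered by $Y_{min}$ is what would bring the cost per rectangle down to $O(\log n)$.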
GAUNA
In this chapter we will describe GAUNA (Global Alignment Using Non-exact Anchors)
( l c l o j l i t S i I S J i i v i I l i e I l i o l ~ l ~ l l i s l ~ i t . I . 141) i l . t ~ t l sliow its
performance compared to other state-of-the-art algorithms. In order to do this, we will first give a high-level overview of how GAUNA works and then we will explain every part of the algorithm in more detail. GAUNA is the basis for IGAUNA and therefore understanding
5.1 Method description
GAUNA is a recursive algorithm based on the following three main steps:
Algorithm 5 GAUNA High Level
5.2 Finding Maximal Inexact Matches
Given two sequences $S_1$ and $S_2$, a quadruple $(i_1, i_2, l_1, l_2)$ is called a match if the optimal alignment score of the two subsequences $S_1[i_1, i_1 + l_1 - 1]$ and $S_2[i_2, i_2 + l_2 - 1]$ is greater than or equal to a certain threshold. Note that the notation $S[i, j]$ denotes the subsequence of the sequence $S$ starting at position $i$ and ending at position $j$. If $S_1[i_1, i_1 + l_1 - 1] = S_2[i_2, i_2 + l_2 - 1]$ the match is called an exact match, otherwise it is called an inexact match.
Following the definition of Delcher et al., a match $(i_1, i_2, l_1, l_2)$ is called maximal if it cannot be extended at either endpoint [13]. For inexact anchors we generalize this definition as follows: an inexact match $(i_1, i_2, l_1, l_2)$ is maximal if there is no other match $(i'_1, i'_2, l'_1, l'_2)$ such that $S_1[i_1, i_1 + l_1 - 1]$ is a proper subsequence of $S_1[i'_1, i'_1 + l'_1 - 1]$ and $S_2[i_2, i_2 + l_2 - 1]$ is a proper subsequence of $S_2[i'_2, i'_2 + l'_2 - 1]$. We will only consider inexact matches for which $l_1 = l_2$ and therefore our matches will be represented by a triple $(i_1, i_2, l)$.
As in other anchor-based methods, GAUNA uses suffix trees to find anchors. In Section 4.1 we have described in detail how to build a suffix tree in linear time and space. For a sequence $S$, the salient feature we need of a suffix tree for $S$ is that the concatenation of edge-labels on the path from the root to an internal node is a repeated subsequence in $S$, where the number of repeats corresponding to an internal node is equal to the number of leaves of the subtree rooted at that internal node.
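The correspondence between repeats and leaves can be checked without building a tree. The sketch below is only an illustration under that observation: each leaf below the location reached by a path label stands for one suffix of $S$ that begins with that label, so listing those suffixes lists the occurrences. The function name `occurrences` is hypothetical and the scan is a naive stand-in for a subtree traversal.

```python
def occurrences(s, label):
    """0-based start positions of `label` in `s`.

    Each position k returned here corresponds to the suffix s[k:], which is
    one leaf below the location spelled by `label` in a suffix tree of s.
    """
    return [k for k in range(len(s)) if s.startswith(label, k)]


# "ana" is a repeat in "banana": two suffixes ("anana" and "ana") start with it.
positions = occurrences("banana", "ana")
```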
Let $S_1$ and $S_2$ be the two input sequences to our algorithm, of length $m$ and $n$, respectively. We build a suffix tree for $S_1$ and then search for subsequences of $S_2$ over this suffix tree. We wish to find all maximal matches of $S_1$ and $S_2$ that have alignment score above a threshold $s$. The naive method is to find, for each subsequence $S'_2$ of $S_2$, those paths of the suffix tree starting at the root whose label, matched with $S'_2$, has an alignment score greater than $s$. Notice there can be a large number of such paths and even the most efficient algorithms known for this problem have very high running time and space requirements, making them impractical [30].
To overcome this problem, we must consider the structure of input sequences. When the input sequences are very similar, there can be up to $O(mn)$ maximal matches (when we have short matches in one sequence repeated many times in the other sequence), while the number of the anchors is at most $O(\min\{m, n\})$ (since anchors are non-crossing matches, the number of anchors is at most equal to the number of characters in the shortest sequence). In this case most of the maximal matches are discarded during the anchor selection process. Thus it suffices to find a small number of maximal matches; we choose those with highest score. On the other hand, if the input sequences are dissimilar, the number of maximal matches is usually close to the number of final anchors, in which case our algorithm will find most of the maximal matches.
Define the similarity value between two subsequences $X$ and $Y$, $S(X, Y)$, as the score of the optimal alignment divided by the length of the subsequence (we only consider matches with subsequences of the same length).
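The similarity value can be made concrete with a small sketch. The scoring scheme below (match +1, mismatch -1, gap -1) and the function names `nw_score` and `similarity` are assumptions chosen for illustration; the text does not fix a scoring matrix at this point.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Score of an optimal global (Needleman-Wunsch) alignment of x and y."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def similarity(x, y, **scoring):
    """Optimal alignment score divided by the common length of x and y."""
    if len(x) != len(y) or not x:
        raise ValueError("subsequences must be non-empty and of equal length")
    return nw_score(x, y, **scoring) / len(x)
```

With this scoring, identical subsequences have similarity 1.0, and a single substitution in a subsequence of length 4 lowers the value to 0.5.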
Our suffix tree search algorithm works as follows. Let $T$ be the suffix tree of $S_1$. A location in $T$ is either a node of $T$ or a point on an edge of $T$ that splits the label of the edge into two subsequences. For each suffix $S_2^r$ of $S_2$, we find all locations $p$ in $T$ such that the label of the path from the root of $T$ to $p$ has a high similarity value with some prefix of $S_2^r$. Algorithm 6 depicts our suffix tree search method for a suffix $S_2^r$ of $S_2$.
Let $S_2^r = S_2[r, n]$ be a suffix of $S_2$ and let $P$ be the set of locations in $T$ returned by Algorithm 6 for $S_2^r$. We now find the set of inexact matches between the prefixes of $S_2^r$ and subsequences of $S_1$. For each location $p_i$ in $P$, let $Y_i$ be the label of the path from the root of $T$ to $p_i$. Let $R_i$ be the set of occurrences of $Y_i$ in $S_1$. Notice that occurrences of $Y_i$ in $S_1$ correspond to the labels of the paths from the root through $p_i$ to a leaf of the subtree rooted at $p_i$. Therefore $R_i$ can be computed efficiently by traversing the subtree of $T$ rooted at $p_i$. Once the set $R_i$ is computed, for each subsequence $S_1[r_j, r_j + |Y_i| - 1]$ in $R_i$ the algorithm outputs $(r_j, r, |Y_i|)$ as an inexact match (see Figure 5.2).
This procedure significantly reduces search time by pruning the search space. However, computing the optimal alignment between $Y_i$ and $S_2^r$ using Needleman-Wunsch requires quadratic time and is too slow for our purposes. To overcome this, the dynamic programming is limited to a band of width $d$ around the main diagonal (Figure 5.1), thus reducing running time to $O(d|Y_i|)$. Notice that anchors found this way do not contain long sequences of indels; long sequences of indels in the conserved regions will only be detected when we close gaps between the anchors. Algorithm 7 describes the maximal match-finding method.
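Restricting the dynamic programming to a band can be sketched as follows. This is a minimal illustration, assuming a unit scoring scheme (match +1, mismatch -1, gap -1) and the hypothetical function name `banded_score`; only cells with $|i - j| \leq d$ are filled, so about $(2d + 1)$ cells are computed per row.

```python
NEG_INF = float("-inf")


def banded_score(x, y, d, match=1, mismatch=-1, gap=-1):
    """Best global alignment score using only DP cells (i, j) with |i - j| <= d."""
    n, m = len(x), len(y)
    if abs(n - m) > d:
        return NEG_INF  # the final cell (n, m) lies outside the band
    # Row 0 of the table, restricted to the band.
    prev = {j: j * gap for j in range(0, min(m, d) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - d), min(m, i + d) + 1):
            best = NEG_INF
            if j > 0 and (j - 1) in prev:  # diagonal move: (mis)match
                step = match if x[i - 1] == y[j - 1] else mismatch
                best = prev[j - 1] + step
            if j in prev:                  # vertical move: gap in y
                best = max(best, prev[j] + gap)
            if (j - 1) in cur:             # horizontal move: gap in x
                best = max(best, cur[j - 1] + gap)
            cur[j] = best
        prev = cur
    return prev[m]
```

Because a path can never leave the band, any alignment found this way has no run of indels that shifts it more than $d$ positions off the main diagonal, which matches the remark above about long indels.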
After finding inexact matches, the algorithm next identifies maximal matches. To do so, we sort the matches with respect to their locations in one of the sequences, detect the non-maximal matches, and remove them from the set of matches.
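The sort-and-filter step can be sketched for matches stored as triples $(i_1, i_2, l)$. The function names `properly_inside` and `remove_non_maximal` are hypothetical, and the containment test is a direct reading of the generalized definition of maximality given earlier (proper containment in both sequences); it is an illustration rather than the thesis implementation.

```python
def properly_inside(inner_start, inner_len, outer_start, outer_len):
    """True if [inner_start, inner_start + inner_len) is a proper sub-interval."""
    return (outer_start <= inner_start
            and inner_start + inner_len <= outer_start + outer_len
            and inner_len < outer_len)


def remove_non_maximal(matches):
    """Keep only matches not properly contained in another match in both sequences."""
    # Sort by position in the first sequence, longer matches first, so that any
    # match containing the current one has already been examined.
    ordered = sorted(set(matches), key=lambda m: (m[0], -m[2], m[1]))
    maximal = []
    for i1, i2, length in ordered:
        dominated = any(
            properly_inside(i1, length, j1, other_len)
            and properly_inside(i2, length, j2, other_len)
            for j1, j2, other_len in maximal
        )
        if not dominated:
            maximal.append((i1, i2, length))
    return maximal
```

Comparing only against matches already kept is sufficient because proper containment is transitive: a match contained in a discarded match is also contained in the kept match that caused the discard.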
Algorithm 6 Searching On The Suffix Tree
Figure 5.1: To make the DP routine faster, only the area shown will be computed. This results in shorter sequences of indels in the matches.
Algorithm 7 Finding Maximal Matches
5.3 Selecting Anchors
5.3.1 Finding Largest Total Weight Non-crossing Anchors
Figure 5.3: A set of anchors is depicted. Rectangles represent the maximal matches and a set of good anchors is depicted in white rectangles connected by dashed lines.
5.4 Closing The Gaps
5.5 GAUNA Parameters
d: Width of the band around the main diagonal of the dynamic programming table.
To align the subsequences of $S_2$ with the path labels of the suffix tree, the suffix tree search algorithm restricts the dynamic programming table to a band of width $d$ around the main diagonal.
GAUNA Results
6.1 Exact vs. Inexact Anchors, Specificity Evaluation
6.2 GAUNA Parameter Settings and Results
Table 6.1: GAUNA Specificity
Sequences Anchors Specificity
co~~si ( l (wxl in 0111' I ( % L prorc!ss ar(: i ~ \ ~ i l i l ; ~ l ) l ( ! t ~ t . ~1f,l,~~://\\~\\:\\~.~>i111~.111;1t,~l.(:i1/gil1111~/.
For comparison purposes all programs were run on a Linux machine with a 3.4 GHz Intel(R) Xeon(TM) processor and 2 GB of RAM.
We note that for the alignment of human and chimp, GAUNA chose exact anchors as these covered more than 50% of the sequences. In the other cases the parameter settings were: $K = \{25, 10, 7\}$, $E = 1500$, $s = 0.8$, and $d = 7$ (a description of each parameter can be found in Section 5.5).
Our comparisons of GAUNA, LAGAN, AVID, MUMmer, and MGA are summarized in Table 6.2. We tested the other tools (where possible) to find the parameters that maximize their performance.
To measure the quality of the alignments, we find the alignment regions that have a high alignment score and cover more than 10% of an exon. The total length of such regions determines the quality of an alignment and is shown under the Coverage column in Table 6.2.
To handle regions of different length, we define the normalized score for an alignment region to be $\frac{s}{b \cdot P}$, where $s$ is the score of the alignment, $b$ is the length of the alignment, and $P$ is the maximum value in the scoring matrix (note that the normalized score is always less than 1). A region that has a normalized score above 0.8 is considered as a high alignment score
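As a worked illustration of the normalization, the sketch below assumes the normalized score is the alignment score divided by the product of the alignment length and the maximum scoring-matrix value, which is the reading suggested by the definitions of $s$, $b$ and $P$. The function names and the example numbers are hypothetical.

```python
def normalized_score(s, b, p):
    """Alignment score s divided by (alignment length b * maximum matrix value p)."""
    if b <= 0 or p <= 0:
        raise ValueError("length and maximum matrix value must be positive")
    return s / (b * p)


def is_high_scoring(s, b, p, threshold=0.8):
    """True if the region's normalized score exceeds the threshold (0.8 in the text)."""
    return normalized_score(s, b, p) > threshold
```

For example, a region of length 100 with score 450 under a matrix whose largest entry is 5 has normalized score 450 / 500 = 0.9 and would count as high scoring, while a score of 300 gives 0.6 and would not.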
The results of global alignments for different programs (GAUNA, LAGAN, AVID, MUMmer, MGA). Time is in seconds and memory is in megabytes.
7.1 Measuring An Alignment
as we know not efficient) to apply, for the case of biological sequences, it does not provide us with much information about the alignment. So we need more sophisticated ways of
defined a conserved region to be an alignment region that has a high alignment score and covers more than 10% of an exon. The reason why exons are important is that from the
found that do not overlap with any exon, we think they might still be of some importance.
The biological motivation behind the importance of conserved regions (and therefore exons) is that if a region has changed little over time, it must be resistant to mutations and so there is a good chance that it has been of some importance for the life of the species. Therefore, in IGAUNA, besides looking at the conserved regions overlapping exons, we also look at the overall coverage of conserved regions and our measurements are based on both these factors.
7.2 Improvements to GAUNA
7.2.2 Branching
1: exonList = GeneScan($S_2$) {finding the potential set of exons in $S_2$}
2: IGAUNA(sequence $S_1$, sequence $S_2$, array $K[1 \ldots l]$, index $i$, matches $M$, node suffixTreeRoot, exons exonList)
   {$K[1 \ldots l]$: a list of length thresholds for anchors in different recursions: $K[1] > K[2] > \ldots > K[l]$}
   {$i$: the current index for $K$}
3: if $|S_1||S_2|$ is sufficiently small then
4:    {exonList is the set of potential exons}
5:    Align $S_1$ and $S_2$ using the Needleman-Wunsch algorithm.
6:    return
7: end if
8: if $i > l$ then
9:    return {Leave $S_1$ and $S_2$ unaligned}
10: end if
11: Call FindMaximal($S_1$, $S_2$, $K[i]$, $M$, suffixTreeRoot, exonList).
12: Adjust match weights of matches in $M$ based on exons in exonList and the parameter AMPLIFYING_RATIO
13: Select a subset of anchors with maximum total weight and put them in the final alignment of $S_1$ and $S_2$.
14: for each pair of inter-anchor sequences, $S'_1$ and $S'_2$ do
15:    Call IGAUNA($S'_1$, $S'_2$, $K$, $i + 1$, $M$, suffixTreeRoot, exonList) to align $S'_1$ and $S'_2$.
16: end for
17: return
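The recursion in steps 14 to 16 operates on the stretches of sequence left between consecutive selected anchors. A minimal sketch of how those inter-anchor pairs could be enumerated is given below; the function name `inter_anchor_regions`, the 0-based half-open interval convention, and the inclusion of the regions before the first and after the last anchor are assumptions made for this example.

```python
def inter_anchor_regions(len1, len2, anchors):
    """Pairs of half-open intervals lying between consecutive anchors.

    `anchors` holds triples (i1, i2, l) with 0-based start positions in the
    two sequences; the anchors must be non-crossing and non-overlapping.
    """
    regions = []
    pos1 = pos2 = 0
    for i1, i2, length in sorted(anchors):
        if i1 < pos1 or i2 < pos2:
            raise ValueError("anchors cross or overlap")
        regions.append(((pos1, i1), (pos2, i2)))
        pos1, pos2 = i1 + length, i2 + length
    regions.append(((pos1, len1), (pos2, len2)))
    return regions
```

Each returned pair would then be aligned by a recursive call with the next, smaller threshold $K[i + 1]$, or directly by Needleman-Wunsch once the pair is small enough.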
As discussed in Section 5.2, the main parameters used in GAUNA (which are kept the same in IGAUNA as well) are K-values, similarity threshold $s$, and the diagonal width $d$ in the DP table.
The inter-anchor length threshold $E$ is a threshold that determines when to use the Needleman-Wunsch algorithm in a region instead of finding matches. It is set such that the $O(n^2)$
1: FindMaximalMatches(sequence $S_1$, sequence $S_2$, int $k$, set $M$, node suffixTreeRoot, exons exonList) {$M$ will hold matches}
2: $P = \emptyset$
3: for all suffixes $S'_2$ of $S_2$ do
4:    if $S'_2$ has an overlap with an exon in exonList then
5:       Call SearchTreeBranching(suffixTreeRoot, $S'_2$, $P$, 0, $k$, exonList)
6:    else
7:       Call SearchTree(suffixTreeRoot, $S'_2$, $P$, 0, $k$)
8:    end if
9:    for all $p_i \in P$ do
10:      Find occurrences of $p_i$ in $S_1$ and add them to $M$
11:   end for
12: end for
13: Remove non-maximal matches
matches and helps us control the running time by limiting the space that DP covers. As a side-effect, it also limits the alignment to be of a special type that does not allow more than $d$ consecutive gaps in the matches. This is acceptable because for matches containing long consecutive gap intervals, we can view these as two separate matches and identify these independently.
Based on our experiments, K-values greatly affect the speed and quality of the alignment and can make a major difference in the quality of the solution. We performed extensive testing to determine the quality of the alignment based on different sets of K-values. Intuitively, the more levels there are, the longer IGAUNA takes to run. The bigger the K-value, the longer the matches we find should be, and therefore if we choose too large a K-value, we might find just a few matches that do not cover much of the two sequences and
CHAPTER 7. IGAUNA
7.3 Optimal Alignment
alignments.
The remaining regions between exons are aligned using regular IGAUNA routines, the way we have described them in the previous sections (as opposed to using BLAST). The final result is an alignment with emphasis on the alignment of exon regions and, as we will see, has a higher score when we consider conserved regions overlapping exons.
The advantage of having this so-called optimal alignment is that it gives us an estimate of how much a regular alignment has the potential to improve in terms of exon alignment. If the optimal alignment has a score much higher than a given alignment, we know that there is still much room for improvement in the tool that produced the alignment. However, if the given alignment scores very closely to the optimal alignment, we know that the quality of the alignment cannot be improved much. In this case, our focus will be on improving the speed of the tool and how much space it uses. So basically, if two different methods produce two alignments with scores very close to the optimal alignment, then the one which produces the alignment faster and with less memory has the advantage over the other one.
7.4 IGAUNA Parameters
Using different parameters gives IGAUNA the flexibility to be used for different cases easily. By properly setting these parameters, IGAUNA can actually run exactly like GAUNA or it can be executed to find the optimal alignment. Besides the ones we have already described
IGAUNA Results and Conclusion
8.1 Experimental Settings
8.2 Parameter Settings
considered the following K-value sets: {20, 7}, {20, 10}, {25, 10}, {25, 7}, {30, 15}, {20, 10, 7}, {25, 10, 7}, {35, 10, 7}, {35, 20, 7}, {50, 30, 12}, {40, 20, 10, 7}, {45, 25, 10, 7}, {50, 30, 10, 7}. With every change in the set of K-values, IGAUNA's performance consistently changed in all the species, therefore to show the changes, we will only show the results for one pair of species (Human-Dog alignment).
As can be seen from Table 8.1, changing K-value sets (within reasonable values) does not affect sensitivity and specificity significantly. However, the running time can jump at some points. The same pattern of jumps in time applies to other species as well, but with varying intensities, from 1.5 to 4 times increase in time.
Looking at Table 8.1 reveals that since the running time does not change much (except for the jump points), to get the maximum sensitivity and specificity, we should choose the K-value set {25, 10, 7}.
8.3 Alignment Results
8.3.1 Memory Usage
Program TCL ECL NEC TCLE Time (s) Mem (MB)
the sequences and feeding them to GeneScan, we can solve that problem. Figure 8.1 shows
In terms of speed, the overhead that GeneScan causes is the biggest factor in reducing IGAUNA's speed compared to GAUNA. However, it is possible to extract the exons by
IGAUNA allow this conveniently). Branching has an influence on the speed as well, but
parameters. Overall, although branching does reduce the speed, it does not reduce it much.
Table 8.6: Human Mouse Alignment Results
Program   TCL      ECL    NEC   TCLE   Time (s)   Mem (MB)
Optimal   110875   5957   63    7568   97         91
IGAUNA    141853   5928   6     7431   130        90
GAUNA     126716   5922   62    7451   52         85
LAGAN     123425   5862   61    7351   141        205
AVID      2        5777   60    6081   60         498

Program   TCL      ECL     NEC   TCLE    Time (s)   Mem (MB)
Optimal   381350   11237   60    17505   394        2%
IGAUNA    405007   10891   5     16991   410        280
GAUNA     997121   10865   55    16554   382        271
LAGAN     365457   104j9   56    16284   659        761
AVID      231496   1082    7     1929    238        1'307
8.3.3 Quality Of Alignments
In order to better compare IGAUNA results with the other tools mentioned, it will be helpful to consider the graphs shown in Figures 8.2 to 8.5.
As we can see in Figures 8.2 and 8.3, IGAUNA performs quite well when it comes to exon coverage. When considering exon coverage (only considering exons that have been covered more than 50%) and also total exon coverage (considering all the conserved regions that fall inside exons), IGAUNA performs better than all the other tools; well, except for
[Figure: Memory Usage, by species pair]
Figure 8.2: Exon Coverage Length for Mouse-Dog and Mouse-Chicken
Figure 8.3: Exon Coverage Length for Human-Rat, Human-Mouse, Human-Chicken, Human-Dog
[Figure: Total Exon Coverage for Human-Rat, Human-Mouse, Human-Chicken, Human-Dog]
the Human-Dog case. We consider this case in detail:
Surprisingly, LAGAN scored even higher than our optimal alignment, and we investigated explanations for this "incident". After aligning the Human-Dog sequences, we observed that LAGAN is not using the complete sequences in its final alignment: we extracted the original sequences from the aligned sequences and found that the length of the sequences
usccl by LAGAX \wc?rt? (j/lTjllK3 for soclt~c:~~c:c: OIL(: (as oppostxl to &l!)GI)OO of tlio u~,igitiid) ii l l (1
5033G45 for scqllc:nc:o t,wo (;IS ol)posc:tl t o 6424515 of t . 1 ~ origitml). 111 chssctlcc, LAGAN
is "throwing away" parts of the sequence that could not be aligned properly (i.e. 1% of the first sequence and 16% of the second sequence). This results in a higher density of conserved regions (either in total or just in the exons, depending on where the thrown-away segments have been) and therefore LAGAN will score high when measuring the conserved regions. That explains the huge difference in our Exon-Coverage-Length and Total-Conserved-Length-in-Exon estimates between LAGAN and even the optimal align-
8.3.4 IGAUNA Improvements Compared To GAUNA
to see how IGAUNA has improved, we base our comparisons on GAUNA, and comparing to the optimal alignment, we see how much room there is for improvement. Then we see how
Total Exon Coverage Length Improvement (Compared to GAUNA)
We can see that IGAUNA has improved about 50% compared to GAUNA, meaning that it has covered about 50% of the potentially coverable regions not previously covered by GAUNA. The only exception is the Mouse-Chicken alignment. Although IGAUNA has improved in this case as well, it seems that there is still much more room for improvement (and all the other tools are falling behind in this case as well).
In terms of total conserved regions: as mentioned earlier, we cannot use the optimal alignment for comparison, because it biases exons too much and that might interfere with the alignment of other potentially good regions. Therefore we compare IGAUNA only to GAUNA, and as we can see from Figures 8.7 and 8.8 and the tables in Section 8.3, IGAUNA has improved from 5 to 12 percent. The only exception is the Human-Rat alignment, where we see only about 3% improvement (which is the longest sequence in our test set).
[Figure: Total Conserved Region for Mouse-Dog, Mouse-Chicken, Human-Rat, Human-Mouse, Human-Chicken, Human-Dog]
8.3.5 Summary of Results
8.4 Conclusion and Future Work
[...] it is efficient both in time and space. Depending on whether speed or quality needs to be
optimized, IGAUNA's parameters can be flexibly set to suit each individual case. However,
there is still much room to incorporate more biological heuristics into it, in order to
achieve better results. In the same way that GAUNA is capable of performing multiple
sequence alignment, we can further extend the capabilities of IGAUNA to be used on multiple [...]
room for refining the definition and usage of an optimal alignment, which requires a better
understanding of biological sequences and scoring methods. There is also a potential to
incorporate parallelism into IGAUNA by dividing the query string and feeding each part
into a separate processor. We are also working on developing a user-friendly interface for
IGAUNA and on making it available as a stand-alone program and also as a web application.
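
The parallelism idea mentioned above could be sketched as follows. This is a minimal illustration under stated assumptions, not IGAUNA's implementation: `split_query`, `align_chunk` and `parallel_align` are hypothetical names, and `align_chunk` is only a placeholder that looks for an exact occurrence of the chunk in the database string, standing in for the real anchor search and alignment.

```python
from concurrent.futures import Executor, ProcessPoolExecutor


def split_query(query: str, parts: int) -> list[tuple[int, str]]:
    """Cut the query into at most `parts` near-equal pieces, keeping each offset."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    size = max(1, -(-len(query) // parts))  # ceiling division, never zero
    return [(start, query[start:start + size])
            for start in range(0, len(query), size)]


def align_chunk(job: tuple[int, str, str]) -> tuple[int, int]:
    """Placeholder worker: position of the chunk in the database, or -1."""
    offset, chunk, database = job
    return offset, database.find(chunk)


def parallel_align(query: str, database: str, parts: int,
                   executor: Executor) -> list[tuple[int, int]]:
    """Feed each piece of the query to a separate worker and collect the results."""
    jobs = [(offset, chunk, database) for offset, chunk in split_query(query, parts)]
    return list(executor.map(align_chunk, jobs))


if __name__ == "__main__":
    # One worker process per piece of the query.
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(parallel_align("ACGTTGCA", "TTACGTTGCAGG", 2, pool))
```

A practical version would also have to handle alignments that span a cut point, for example by overlapping adjacent pieces or by stitching the partial results together afterwards; the sketch leaves this step out.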