IGAUNA (IMPROVED GLOBAL SEQUENCE ALIGNMENT USING NON-EXACT ANCHORS)

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

IGAUNA (IMPROVED GLOBAL SEQUENCE ALIGNMENT USING …summit.sfu.ca/system/files/iritems1/2542/etd2903.pdf · APPROVAL Name: IlIwsoutl Hi~ri~ ti Dcgrce: MASTER OF SCIENCE Titlc of thesis:



APPROVAL

Name: IlIwsoutl H i ~ r i ~ ti

Degree: MASTER OF SCIENCE

Title of thesis: IGAUNA (Improved Global Sequence Alignment Using Non-Exact Anchors)

Datc Approved: l o ~ + I ~ / Z ~ C - F


SIMON FRASER UNIVERSITY LIBRARY

DECLARATION OF PARTIAL COPYRIGHT LICENCE

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection (currently available to the public at the "Institutional Repository" link of the SFU Library website <www.lib.sfu.ca> at: <http://ir.lib.sfu.ca/handle/1892/112>) and, without changing the content, to translate the thesis/project or extended essays, if technically possible, to any medium or format for the purpose of preservation of the digital work.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.

It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

Simon Fraser University Library Burnaby, BC, Canada

Revised: Spring 2007


Abstract


Contents

Approval
Abstract
Quotation
Contents
List of Tables
List of Figures

1 Introduction
1.1 Motivations
1.2 Our Contributions
1.3 Thesis Organization

2 Background
2.1 Biological Background
2.2 Biological Terms
2.3 Computer Science Background


3 Previous Work On Global Sequence Alignment
3.1 Scoring Methods
3.2 Dynamic Programming
3.3 Anchor-Based/Hit Methods
3.3.1 FASTA
3.3.2 BLAST
3.3.3 CHAOS
3.3.4 LAGAN
3.3.5 GLASS
3.3.6 MUMmer
3.3.7 ClustalW
3.3.8 AVID
3.3.9 MGA
3.4 Genetic Algorithm
3.5 Hidden Markov Model

4 General Algorithms For Anchor-Based Methods
4.1 Building Suffix Tree
4.1.1 High Level Ukkonen's Algorithm
4.1.2 Speedup Technique, Part 1
4.1.3 Speedup Technique, Part 2
4.2 Finding Maximum-Weight Anchor Set

5 GAUNA
5.1 Method description
5.2 Finding Maximal Inexact Matches
5.3 Selecting Anchors
5.3.1 Finding Largest Total Weight Non-crossing Anchors
5.4 Closing The Gaps
5.5 GAUNA Parameters


7 IGAUNA
7.1 Measuring An Alignment
7.2 Improvements to GAUNA
7.2.1 Exon Weight Adjustment
7.2.2 Branching
7.2.3 Parameter Optimization
7.3 Optimal Alignment
7.4 IGAUNA Parameters

8 IGAUNA Results and Conclusion
8.1 Experimental Settings
8.2 Parameter Settings
8.2.1 K-value Sets
8.3 Alignment Results
8.3.1 Memory Usage
8.3.2 Speed
8.3.3 Quality Of Alignments
8.3.4 IGAUNA Improvements Compared To GAUNA
8.3.5 Summary of results
8.4 Conclusion and Future Work

Bibliography


List of Tables

6.1 GAUNA Specificity
6.2 GAUNA Global Alignment Results
8.1 K-value Effect
8.2 Mouse Dog Alignment Results
8.3 Mouse Chicken Alignment Results
8.4 Human Dog Alignment Results
8.5 Human Chicken Alignment Results
8.6 Human Mouse Alignment Results
8.7 Human Rat Alignment Results


List of Figures


Introduction

After the discovery of the double-helix structure of DNA in 1953 [5], our knowledge of organisms and their building structures has vastly grown. This has in turn resulted in the creation of new branches of science dedicated to studying the main building blocks of life. Molecular biology, genetics and genomics are such new fields branching from biology, and with the ever growing involvement of mathematics and computer science in these fields, disciplines such as computational biology have been created. Genetics is the area of biological study concerned with heredity and with the variations between organisms that result from it. Genomics is a recent scientific discipline with the aim of defining and characterizing the complete genetic makeup of an organism.

The inherent mathematical structure of DNA and the algorithmic processes used to express proteins have led to a close collaboration between molecular biology, computer science, and mathematics. As a result, computational biology has been created, which is an interdisciplinary field that applies the techniques of computer science and applied mathematics to problems inspired by biology. As a discipline, computational biology is a relatively new field but there has been a virtual explosion of work in universities, government research labs and the private sector.

This field is relatively young, and research in it mainly started after the establishment of the Human Genome Project (HGP) in 1990. In 1994 the U.S. Department of Energy (DOE) established the Microbial Genome Program (MGP) as a companion to HGP and since then many new discoveries have been made in this field. Although from the computer science/mathematical point of view, a few algorithms with respect to sequence


1.1 Motivations

of living organisms, can help us understand more about diseases and design better drugs. Therefore we need sequence alignment methods that give us high quality alignments to be used to extract biologically valuable information about living organisms.

With the ever-growing biological databases such as NCBI (National Center for Biotechnology Information)¹ and PDB (Protein Data Bank)², there is a huge amount of processed and unprocessed data available to scientists. Sometimes to find a good starting point to

¹ http://www.ncbi.nlm.nih.gov/

² http://www.umass.edu/microbio/rasmol/pdblite.htm


1.2 Our Contributions

on a typical desktop/laptop computer and their execution time is usually too long to be used frequently. The other ones, which require less time to execute, produce less reliable results with lower, and sometimes not acceptable, qualities.

We introduce IGAUNA, a new algorithm and program to find global pairwise alignments with very high quality results and in a very efficient manner, even executable on a typical laptop for large sequences.

We also introduce a new way of measuring the quality of sequences and introduce a so-called optimal global alignment between two sequences which can be used to measure the quality of a given alignment.

1.3 Thesis Organization


Background


2.1 Biological Background

obtained from a theoretical point of view should be exhaustively tested in the lab to make sure it does not have any major side effects.

Having the above points in mind, the reasons why one should be familiar with the biolog-

different dosage and drugs are prescribed for different people according to their genome, i.e. the whole hereditary information of the organism encoded in DNA, which makes their responses to therapy dissimilar; identification of drug targets, which are proteins whose functions can be modified selectively and help to cure a disease; and last but not least, identification of missing or defective genes and replacement or supplying of its products


2.2 Biological Terms


The term exon was coined by the American biochemist Walter Gilbert in 1978³. The exons are typically multiples of three nucleotides (every triplet of bases, called a codon, is translated into a certain amino acid [26]). But not all the information inside the DNA is expressed as proteins or RNA; some regions of the DNA sequence are devoted to control mechanisms.

³ http://en.wikipedia.org/wiki/Talk:Exon


2.3 Computer Science Background

approaches to solve this problem.

Exact string matching and sequence alignment are relatively old topics of computer science, but their recent extensive applications in bioinformatics have resulted in renewed attention to these problems.


2.4 Computer Science Terms


When building a suffix tree for a string, sometimes a suffix of the string can be part of a longer suffix and therefore its end position might be on an edge or an internal node of the tree. To ensure that each suffix actually ends at a leaf, a unique character which is not part of the alphabet is added to the end of the string. This character is usually denoted by $ and is called the terminal symbol. Hence, the suffix tree is actually built on S$. Figure 2.1 shows a suffix tree built on the sequence ATTATC.
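The role of the terminal symbol can be checked directly on small strings. The following is a minimal sketch, not part of IGAUNA; the helper names `suffixes` and `hidden_suffixes` are hypothetical and chosen only for illustration. It lists the suffixes that are a prefix of some longer suffix, i.e. the ones that would end on an edge or an internal node instead of at a leaf.

```python
def suffixes(s):
    """Return all non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def hidden_suffixes(s):
    """Return the suffixes of s that are a prefix of another suffix.

    Such a suffix would not end at a leaf of the suffix tree.
    """
    all_suffixes = suffixes(s)
    return [u for u in all_suffixes
            if any(v != u and v.startswith(u) for v in all_suffixes)]


if __name__ == "__main__":
    # Without the terminal symbol, "AT" and "T" are hidden inside longer suffixes.
    print(hidden_suffixes("ATTAT"))
    # With "$" appended, every suffix is distinguishable and ends at a leaf.
    print(hidden_suffixes("ATTAT" + "$"))
```

The check is quadratic in the number of suffixes and is only meant for experimenting with short strings.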


r ; ~ ~ ~ ( l o i ~ i l y g o i ~ c r a l ~ ( l c ;~~ i ( l i ( l ;~ l ,w , of' ( :o l i rs~ , iimsl, \vill 1101 get, i l r ( ! ;~so~~i~l) l ( ! s(:orc, :11i(1 thy

will IJC tlclrt,etl. IIowcvcr! ;I fow of tllc inst.i~nc:c:s rr~igl~t. gct ;I roaso~~nl)lc scoro (sl~ow nc:l.ivit,y)

a t d I.h(:sc citll I)c 11sctl Iowil~.tl Curt.hor solvil~g Lh(: p ~ ~ l ) l c ~ ~ i . T~ICSC c~111t1id;~tc:s arc Itc,pL i l11~1


Previous Work On Global Sequence Alignment

3.1 Scoring Methods


A more general and still easy-to-implement method of assigning scores is to use matrices that contain the score of pair-wise alignments between every pair of letters of the alphabet being used. Such examples are PAM [12] and BLOSUM [15], which are widely used in protein alignment algorithms. We use a similar matrix for pairwise nucleotide scores in IGAUNA.

The choice of scoring function can have a great impact on the quality of the final alignment.
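As an illustration of matrix-based scoring for nucleotides, the sketch below builds a small substitution matrix and scores an ungapped pairing of two equal-length strings. The score values (2 for a match, -1 for a transition, -2 for a transversion) and the function names are assumptions made for this example; they are not the values used by IGAUNA.

```python
PURINES = {"A", "G"}  # C and T are pyrimidines


def build_matrix(match=2, transition=-1, transversion=-2):
    """Build a nucleotide scoring matrix as a dict keyed by (x, y) pairs.

    The default scores are illustrative assumptions only.
    """
    matrix = {}
    for x in "ACGT":
        for y in "ACGT":
            if x == y:
                matrix[x, y] = match
            elif (x in PURINES) == (y in PURINES):
                matrix[x, y] = transition      # purine<->purine or pyrimidine<->pyrimidine
            else:
                matrix[x, y] = transversion    # purine<->pyrimidine
    return matrix


def ungapped_score(a, b, matrix):
    """Sum the matrix scores of aligning a[i] with b[i] for every position i."""
    if len(a) != len(b):
        raise ValueError("ungapped scoring needs strings of equal length")
    return sum(matrix[x, y] for x, y in zip(a, b))


if __name__ == "__main__":
    m = build_matrix()
    print(ungapped_score("ACGT", "AGGT", m))
```

Changing the three parameters changes which pairings are preferred, which is one way to see how the scoring function influences the resulting alignment.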


3.2 Dynamic Programming

where $\sigma(x, y)$ gives the score of aligning character $x$ with character $y$.

Using this recursive equation, we can dynamically build a table and use the values stored in the previous rows/columns to ultimately calculate $V(m+1, n+1)$. Using this table, we can trace back and build the actual alignment.

Based on the Needleman-Wunsch algorithm, the authors in [33] improve the space complexity of the Needleman-Wunsch algorithm by using a new trick in the DP table: instead of keeping the whole table, they only keep the last row and column, and therefore they use linear space in the length of the inputs, and there is no change in the running time. If the actual optimal alignment is desired (instead of just the score of the optimal alignment), the running time will increase, but the magnitude will stay the same (i.e. quadratic).

With a slight change to the formula described above, the Smith-Waterman algorithm for local alignments can be obtained [36]. It has quadratic running time and space complexity.
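The score-only computation can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the implementation of [33] or of IGAUNA: it uses a simple match/mismatch score and a linear gap penalty (all three values are assumed), and it keeps only the previous row of the table, so it uses space linear in the length of the second string. Setting `local=True` applies the Smith-Waterman change: cell values are clamped at zero and the best cell anywhere in the table is returned.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Return the optimal global (or local) alignment score of a and b.

    Only the previous DP row is stored, so memory is O(len(b)).
    The scoring parameters are illustrative assumptions.
    """
    # Row 0: aligning the empty prefix of a with prefixes of b.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0  # best cell seen so far (used only for local alignment)
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            value = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                        prev[j] + gap,     # a[i-1] against a gap
                        curr[j - 1] + gap) # b[j-1] against a gap
            if local:
                value = max(value, 0)      # a local alignment may restart anywhere
                best = max(best, value)
            curr.append(value)
        prev = curr
    return best if local else prev[-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "AGT"))
    print(alignment_score("TTACGTT", "GGACGGG", local=True))
```

Recovering the alignment itself, and not just its score, needs either the full table for the traceback or a divide-and-conquer scheme on top of this row-by-row computation.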


3.3 Anchor-Based/Hit Methods

3.3.1 FASTA


3.3.2 BLAST

where $y$ is a computable function of $S$ and $p$ is the probability of finding an HSP with score greater than or equal to $S$ [?:%I. This result gives meaningful semantics to $S$: given the size of the database and a scoring system, the result determines what minimum score we need to look for in order not to get random hits.

The general algorithm of BLAST can be summarized as follows:

Step 1:

Find all subsequences of length $W$ such that their score against the query $Q$ is at least $T$ ($< S$). $W$ is typically equal to 3-5 and 11-12 for proteins and DNA sequences respectively.

Step 2:

⁴ National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/

⁵ The online web application can be found in http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

⁶ http://en.wikipedia.org/wiki/BLAST
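Step 1 of the BLAST outline (collecting all words of length W that score at least T against the query) can be sketched by brute force for a small alphabet. This is an illustrative assumption-laden sketch and not BLAST's actual implementation: the match/mismatch scores, the DNA alphabet and the names `word_score` and `neighborhood_words` are all chosen for the example, and the enumeration of every possible word is only practical for small W.

```python
from itertools import product


def word_score(u, v, match=1, mismatch=-1):
    """Score two equal-length words position by position (assumed scores)."""
    return sum(match if x == y else mismatch for x, y in zip(u, v))


def neighborhood_words(query, w, t, alphabet="ACGT"):
    """Map every word of length w scoring at least t against some length-w
    window of the query to the list of query positions where it does so."""
    hits = {}
    for pos in range(len(query) - w + 1):
        window = query[pos:pos + w]
        for letters in product(alphabet, repeat=w):
            word = "".join(letters)
            if word_score(word, window) >= t:
                hits.setdefault(word, []).append(pos)
    return hits


if __name__ == "__main__":
    # With t equal to the maximum score, only the exact query words remain.
    print(neighborhood_words("ACGT", 3, 3))
```

Lowering `t` enlarges the word list, which makes later steps more sensitive but also slower.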


In order to compute the complexity of BLAST, we have to know "what is the chance $q$ of an $S$-scoring sequence not having a $T$-scoring word of size $W$?". Experimental results show that given $T$ and $W$, one can find $\alpha$ and $\beta$ such that $q = e^{-(\alpha S + \beta)}$. Based on this, let $w$ be the number of words generated for an input query in Step 1 and $N$ be the number of residues in the database. Then the complexity of BLAST is $O(aw + bN + \ldots)$.

After the introduction of the original BLAST, many different versions aimed at different types of sequences (i.e. amino-acids, proteins, etc.) and for different platforms were developed. Some of these are BLASTN, PBLAST, BLASTX, PSI-BLAST, GAPPED-BLAST, MEGA-BLAST, FSA-BLAST, WU-BLAST, BLAT, etc.

PSI-BLAST (Position-Specific Iterative BLAST) and GAPPED BLAST are introduced in [2]. The idea behind GAPPED-BLAST is as follows: the original BLAST finds a single word of length $w$ that scores at least $T$ against the query. But if we find two words of length $w$ and score $T$ that lie on the same diagonal within distance $A$ from each other, then if $T'$ satisfies certain criteria, we just extend $T$, allowing gaps too, to reach $T'$. Using this method, we end up with more hits; but since we cover more of the string (by connecting $T$


3.3.3 CHAOS


3.3.5 GLASS

1. For an initial k, find all matching k-mers (k-long words).
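Finding all matching k-mers between two sequences can be sketched with a hash index. This is a generic illustration and not the GLASS implementation; the function name `matching_kmers` is hypothetical.

```python
from collections import defaultdict


def matching_kmers(a, b, k):
    """Return sorted (i, j) pairs such that a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)          # positions of every k-mer of a
    matches = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], ()):  # look up each k-mer of b
            matches.append((i, j))
    return sorted(matches)


if __name__ == "__main__":
    print(matching_kmers("ACGTAC", "TACGT", 3))
```

Indexing one sequence and scanning the other keeps the work close to linear in the two lengths, plus the number of reported matches.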

3.3.6 MUMmer

MUMmer is a relatively fast global alignment algorithm presented in [13]⁷. It uses suffix trees to find matches between two strings.

MUMmer uses maximal unique exact matches, called MUMs, as anchors. The uniqueness of a match in the two sequences means that there has to be only one copy of the matching
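The definition of a MUM can be made concrete with a brute-force sketch. MUMmer itself uses suffix trees; the quadratic code below is only an assumption-free restatement of the definition for small inputs, and the names `count_occurrences` and `find_mums` are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, i))


def find_mums(a, b, min_len=1):
    """Return (i, j, length) for every maximal unique match of a and b.

    A match is reported when it cannot be extended to the left or right
    and its text occurs exactly once in a and exactly once in b.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal here
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1    # extend to the right as far as possible
            word = a[i:i + k]
            if (k >= min_len
                    and count_occurrences(a, word) == 1
                    and count_occurrences(b, word) == 1):
                mums.append((i, j, k))
    return mums


if __name__ == "__main__":
    print(find_mums("AACGTT", "CCCGTG"))
```

In the example only "CGT" qualifies: the shorter shared words "C", "G" and "T" each occur more than once in at least one of the two sequences.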


3.3.7 ClustalW

ments⁸. It is more sensitive than the other commonly-used global alignment methods by using the following method⁹:

⁸ It is available at http://www.ebi.ac.uk/clustalw/#

⁹ http://bimas.dcrt.nih.gov/clustalw/clustalw.html


3.3.8 AVID

Once the recursion is completed, AVID aligns the remaining unaligned regions using the Needleman-Wunsch algorithm [34] if they are sufficiently short, and otherwise leaves these regions unaligned. MAVID [6] is a progressive multiple alignment tool that incorporates AVID.

3.3.9 MGA


3.4 Genetic Algorithm

3.5 Hidden Markov Model


problem, then transition and emission probabilities should be decided by training using Forward/Backward algorithms. Then the Viterbi algorithm can be used to align sequences. One good feature of HMMs is that they can be used to identify whether a sequence belongs to a particular family of sequences (i.e. proteins) [X ) ]. However, this approach is not as popular as other methods, because the topology of the HMM model is highly dependent on the particular problem and the sequences being studied. We also need a large number of sequences in order to train the HMM and find the transition/emission probabilities.


General Algorithms For Anchor-Based Methods

4.1 Building Suffix Tree

Using a naive approach, building a suffix tree on a string S[1..n] takes O(n^2) time and space. We can do that in an iterative way as follows: start the tree by making a root and an edge, and label the edge with the longest suffix of S, i.e. S itself. Then take the next suffix by eliminating the first character of the previous suffix and traverse the tree starting from the root. As long as characters are found on the tree that match the current suffix, follow the edges and branches. When a character that does not match the next character on the tree is encountered, create a new branch and an edge, and label the edge with the remaining characters of the current suffix. An alternative method using the same idea is to start from the shortest suffix and add longer suffixes in each iteration. Figure 4.1 shows this process.

Each suffix of length m can be added to the tree in O(m) time and therefore the total construction time is O(n^2).
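The iterative construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this thesis: the names `Node`, `build_naive_suffix_tree`, `leaf_count` and `contains` are hypothetical, edge labels are stored as explicit strings (hence the quadratic space), and a terminal symbol `$` that does not occur in the input is assumed so that every suffix ends at a leaf.

```python
class Node:
    def __init__(self):
        # Maps the first character of an outgoing edge to (edge label, child node).
        self.children = {}


def build_naive_suffix_tree(s, terminal="$"):
    """Insert every suffix of s + terminal, longest first (assumes terminal not in s)."""
    text = s + terminal
    root = Node()
    for start in range(len(text)):
        node, pos = root, start
        while pos < len(text):
            first = text[pos]
            if first not in node.children:
                # No edge starts with this character: add the rest as a new leaf edge.
                node.children[first] = (text[pos:], Node())
                break
            label, child = node.children[first]
            # Follow the edge while its label matches the current suffix.
            k = 0
            while k < len(label) and pos + k < len(text) and text[pos + k] == label[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split the edge at offset k.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, pos = child, pos + k
    return root


def leaf_count(node):
    """Number of leaves below node, i.e. number of suffixes passing through it."""
    if not node.children:
        return 1
    return sum(leaf_count(child) for _, child in node.children.values())


def contains(root, pattern):
    """True if pattern is a substring of the indexed text."""
    node, rest = root, pattern
    while rest:
        if rest[0] not in node.children:
            return False
        label, child = node.children[rest[0]]
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return False
        node, rest = child, rest[k:]
    return True
```

For example, `build_naive_suffix_tree("banana")` produces a tree with one leaf per suffix of `banana$`, and `contains` can then answer substring queries by walking from the root.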


4.1.1 High Level Ukkonen's Algorithm

Definition An implicit suffix tree on string S is a tree obtained from the suffix tree for S by removing every copy of the terminal symbol $ from the edge labels of the tree, then removing any edge that has no label, and then removing any node that does not have at least two children. We denote the implicit suffix tree of the string S[1..i] by I_i.

An implicit suffix tree on S includes all the suffixes of S, but some suffixes might not end at a leaf. Figure 4.2(a) shows an example of an implicit suffix tree.

Ukkonen's algorithm is divided into m phases, where m is the length of S. In phase i + 1, tree I_{i+1} is constructed from I_i. Each phase i + 1 is further divided into i + 1 extensions, one for each of the i + 1 suffixes of S[1..i + 1]. In extension j of phase i + 1, the algorithm first finds the end of the path from the root labeled with substring S[j..i]. It then extends the substring by adding the character S[i + 1] to its end, unless S[i + 1] already appears there. I_1 is just the single edge labeled by character S[1].


Construct I_1
for i from 1 to m - 1 do
    {performing phase i + 1}
    for j from 1 to i + 1 do
        {performing extension j}
        Find the end of the path from the root labeled S[j..i] in the current tree. If needed, extend that path by adding character S[i + 1] to make sure that S[j..i + 1] is in the tree.
    end for
end for
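To make the phase/extension structure concrete, the following Python sketch runs the same two nested loops on an uncompressed trie (one character per edge, stored as nested dictionaries). This is a simplification chosen for illustration: the function names `implicit_suffix_trie` and `has_path` are hypothetical, indices are 0-based, and because every extension walks its path from the root the sketch takes cubic time. It only shows what each extension has to guarantee, not how Ukkonen's algorithm achieves it quickly.

```python
def implicit_suffix_trie(s):
    """Build an uncompressed implicit suffix trie of s by phases and extensions."""
    root = {}
    if not s:
        return root
    root[s[0]] = {}                      # I_1: the single edge labeled S[1]
    for i in range(1, len(s)):           # phase i + 1 (1-based) appends s[i] (0-based)
        for j in range(0, i + 1):        # one extension per suffix of s[0..i]
            node = root
            for ch in s[j:i]:            # walk to the end of the path for s[j..i-1]
                node = node[ch]
            node.setdefault(s[i], {})    # extend unless the character is already there
    return root


def has_path(trie, pattern):
    """True if pattern labels a path starting at the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

With `s = "xabxa"`, every substring of `s` labels a path from the root, but the suffix `xa` ends at an internal position (it is continued by `b`), which is exactly the situation where an implicit suffix tree has a suffix that does not end at a leaf.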

4.1.2 Speedup Technique, Part 1


i + 1 works.

Now, we can introduce a technique that will reduce the worst-case running time of the algorithm to O(m^2).


Algorithm 2 Single Extension Algorithm (SEA)


character g on the edge and quits, making sure that the γ path from s(v) ends on that edge exactly g characters down its label.


4.1.3 Speedup Technique, Part 2

One obstacle to proceeding further and reducing the running time of Ukkonen's algorithm to O(m) is the fact that if we record all the characters on the edges of the tree, the algorithm will require Θ(m^2) space, and therefore O(m) running time will not be achievable. To overcome this problem, instead of recording characters, we label the edges by a pair of indices identifying the start and end indices of the substring on that edge. This way, only two numbers are written on any edge, and since the number of edges is at most 2m - 1, the tree will only use O(m) space.
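As a small illustration of the index-pair labeling, an edge can be stored as two integers and its label recovered from the text on demand. The class name `Edge` and the use of 0-based, half-open indices are assumptions made for this sketch (the text above uses inclusive start and end indices).

```python
from dataclasses import dataclass


@dataclass
class Edge:
    start: int  # 0-based index of the first label character in the text
    end: int    # 0-based index one past the last label character

    def label(self, text):
        """Recover the edge label from the shared text; nothing is copied at build time."""
        return text[self.start:self.end]


text = "banana$"
edge = Edge(start=2, end=4)   # represents the substring "na" with two integers
```

Each edge then costs constant space regardless of the length of its label, which is what makes the O(m) space bound possible.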

Observation 1: In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all further extensions until the end of that phase. The reason is that, when rule 3 applies, the path labeled S[j..i] in the current tree must continue with character S[i + 1], and so the path labeled S[j + 1..i] does also, and rule 3 again applies in the next extensions. It is also beneficial to observe that a new suffix link needs to be added to the tree only after an extension in which extension rule 2 applies. Now we can state the next trick.

Technique 2: End any phase i + 1 the first time that extension rule 3 applies. If this happens in extension j, then there is no need to explicitly find the end of any string S[k..i] for k > j. We call the extensions in phase i + 1 that are done after the first execution of rule 3 implicit extensions.

Observation 2: If at some point in Ukkonen's algorithm a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm. The reason is that there is no mechanism in the algorithm to extend a leaf edge beyond its current leaf, i.e. when a leaf is labeled j, extension rule 1 will always apply to extension j in any successive phase.

Let j_i denote the last extension in this sequence. Now we can present the last trick.

Technique 3: In phase i + 1, when a leaf edge is first created and would normally be labeled with substring S[p..i + 1], instead of writing indices (p, i + 1) on the edge, write (p, e), where e is a symbol denoting "the current end". Symbol e is a global index that is set to i + 1 once in each phase. In phase i + 1, since the algorithm knows that rule 1 will apply in extensions 1 through j_i, at least, it need do no additional explicit work to implement those j_i extensions. Instead, it only does constant work to increment variable e and then does explicit work for (some) extensions starting with extension j_i + 1.
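The effect of the shared "current end" can be seen in a small Python sketch. The names `GlobalEnd` and `LeafEdge` are hypothetical, indices are 0-based and half-open, and the example text is chosen with all characters distinct so that each phase creates exactly one new leaf at the root. This is an assumption made to keep the example short, not a full implementation of the algorithm.

```python
class GlobalEnd:
    """Shared 'current end' index e; one object referenced by every leaf edge."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, start, end_ref):
        self.start = start        # fixed start index p of the leaf label
        self.end_ref = end_ref    # reference to the shared end, never a copy

    def label(self, text):
        return text[self.start:self.end_ref.value]


text = "abc"                      # all characters distinct (assumption of this example)
e = GlobalEnd()
leaves = []
for i in range(len(text)):
    e.value = i + 1               # constant work: implicitly extends every existing leaf
    leaves.append(LeafEdge(i, e)) # the only explicit extension in this phase

labels = [leaf.label(text) for leaf in leaves]
```

After the last phase the three leaf labels are `abc`, `bc` and `c`, although no leaf edge was ever touched after it was created.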

Using techniques 2 and 3, explicit extensions in phase i + 1 (using Algorithm 2) are only required


Algorithm 3 Single Phase Algorithm (SPA)

1. Increment index e to i + 1 (By Technique 3, this correctly implements all implicit extensions 1 through j_i).

2. Explicitly compute successive extensions (using Algorithm 2) starting at j_i + 1 until reaching the first extension j* where rule 3 applies or until all extensions are done in this phase (By Trick 2, this correctly implements all the additional implicit extensions j* + 1 through i + 1).

3. To prepare for the next step, set j_{i+1} to j* - 1.

Theorem 2: Using suffix links and tricks 1, 2 and 3, Ukkonen's algorithm builds implicit suffix trees I_1 through I_m in O(m) time.

4.2 Finding Maximum-Weight Anchor Set

Given a set of matches, we would like to select a set of non-crossing matches, also referred to

Based on this definition, a given match M = (i1, i2, l1, l2), corresponding to a match between S1[i1..i2] and S2[l1..l2], defines the points of a rectangle r in the Cartesian plane with the bottom left corner being the point (i1, l1) and the top right corner being (i2, l2). Define the weight of r to be its area. The problem of finding a set of maximal matches reduces to finding a set of non-crossing rectangles with maximum area coverage. More formally:


Definition Let

based on their Xmax, and we sweep them from right to left. At each stage, when we are processing rectangle i, we want to link it to the maximum weight path in the interval [Xmax[i], +∞] × [Ymax[i], +∞] and store it in a set D. We use Next[i] to denote the next rectangle in the best path containing rectangle i (the path with maximum weight).

The operations needed to build and manipulate set D are Update and Best. The operation Best(D, y) returns Head(i) of a rectangle with minimum Ymin[i] ≥ y and returns 0 if no such rectangle i exists. Update(D, i) updates the set D as follows: it adds the best path starting at rectangle i to D, but it preserves the compatibility among members of D. In other words, for any two paths p and q such that p ≤ q in set D, we would like to remove p.


Algorithm 4 Finding Maximum Weight Anchors

Now we can formally describe this process in Algorithm 4.

In order to achieve a running time of O(n log n), we need to efficiently implement set D so that the Best and Update operations take O(log n) time. That means we need to be able to search, insert, join and split in O(log n) time. We can accomplish that using a balanced search tree. Since each path i can be uniquely identified by its Head(i) (and the rest of the path can be constructed using Next[i] pointers), each element in D can be represented by a number, which is the rectangle number that it corresponds to. Each element of D has a TotalWeight and a Ymin associated with it. However, since all elements of D are mutually compatible and because in the for loop we scan the elements based on their X-coordinates from right to left, the order of ascending TotalWeight is the same as the order of descending Ymin. Therefore, although we have two keys associated with each element (i.e. TotalWeight and Ymin), if we sort the elements in the tree on one key, they will be sorted on the other key in the opposite order.
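For small inputs, the same selection problem can be solved by a plain quadratic dynamic program, which is easier to read than the balanced-tree version. The Python sketch below deliberately swaps in that simpler O(n^2) technique and is not the O(n log n) procedure of Algorithm 4. The function name `max_weight_chain`, the tuple layout `(x1, y1, x2, y2)`, and the rule that one rectangle may follow another only if it lies strictly above and to the right of it are assumptions made for this illustration.

```python
def max_weight_chain(rects):
    """Return (total weight, chain of indices) of a maximum-weight non-crossing chain.

    rects: list of (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2; weight is the area.
    """
    if not rects:
        return 0, []

    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])

    # Sweep from right to left: every possible successor is processed first.
    order = sorted(range(len(rects)), key=lambda i: rects[i][0], reverse=True)
    best = {}   # i -> weight of the best chain that starts at rectangle i
    nxt = {}    # i -> next rectangle in that chain, or None
    for i in order:
        total, follower = 0, None
        for j, w in best.items():
            # j may follow i only if it starts strictly beyond i's top right corner.
            if rects[j][0] > rects[i][2] and rects[j][1] > rects[i][3] and w > total:
                total, follower = w, j
        best[i] = area(rects[i]) + total
        nxt[i] = follower

    start = max(best, key=best.get)
    chain, cur = [], start
    while cur is not None:
        chain.append(cur)
        cur = nxt[cur]
    return best[start], chain
```

The dictionary `nxt` plays the role of the Next[i] pointers in the text; replacing the inner loop by a query on a balanced search tree keyed on Ymin is what brings the running time down to O(n log n).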


GAUNA

In this chapter we will describe GAUNA (Global Alignment Using Non-exact Anchors) [...] and show its performance compared to other state-of-the-art algorithms. In order to do this, we will first give a high-level overview of how GAUNA works and then we will explain every part of the algorithm in more detail. GAUNA is the basis for IGAUNA and therefore understanding

5.1 Method description

GAUNA is a recursive algorithm based on the following three main steps:


Algorithm 5 GAUNA High Level


5.2 Finding Maximal Inexact Matches

Given two sequences S1 and S2, a quadruple (i1, i2, l1, l2) is called a match if the optimal alignment score of the two subsequences S1[i1, i1 + l1 - 1] and S2[i2, i2 + l2 - 1] is greater than or equal to a certain threshold. Note that, in what follows, S[i, j] denotes the subsequence of the sequence S starting at position i and ending at position j. If S1[i1, i1 + l1 - 1] = S2[i2, i2 + l2 - 1], the match is called an exact match; otherwise it is called an inexact match. Following the definition of Delcher et al., a match (i1, i2, l1, l2) is called maximal if it cannot be extended at either endpoint [13]. For inexact anchors we generalize this definition as follows: an inexact match (i1, i2, l1, l2) is maximal if there is no other match (i1', i2', l1', l2') such that S1[i1, i1 + l1 - 1] is a proper subsequence of S1[i1', i1' + l1' - 1] and S2[i2, i2 + l2 - 1] is a proper subsequence of S2[i2', i2' + l2' - 1]. We will only consider inexact matches for which l1 = l2, and therefore our matches will be represented by a triple (i1, i2, l).

As in other anchor-based methods, GAUNA uses suffix trees to find anchors. In Section 4.1 we have described in detail how to build a suffix tree in linear time and space. For a sequence S, the salient feature we need of a suffix tree for S is that the concatenation of edge labels on the path from the root to an internal node is a repeated subsequence in S, where the number of repeats corresponding to an internal node is equal to the number of leaves of the subtree rooted at that internal node.

Let S1 and S2 be the two input sequences to our algorithm, of length m and n, respectively. We build a suffix tree for S1 and then search for subsequences of S2 over this suffix tree. We wish to find all maximal matches of S1 and S2 that have an alignment score above a threshold s. The naive method is to find, for each subsequence S2' of S2, those paths of the suffix tree starting at the root whose label, matched with S2', has an alignment score greater than s. Notice there can be a large number of such paths, and even the most efficient algorithms known for this problem have very high running time and space requirements, making them impractical [30].

To overcome this problem, we must consider the structure of the input sequences. When the input sequences are very similar, there can be up to O(mn) maximal matches (when we have short matches in one sequence repeated many times in the other sequence), while the number of anchors is at most O(min{m, n}) (since anchors are non-crossing matches, the number of anchors is at most equal to the number of characters in the shortest sequence). In this case most of the maximal matches are discarded during the anchor selection phase.


Thus it suffices to find a small number of maximal matches; we choose those with the highest score. On the other hand, if the input sequences are dissimilar, the number of maximal matches is usually close to the number of final anchors, in which case our algorithm will find most of the maximal matches.

Define the similarity value between two subsequences X and Y, S(X, Y), as the score of the optimal alignment divided by the length of the subsequence (we only consider matches with subsequences of the same length).
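The similarity value can be computed directly from this definition. The Python sketch below uses a standard Needleman-Wunsch score computation with a simple scoring scheme (match +1, mismatch -1, gap -1); these scoring values and the function names `nw_score` and `similarity` are assumptions for illustration only and are not the scoring matrix used in our experiments.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of x and y (score only, two rows of memory)."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def similarity(x, y, **scoring):
    """S(X, Y): optimal alignment score divided by the common length of X and Y."""
    if not x or len(x) != len(y):
        raise ValueError("similarity is defined here for non-empty, equal-length inputs")
    return nw_score(x, y, **scoring) / len(x)
```

Under this scoring, two identical subsequences have similarity 1, and a single substitution in a subsequence of length 4 lowers the value to 0.5.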

Our suffix tree search algorithm works as follows. Let T be the suffix tree of S1. A location in T is either a node of T or a point on an edge of T that splits the label of the edge into two subsequences. For each suffix S2' of S2, we find all locations p in T such that the label of the path from the root of T to p has a high similarity value with some prefix of S2'. Algorithm 6 depicts our suffix tree search method for a suffix S2' of S2.

Let S2' = S2[r, n] be a suffix of S2 and let P be the set of locations in T returned by Algorithm 6 for S2'. We now find the set of inexact matches between the prefixes of S2' and subsequences of S1. For each location p_i in P, let Y_i be the label of the path from the root of T to p_i. Let R_i be the set of occurrences of Y_i in S1. Notice that occurrences of Y_i in S1 correspond to the labels of the paths from the root through p_i to a leaf of the subtree rooted at p_i. Therefore R_i can be computed efficiently by traversing the subtree of T rooted at p_i. Once the set R_i is computed, for each subsequence S1[r_j, r_j + |Y_i| - 1] in R_i the algorithm outputs (r_j, r, |Y_i|) as an inexact match (see Figure 5.2).

This procedure significantly reduces search time by pruning the search space. However, computing the optimal alignment between Y_i and S2' using Needleman-Wunsch requires quadratic time and is too slow for our purposes. To overcome this, the dynamic programming is limited to a band of width d around the main diagonal (Figure 5.1), thus reducing the running time to O(d|Y_i|). Notice that anchors found this way do not contain long sequences of indels; long sequences of indels in the conserved regions will only be detected when we close gaps between the anchors. Algorithm 7 describes the maximal match-finding method.
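The banded restriction can be sketched as follows in Python. Only cells (i, j) with |i - j| <= d are filled, so each row costs O(d) and the whole computation costs O(d|Y_i|). The function name `banded_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions for this sketch; here the band is taken to extend d cells on each side of the main diagonal, which may differ from the exact convention used for d elsewhere in this thesis.

```python
def banded_score(x, y, d, match=1, mismatch=-1, gap=-1):
    """Global alignment score of equal-length x and y using only cells with |i - j| <= d."""
    n = len(x)
    if len(y) != n:
        raise ValueError("banded_score expects sequences of equal length")
    neg_inf = float("-inf")
    # Row 0: only columns 0..d are inside the band.
    prev = {j: j * gap for j in range(0, min(d, n) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - d), min(n, i + d) + 1):
            if j == 0:
                cur[j] = i * gap              # first column: i leading gaps
                continue
            sub = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(
                prev.get(j - 1, neg_inf) + sub,   # diagonal move (always inside the band)
                prev.get(j, neg_inf) + gap,       # gap in y, if the cell above is in the band
                cur.get(j - 1, neg_inf) + gap,    # gap in x, if the cell to the left is in the band
            )
        prev = cur
    return prev[n]
```

With d = 0 the computation degenerates to scoring the ungapped diagonal, while a larger d admits alignments with up to d more gaps in one sequence than in the other at any point; this is why matches found with a narrow band cannot contain long runs of indels.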

After finding inexact matches, the algorithm next identifies maximal matches. To do so, we sort the matches with respect to their locations in one of the sequences, detect the non-maximal matches, and remove them from the set of matches.
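The filtering step can be illustrated with the short Python sketch below, which represents each match as a triple (i1, i2, l) as defined earlier in this section. The function name `remove_non_maximal` is hypothetical, and the sketch compares each match against all matches kept so far, which is quadratic in the worst case; it is meant to make the containment test explicit rather than to be efficient.

```python
def remove_non_maximal(matches):
    """Drop every match (i1, i2, l) that is properly contained in another match."""

    def covers(b, a):
        # True if both intervals of a lie inside those of b and b is strictly longer.
        bi1, bi2, bl = b
        ai1, ai2, al = a
        return (bl > al
                and bi1 <= ai1 and ai1 + al <= bi1 + bl
                and bi2 <= ai2 and ai2 + al <= bi2 + bl)

    # Sort by position in S1, longer matches first, so that any match covering
    # m appears before m in the ordering.
    ordered = sorted(set(matches), key=lambda m: (m[0], -m[2], m[1]))
    kept = []
    for m in ordered:
        if not any(covers(k, m) for k in kept):
            kept.append(m)
    return kept
```

Because containment is transitive, it is enough to test each match against the matches already kept: if a covering match was itself removed, the match that covered it is still in the kept list.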


Algorithm 6 Searching On The Suffix Tree


Figure 5.1: To make the DP routine faster, only the area shown will be computed. This results in shorter sequences of indels in the matches.

Algorithm 7 Finding Maximal Matches


5.3 Selecting Anchors

5.3.1 Finding Largest Total Weight Non-crossing Anchors

Figure 5.3: A set of anchors is depicted. Rectangles represent the maximal matches, and a set of good anchors is depicted in white rectangles connected by dashed lines.


5.4 Closing The Gaps

5.5 GAUNA Parameters

Width d of the band around the main diagonal of the dynamic programming table.

To align the subsequences of S2 with the path labels of the suffix tree, the suffix tree search algorithm restricts the dynamic programming table to a band of width d around the main diagonal.


GAUNA Results

6.1 Exact vs. Inexact Anchors, Specificity Evaluation

6.2 GAUNA Parameter Settings and Results


Table 6.1: GAUNA Specificity

Sequences | Anchors | Specificity

considered in our test process are available at ~1f,l,~~://\\~\\:\\~.~>i111~.111;1t,~l.(:i1/gil1111~/.

For comparison purposes all programs were run on a Linux machine with a 3.4 GHz Intel(R) Xeon(TM) processor and 2 GB of RAM.

We note that for the alignment of human and chimp, GAUNA chose exact anchors, as these covered more than 50% of the sequences. In the other cases the parameter settings were: K = {25, 10, 7}, E = 1500, (1 = O.S> and .I,: = 7 (a description of each parameter can be found in Section 5.5).


Our comparisons of GAUNA, LAGAN, AVID, MUMmer, and MGA are summarized in Table 6.2. We tested the other tools (where possible) to find the parameters that maximize their performance.

To measure the quality of the alignments, we find the alignment regions that have a high alignment score and cover more than 10% of an exon. The total length of such regions determines the quality of an alignment and is shown under the Coverage column in Table 6.2.

To handle regions of different length, we define the normalized score for an alignment region to be s / (l · P), where s is the score of the alignment, l is the length of the alignment, and P is the maximum value in the scoring matrix (note that the normalized score never exceeds 1). A region that has a normalized score above 0.8 is considered as a high alignment score
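The normalization can be written as a two-line helper. The Python sketch below assumes the formula s / (l · P) described above; the function names `normalized_score` and `is_high_scoring` and the example numbers are hypothetical and are not taken from our experiments.

```python
def normalized_score(score, length, max_matrix_value):
    """Alignment score divided by the best score a region of this length could reach."""
    if length <= 0 or max_matrix_value <= 0:
        raise ValueError("length and max_matrix_value must be positive")
    return score / (length * max_matrix_value)


def is_high_scoring(score, length, max_matrix_value, threshold=0.8):
    """True if the region's normalized score is above the threshold (0.8 in the text)."""
    return normalized_score(score, length, max_matrix_value) > threshold
```

For instance, a region of length 100 with score 90 under a scoring matrix whose largest entry is 1 has a normalized score of 0.9 and counts as high scoring, whereas a score of 150 with a largest entry of 2 gives 0.75 and does not.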


Table 6.2: The results of global alignments for different programs (GAUNA, LAGAN, AVID, MUMmer, MGA). Time is in seconds and memory is in megabytes.


7.1 Measuring An Alignment

as we know not efficient) to apply, for the case of biological sequences it does not provide us with much information about the alignment. So we need more sophisticated ways of


defined a conserved region to be an alignment region that has a high alignment score and covers more than 10% of an exon. The reason why exons are important is that from the

found that do not overlap with any exon, we think they might still be of some importance. The biological motivation behind the importance of conserved regions (and therefore exons) is that if a region has changed little over time, it must be resistant to mutations, and so there is a good chance that it has been of some importance for the life of the species.

Therefore, in IGAUNA, besides looking at the conserved regions overlapping exons, we also look at the overall coverage of conserved regions, and our measurements are based on both these factors.

7.2 Improvements to GAUNA


7.2.2 Branching


exonList = GeneScan(S2)   {finding the potential set of exons in S2}

IGAUNA(sequence S1, sequence S2, array K[1..l], index i, matches M, node suffixTreeRoot, exons exonList)
    {K[1..l]: a list of length thresholds for anchors in different recursions: K[1] > K[2] > ... > K[l]}
    {i: the current index for K}
    {exonList is the set of potential exons}
    if |S1||S2| is sufficiently small then
        Align S1 and S2 using Needleman-Wunsch algorithm.
        return
    end if
    if i > l then
        return   {Leave S1 and S2 unaligned}
    end if
    Call FindMaximalMatches(S1, S2, K[i], M, suffixTreeRoot, exonList).
    Adjust match weights of matches in M based on exons in exonList and the parameter AMPLIFYING_RATIO.
    Select a subset of anchors with maximum total weight and put them in the final alignment of S1 and S2.
    for each pair of inter-anchor sequences, S1' and S2' do
        Call IGAUNA(S1', S2', K, i + 1, M, suffixTreeRoot, exonList) to align S1' and S2'.
    end for
    return
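The weight-adjustment and anchor-selection steps of the routine above can be illustrated with a small sketch. It is not the thesis implementation: the `Match` tuple, the rule "weight = length, multiplied by the ratio when the match overlaps an exon", the ratio value 2.0, and the quadratic chaining DP are all assumptions made for illustration. Exons are taken to be half-open intervals on S2.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Match(NamedTuple):
    """Hypothetical anchor candidate: equal-length segments of S1 and S2."""
    start1: int
    start2: int
    length: int


AMPLIFYING_RATIO = 2.0  # assumed value; the source does not state one here


def weight(m: Match, exons: Sequence[Tuple[int, int]], ratio: float) -> float:
    """Length of the match, amplified if it overlaps any exon on S2."""
    end2 = m.start2 + m.length
    in_exon = any(m.start2 < e_end and e_start < end2 for e_start, e_end in exons)
    return m.length * ratio if in_exon else float(m.length)


def select_anchors(matches: Sequence[Match],
                   exons: Sequence[Tuple[int, int]],
                   ratio: float = AMPLIFYING_RATIO) -> List[Match]:
    """Pick a collinear, non-overlapping chain of matches with maximum total weight."""
    ms = sorted(matches, key=lambda m: (m.start1, m.start2))
    if not ms:
        return []
    w = [weight(m, exons, ratio) for m in ms]
    best = w[:]               # best[j]: weight of the best chain ending at ms[j]
    prev = [-1] * len(ms)     # back-pointers for chain recovery
    for j, b in enumerate(ms):
        for i in range(j):
            a = ms[i]
            # a may precede b only if it ends before b starts in both sequences
            if a.start1 + a.length <= b.start1 and a.start2 + a.length <= b.start2:
                if best[i] + w[j] > best[j]:
                    best[j], prev[j] = best[i] + w[j], i
    j = max(range(len(ms)), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(ms[j])
        j = prev[j]
    return chain[::-1]
```

In this sketch, amplifying exon matches can change which chain wins: a single exon-overlapping match may outweigh two longer non-exon matches that it conflicts with.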

As discussed in Section 5.2, the main parameters used in GAUNA (which are kept the same in IGAUNA as well) are the K-values, the similarity threshold s, and the diagonal width d in the DP table.

The inter-anchor length threshold E is a threshold that determines when to use the Needleman-Wunsch algorithm in a region instead of finding matches. It is set such that the O(n^2) [...]


FindMaximalMatches(sequence S1, sequence S2, int k, set M, node suffixTreeRoot, exons exonList)
    {M will hold matches}
    P = ∅
    for ∀ suffixes S2' of S2 do
        if S2' has an overlap with an exon in exonList then
            Call SearchTreeBranching(suffixTreeRoot, S2', P, ∅, k, exonList)
        else
            Call SearchTree(suffixTreeRoot, S2', P, ∅, k)
        end if
        for ∀ p_i ∈ P do
            Find occurrences of p_i in S1 and add them to M
        end for
    end for
    Remove redundant matches
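As a rough illustration of what the routine above collects, the following sketch finds maximal exact matches of length at least k between two strings. It deliberately swaps the suffix tree for a plain k-mer hash index and leaves out branching (non-exact matching in exon regions), so it only mirrors the non-branching path. The function name and the tuple layout `(start in S1, start in S2, length)` are assumptions.

```python
from collections import defaultdict
from typing import List, Tuple


def find_maximal_matches(s1: str, s2: str, k: int) -> List[Tuple[int, int, int]]:
    """Return sorted (start1, start2, length) for maximal exact matches of length >= k."""
    # Index every k-mer of s1 by its start position.
    index = defaultdict(list)
    for i in range(len(s1) - k + 1):
        index[s1[i:i + k]].append(i)

    matches = set()
    for j in range(len(s2) - k + 1):
        for i in index.get(s2[j:j + k], ()):
            # Skip seeds that can be extended to the left: the same match is
            # reported from its leftmost seed, which avoids redundant matches.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend the seed to the right as far as the characters agree.
            n = k
            while i + n < len(s1) and j + n < len(s2) and s1[i + n] == s2[j + n]:
                n += 1
            matches.add((i, j, n))
    return sorted(matches)
```

For example, `find_maximal_matches("ACGTACGTTT", "GGACGTAC", 4)` reports one match of length 6 starting at position 0 of the first string and a second, shorter match starting at position 4.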

[...] matches and helps us control the running time by limiting the space that DP covers. As a side effect, it also limits the alignment to be of a special type that does not allow more than d consecutive gaps in the matches. This is acceptable because for matches containing long consecutive gap intervals, we can view these as two separate matches and identify them independently.
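To make the effect of a diagonal width concrete, here is a minimal banded global-alignment scorer. It is a generic banded DP, not the thesis code: the scoring values (match +1, mismatch -1, gap -1) and the rule that only cells with |i - j| <= d are evaluated are assumptions chosen for illustration.

```python
from typing import Optional


def banded_score(a: str, b: str, d: int,
                 match: int = 1, mismatch: int = -1, gap: int = -1) -> Optional[float]:
    """Global alignment score using only DP cells within d of the main diagonal."""
    n, m = len(a), len(b)
    if abs(n - m) > d:
        return None  # the final cell (n, m) lies outside the band
    neg = float("-inf")
    # Row 0: only columns 0..d are inside the band.
    prev = [j * gap if j <= d else neg for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [neg] * (m + 1)
        for j in range(max(0, i - d), min(m, i + d) + 1):
            if j == 0:
                cur[j] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]
```

The inner loop touches at most 2d + 1 cells per row, which is where the saving comes from; the price is that any alignment drifting more than d positions off the diagonal is never considered.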

Based on our experiments, K-values greatly affect the speed and quality of the alignment and can make a major difference in the quality of the solution. We performed extensive testing to determine the quality of the alignment based on different sets of K-values. Intuitively, the more levels there are, the longer IGAUNA takes to run. The bigger the K-value, the longer the matches we find should be, and therefore if we choose too large a K-value, we might find just a few matches that do not cover much of the two sequences and [...]


7.3 Optimal Alignment

[...] alignments.

The remaining regions between exons are aligned using regular IGAUNA routines, the way we have described them in the previous sections (as opposed to using BLAST). The final result is an alignment with emphasis on the alignment of exon regions and, as we will see, has a higher score when we consider conserved regions overlapping exons.

The advantage of having this so-called optimal alignment is that it gives us an estimate of how much a regular alignment has the potential to improve in terms of exon alignment. If the optimal alignment has a score much higher than a given alignment, we know that there is still much room for improvement in the tool that produced the alignment. However, if the given alignment scores very closely to the optimal alignment, we know that the quality of the alignment cannot be improved much. In this case, our focus will be on improving the speed of the tool and how much space it uses. So basically, if two different methods produce two alignments with scores very close to the optimal alignment, then the one which produces the alignment faster and with less memory has the advantage over the other one.
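The comparison rule described above can be written down as a small helper. This is only a sketch under stated assumptions: "very close to the optimal" is modelled as a relative gap of at most `tolerance` (2% here, an arbitrary choice), and "faster and with less memory" is modelled as ordering by running time first and memory second. The function name and the input layout are hypothetical.

```python
from typing import Dict, Optional, Tuple


def pick_tool(results: Dict[str, Tuple[float, float, float]],
              optimal_score: float,
              tolerance: float = 0.02) -> Optional[str]:
    """results maps tool name -> (score, seconds, megabytes).

    Among tools whose score is within `tolerance` of the optimal score,
    return the one with the smallest (seconds, megabytes); None if no tool
    is close enough, meaning alignment quality still has room to improve.
    """
    near = {
        name: r for name, r in results.items()
        if (optimal_score - r[0]) / optimal_score <= tolerance
    }
    if not near:
        return None
    return min(near, key=lambda name: (near[name][1], near[name][2]))
```

With made-up inputs such as `{"A": (990, 130, 90), "B": (985, 52, 85)}` and an optimal score of 1000, both tools fall within the tolerance, so the faster one, "B", is preferred.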


7.4 IGAUNA Parameters

Using different parameters gives IGAUNA the flexibility to be used for different cases easily. By properly setting these parameters, IGAUNA can actually run exactly like GAUNA or it can be executed to find the optimal alignment. Besides the ones we have already described [...]


IGAUNA Results and Conclusion

8.1 Experimental Settings

8.2 Parameter Settings


[Table: results per K-value set; only the K-value column survives: (20, 7), (20, 10), (25, 10), (25, 7), (30, 15), (20, 10, 7), (25, 10, 7), (35, 10, 7), (35, 20, 7), (50, 30, 12), (40, 20, 10, 7), (45, 25, 10, 7), (50, 30, 10, 7)]

[...] considered the following K-value sets: (20, 7), (20, 10), (25, 10), (25, 7), (30, 15), (20, 10, 7), (25, 10, 7), (35, 10, 7), (35, 20, 7), (50, 30, 12), (40, 20, 10, 7), (45, 25, 10, 7), (50, 30, 10, 7). With every change in the set of K-values, IGAUNA's performance consistently changed in all the species; therefore, to show the changes, we will only show the results for one pair of species (Human-Dog alignment).

As can be seen from Table 8.1, changing K-value sets (within reasonable values) does not affect sensitivity and specificity significantly. However, the running time can jump at some points. The same pattern of jumps in time applies to other species as well, but with varying intensities, from 1.5 to 4 times increase in time.

Looking at Table 8.1 reveals that since the running time does not change much (except for the jump points), to get the maximum sensitivity and specificity we should choose the K-value set (25, 10, 7).


8.3 Alignment Results

8.3.1 Memory Usage


Program   TCL   ECL   NEC   TCLE   Time (s)   Mem (MB)

[...] the sequences and feeding them to GeneScan, we can solve that problem. Figure 8.1 shows [...]


In terms of speed, the overhead that GeneScan causes is the biggest factor in reducing IGAUNA's speed compared to GAUNA. However, it is possible to extract the exons by [...] IGAUNA allow this conveniently). Branching has an influence on the speed as well. But [...] parameters. Overall, although branching does reduce the speed, it does not reduce it much.


Table 8.6: Human-Mouse Alignment Results (? = digit illegible in the source)

Program   TCL      ECL    NEC   TCLE   Time (s)   Mem (MB)
Optimal   110875   5957   63    7568   97         91
IGAUNA    141853   5928   6?    7431   130        90
GAUNA     126716   5922   62    7451   52         85?
LAGAN     123425   5862   61    7351   141        205
AVID      ?        5777   60    6081   60         498?

Program   TCL      ECL     NEC   TCLE    Time (s)   Mem (MB)
Optimal   381350   11237   60    17505   394        2??
IGAUNA    405007   10891   5?    16991   410        280
GAUNA     997121   10865   55    16554   382        271
LAGAN     365457   104?9   56    16284   659        761
AVID      231496   1082    7     1929    238        1307

8.3.3 Quality Of Alignments

In order to better compare IGAUNA results with the other tools mentioned, it will be helpful to consider the graphs shown in Figures 8.2 to 8.5.

As we can see in Figures 8.2 and 8.3, IGAUNA performs quite well when it comes to exon coverage. When considering exon coverage (only considering exons that have been covered more than 50%) and also total exon coverage (considering all the conserved regions that fall inside exons), IGAUNA performs better than all the other tools, except for the Human-Dog case. We consider this case in detail:

Surprisingly, LAGAN scored even higher than our optimal alignment, and we investigated explanations for this "incident". After aligning the Human-Dog sequences, we observed that LAGAN is not using the complete sequences in its final alignment: we extracted the original sequences from the aligned sequences and found that the lengths of the sequences used by LAGAN were [illegible] for sequence one (as opposed to [illegible] of the original) and 5033645 for sequence two (as opposed to 6424515 of the original). In essence, LAGAN is "throwing away" parts of the sequence that could not be aligned properly (i.e. 1% of the first sequence and 16% of the second sequence). This results in a higher density of conserved regions (either in total or just in the exons, depending on where the thrown-away segments have been) and therefore LAGAN will score high when measuring the conserved regions. That explains the huge difference in our Exon-Coverage-Length and Total-Conserved-Length-in-Exon estimates between LAGAN and even the optimal alignment.

[Figure: Memory Usage, by species pair]

Figure 8.2: Exon Coverage Length for Mouse-Dog and Mouse-Chicken

Figure 8.3: Exon Coverage Length for Human-Rat, Human-Mouse, Human-Chicken, Human-Dog

[Figure: Total Exon Coverage]

[Figure: Total Exon Coverage for Human-Rat, Human-Mouse, Human-Chicken, Human-Dog]

8.3.4 IGAUNA Improvements Compared To GAUNA


[...] to see how IGAUNA has improved, we base our comparisons on GAUNA, and comparing to the optimal alignment, we see how much room there is for improvement. Then we see how [...]

[Figure: Total Exon Coverage Length Improvement (Compared to GAUNA)]

We can see that IGAUNA has improved about 50% compared to GAUNA, meaning that it has covered about 50% of the potentially coverable regions not previously covered by GAUNA. The only exception is the Mouse-Chicken alignment. Although IGAUNA has improved in this case as well, it seems that there is still much more room for improvement (and all the other tools are falling behind in this case as well).
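One way to read the improvement figure above is as the share of the gap between GAUNA and the optimal alignment that IGAUNA closes. The helper below encodes that reading; the formula is inferred from the phrase "potentially coverable regions not previously covered by GAUNA" rather than quoted from the thesis, and the numbers in the usage note are made up.

```python
def relative_improvement(gauna: float, igauna: float, optimal: float) -> float:
    """Fraction of the GAUNA-to-optimal gap that IGAUNA closes (assumed definition)."""
    room = optimal - gauna  # coverage still available beyond GAUNA
    if room <= 0:
        raise ValueError("optimal coverage must exceed GAUNA coverage")
    return (igauna - gauna) / room
```

For instance, with hypothetical coverage lengths of 1000 for GAUNA, 1100 for IGAUNA and 1200 for the optimal alignment, the function returns 0.5, i.e. half of the remaining coverable length has been recovered.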

In terms of total conserved regions, as mentioned earlier, we cannot use the optimal alignment for comparison, because it biases exons too much and that might interfere with the alignment of other potentially good regions. Therefore we compare IGAUNA only to GAUNA and, as we can see from Figures 8.7 and 8.8 and the tables in Section 8.3, IGAUNA has improved from 5 to 12 percent. The only exception is the Human-Rat alignment, where we see only about 3% improvement (which is the longest sequence in our test set).


[Figure: Total Conserved Region (species pairs: Mouse-Dog, Mouse-Chicken, Human-Rat, Human-Mouse, Human-Chicken, Human-Dog)]


8.3.5 Summary of Results

8.4 Conclusion and Future Work

[...] it is efficient both in time and space. Depending on whether speed or quality needs to be optimized, IGAUNA's parameters can be flexibly set to suit each individual case. However, there is still much room to incorporate more biological heuristics into it, in order to achieve better results. In the same way that GAUNA is capable of performing multiple sequence alignment, we can further extend the capabilities of IGAUNA to be used on multiple [...]


[...] room for refining the definition and usage of an optimal alignment, which requires a better understanding of biological sequences and scoring methods. There is also a potential to incorporate parallelism into IGAUNA by dividing the query string and feeding each part into a separate processor. We are also working on developing a user-friendly interface for IGAUNA and making it available as a stand-alone program and also as a web application.


[9] C. Burge and S. Karlin. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268:78-94, 1997.


[14] Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press, [1997] (1999).

[16] Gibson T., Thompson J.D., Higgins D.G., Gibson T.J., Higgins D., Thompson J. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22:4673-4680, 1994.

[24] Gerald Karp. Cell and molecular biology: concepts and experiments. Chichester: John Wiley, 2005.


[32] Gregory M. Cooper, Michael F. Kim, Eugene Davydov, NISC Comparative Sequencing Program, Eric D. Green, Arend Sidow, Michael Brudno, Chuong B. Do and Serafim Batzoglou. Lagan and multi-lagan: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Research, 13(4):721-731, 2003.

[34] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443-453, 1970.

[?] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proc. Natl. Academy Science, 85:2444-48, 1988.

[35] J.P. Mesirov, B. Berger, S. Batzoglou, L. Pachter and E.S. Lander. Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Research, July 1 2000.