
Repetitive DNA sequence assembly

by

Yongqing Jiang

Bachelor of IT (Honours)

Submitted in fulfilment of the requirements for the degree of

Doctor of Philosophy

Deakin University

November, 2017



Acknowledgements

I would like to thank my supervisor, Professor Wanlei Zhou, for his continuing guidance, encouragement and support. I would like to show my deepest respect for his personality, and I am sincerely grateful for his patience during my candidature. I would like to thank my co-supervisor, Dr. Jingyu Hou. I could not be here without your guidance and encouragement during my honours and PhD candidature. I have learnt how to conduct experiments, write papers and much more from you.

I would like to thank my colleagues and friends who stayed by my side during my ups and downs: Ashish Saini, Chao Chen, Sheng Wen, Longxiang Gao, Yu Wang, Tianqing Zhu, Xiao Chen, Derui Wang, Jiaojiao Jiang, Tianrui Zong, Youyang Qu, Lichuan Ma and many more.

I am extremely grateful to Deakin University for offering me the DUPR international scholarship, which gave me the opportunity to start my PhD without financial worries.

Last but not least, I would like to thank my parents for their endless love, care and unconditional support. You share my happiness and sadness anytime, anywhere. I owe my deepest gratitude to you.


Repetitive DNA sequence assembly

Mr. Yongqing Jiang

Submitted for the degree of Doctor of Philosophy

November 2017

Abstract

With the advent of next generation sequencing, geneticists are able to obtain high-throughput sequence data in a timely manner. The length of these sequence reads typically ranges from 20 base pairs to 150 base pairs. These short reads cause problems for existing assembly methods, which performed well on long reads from first generation sequencing. We reviewed the performance of existing assembly methods on NGS data. The results show: 1) most tandem repetitive DNA sequences are poorly assembled; 2) DNA transposons are either partially assembled or incorrectly assembled; 3) most retrotransposons are not assembled. However, eukaryotic genomes contain large intronic and intergenic regions in which repetitive sequences are abundant. These repetitive sequences present challenges in genome assembly from short reads and are often excluded from analysis, losing invaluable genomic information.

This thesis proposes novel computational methods to assemble repetitive DNA sequences using next generation sequencing (NGS) data. Previous methods focus on whole-genome assembly, so they treat repetitive and non-repetitive regions in the same way when assembling. Our methods are distinct from previous methods in that they are dedicated to repetitive sequence assembly. Specifically, our methods address three categories of repetitive sequence: 1) tandem repetitive sequences; 2) DNA transposons; and 3) retrotransposons. Our work consists of three main parts. First, we propose a series of novel methods to assemble repetitive DNA sequences. Second, we measure the performance of our methods by comparing them with previous methods and with the reference genome. Experiments on real datasets show that our proposed methods outperform other methods and are more accurate when compared with the reference genome. Third, we carry out further analysis of our assembled results. The results show: 1) variations exist in our assembled sequences; 2) repetitive sequences are abundant in nucleolus-associated domains; 3) retrotransposons can be correctly assembled but are still hard to assemble completely because of their wide distribution.

Keywords: NGS, Sequence assembly, Repetitive DNA sequence, Transposon


Table of Contents

Acknowledgements

Abstract

List of Figures

List of Tables

Acronyms

1 Introduction

1.2 Background

1.3 Motivation and Research issues

1.4 Overview of the work and its contribution

1.5 Thesis organization

2 Literature Review

2.1 Review of sequencing technology

2.1.1 Description of main sequencing technologies

2.1.2 Technical comparison of sequencing technologies

2.2 Review of data in sequence assembly

2.2.1 Types of reads

2.2.2 Repeat reference

2.3 Review of assembly methods

2.3.1 Assembly Methods overview

2.3.2 Overlap-Layout-Consensus Based methods

2.3.3 De Bruijn Graph based methods

2.3.4 Measurement

2.3.5 Issues of Assembly methods

3 Innovative method of tandem repetitive sequence assembly

3.1 Introduction

3.2 Methods

3.2.1 Tandem repeat definition

3.2.2 Algorithm overview

3.2.3 Merge of paired-end reads

3.2.4 Candidate reads selection

3.2.5 Contig generation method

3.3 Result

3.3.1 Contig generation

3.3.2 Correctness of assembly

3.3.3 Contig distribution and SNP verification

3.3.4 Discussion

3.4 Summary

4 Innovative assembly method of DNA transposon

4.1 Introduction

4.2 Method

4.2.1 Overview and basic terms

4.2.2 Seed selection

4.2.3 Computing sequence matching score

4.2.4 Extension algorithm

4.3 Results

4.3.1 Parameter setting

4.3.2 Performance comparison between ABySS2, Velvet, and SOAPdenovo

4.3.3 Distribution and variation

4.4 Summary

5 Innovative assembly method of retro transposon

5.1 Introduction

5.2 Method

5.2.1 Overview and basic terms

5.2.2 De Bruijn graph generation

5.2.3 Simple path algorithm

5.2.4 Select LTR feature seeds

5.2.5 Select Non-LTR feature seeds

5.2.6 Gap filling method

5.3 Result

5.3.1 Parameter setting

SOAPdenovo

5.3.3 Assemble nucleolus associated domain using RA method

5.4 Summary

6 Conclusion and Future work

6.1 Contributions

6.2 Future works

Bibliography


List of Figures

Figure 1.1: Repetitive DNA sequence makes gap.

Figure 1.2: Repetitive DNA sequence in de Bruijn graph based methods.

Figure 1.3: Repetitive DNA sequence makes issue of losing information

Figure 2.1: Generation of single end reads.

Figure 2.2: Generation of paired end reads.

Figure 2.3: Generation of mate pair reads.

Figure 2.4: Illustration of OLC based method.

Figure 2.5: Demonstration of de Bruijn graph generation.

Figure 3.1: Tandem repeats in human chromosome 14.

Figure 3.2: Workflow of tandem repeat assembler.

Figure 3.3: Correctness comparison of contigs assembled by different methods for tandem repeats.

Figure 3.4: SNP verification and potential new SNP loci.

Figure 3.5: The distribution of correct contigs.

Figure 3.6: Influence of reads quality on assembly.

Figure 4.1: Types of transposable elements and their structures.

Figure 4.2: Overview of DNA transposon assembly method.

Figure 4.3: Dot matrix of two sequences.

Figure 4.4: Dot matrix of two sequences which have DNA transposon features.

Figure 4.5: Encoding scheme of different sequencing technologies.

Figure 4.6: Sequence extension demonstration.

Figure 4.7: Assembly performance on different

Figure 4.8: Assembly performance on different .

Figure 4.9: Assembly performance on different .

Figure 4.10: Assembly performance on different .

Figure 4.11: Assembly performance on different .

Figure 4.12: Assembly performance on different .

Figure 4.13: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.

Figure 4.14: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.

Figure 4.15: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.

Figure 4.16: The precision-recall curves of DTA, ABySS2, Velvet, and SOAPdenovo method on Maize chromosome 8.

method on Maize chromosome 8.

on Maize chromosome 8.

Figure 4.19: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Maize chromosome 8.

Figure 4.20: Assembly result of MamRep434 mapping to reference.

Figure 4.21: Assembly result of 28 DNA transposons mapping to reference.

Figure 4.22: Assembly result of 103 DNA transposons mapping to reference.

Figure 4.23: Assembly result of 163 DNA transposons mapping to reference.

Figure 4.24: Frequency of indel lengths and variation of coding sequence lengths in chromosome 14.

Figure 5.1: Overview of retrotransposon assembly method

Figure 5.2: De Bruijn graph with k = 24 of sequence ATCGAACGTTTTACCAGCGATTCA.

Figure 5.3: Demonstration of simple path generation process.

Figure 5.4: Dot matrix of two sequences which have LTR features.

Figure 5.5: Dot matrix of two sequences which have Non-LTR features.

Figure 5.6: Demonstration of contig generation of retrotransposon

Figure 5.7: Performance of retrotransposon assembly method on different .

Figure 5.9: Performance of retrotransposon assembly method on different .

Figure 5.10: Performance of retrotransposon assembly method on different

Figure 5.12: Performance of retrotransposon assembly method on different .

Figure 5.17: Precision comparison of RA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.

Figure 5.18: Recall comparison of RA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.

Figure 5.19: F-value comparison of RA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.

Figure 5.20: Assembly result of PABL_A (LTR type retrotransposon) mapping to reference.

Figure 5.21: Assembly result of AluSx (SINE, non-LTR type retrotransposon) mapping to reference.

Figure 5.22: Assembly result of L1PB (LINE, non-LTR type retrotransposon) mapping to reference.

Figure 5.23: Simple path of nucleolus-associated domains.

Figure 5.24: NAD gene distribution.


List of Tables

Table 2.1: Description of main sequencing technologies.

Table 2.2: Comparison of sequencing technologies

Table 2.3: Description of repetitive DNA sequences.

Table 2.4: Databases of repetitive DNA sequence.

Table 2.5: Overview of assembly methods.

Table 2.6: Measurement of assembly methods [95].

Table 3.1: Assembly of tandem repeats in human Chromosome 14.

Table 3.2: Summary of contigs assembled by different assemblers for tandem repeats.

Table 3.3: Summary of correct assembly of assemblers for tandem repeats.

Table 4.1: Dot matrix based alignment method.

Table 4.2: DNA transposon feature finding method.

Table 4.3: The meaning of Phred quality score

Table 4.4: Parameter setting on .

Table 4.5: Parameter setting on .

Table 4.6: Parameter setting on .

Table 4.7: Parameter setting on .

Table 4.8: Parameter setting on .

Table 4.9: Parameter setting on .

Table 4.10: Summary of DNA transposons assembled by different assemblers

Table 4.11: Deletion and insertion of different length in chromosome 14.

Table 4.12: Highly diverged regions of different length in chromosome 14.

Table 5.1: Simple path generation method.

Table 5.2: Long terminal repeat feature finding method

Table 5.3: Simple repeat finding method

Table 5.4: Non-LTR feature finding method

Table 5.5: Contig generation method

Table 5.8: Parameter setting on .

Table 5.12: Parameter setting on .

Table 5.16: Statistics of assembled NAD in Human genome

Table 5.17: No of gene in NAD

Table 5.18: Genes that have functions only in NAD.

Table 5.19: 10 gene functions only associated with NAD

Table 5.20: 10 gene functions not associated with NAD


Acronyms

BP Base Pair

DR Directed Repeat

DTA DNA Transposon Assembler

Env Envelope

Gag Group-specific Antigen

LINE Long Interspersed Nuclear Element

LTR Long Terminal Repeat

NAD Nucleolus Associated Domain

NGS Next Generation Sequencing

ORF Open Reading Frame

Pol Polymerase

Prt Protease

RA Retrotransposon Assembler

SINE Short Interspersed Nuclear Element

SNP Single Nucleotide Polymorphism

TIR Terminal Inverted Repeat

TE Transposable Element

TRA Tandem Repeat Assembler

TSD Target Site Duplication

UTR Untranslated Region


Chapter 1

1 Introduction

1.2 Background

DNA (deoxyribonucleic acid) encodes the genetic information of living organisms. It consists of four basic nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). Determining the sequence of DNA in an organism is an essential step for downstream biological analysis, such as finding pathogenic genes. Sequencing technology reads the sequence of a given DNA molecule. The first generation of sequencing technology started in 1977 with the invention of Sanger’s method [1]. This method, combined with whole genome shotgun sequencing, was the main method by which the first human genome was sequenced in 2003. However, it took 3-4 years and 300 million dollars to sequence the 3 billion base pairs of the human genome with a read length of ~900 base pairs [2]. The obvious problem here is how to assemble these fragmented reads back into a complete DNA sequence.

Sequence assembly methods were introduced to address this issue. With massively parallel sequencing emerging from 2004, represented by Roche 454, Illumina GA and ABI SOLiD, read throughput has exceeded that of Sanger’s method by several orders of magnitude. We call these sequencing technologies next generation sequencing, because they sequence faster and produce reads at lower cost than Sanger’s method. With next generation sequencing (NGS), a human genome can be sequenced in a week. Meanwhile, NGS generates ~300 million reads of even shorter length, ~100 base pairs. This creates a big data issue and, most importantly, raises problems in assembling repetitive DNA sequences.
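As a rough check on the scale of the data, the approximate read count and read length quoted above can be converted into an average sequencing depth. The sketch below is illustrative arithmetic only: the function name average_depth is an assumption made for this example, and the inputs are the approximate figures given in the text.

```python
def average_depth(num_reads: float, read_length: float, genome_size: float) -> float:
    """Average number of reads covering each base: total bases read / genome size."""
    return num_reads * read_length / genome_size

# Approximate figures from the text: ~300 million reads of ~100 base pairs,
# for a human genome of about 3 billion base pairs.
depth = average_depth(300e6, 100, 3e9)
print(depth)
```

Under these figures every base is read about ten times on average, so an assembler has to process roughly ten times more bases than the genome itself contains.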

Repetitive DNA sequence is a natural feature of the genome in most organisms [3]. It makes up a large proportion of the DNA sequence. For example, it comprises 50% of the human genome, 37% of the mouse genome, up to 80% of the maize genome [4], and even the whole genome of Arabidopsis thaliana [5]. These repeat sequences range in size from 1 base to millions of bases, and differ in function, from simple tandem repeats to transposable elements [6]. In this thesis, we are concerned only with repetitive DNA sequences that are longer than the read length.

In terms of pattern, a repetitive DNA sequence consists either of a pattern repeated many times in adjacent copies within a DNA sequence, or of the same pattern spread over multiple places across the whole genome. However, NGS generates short reads. This feature of NGS data challenges sequence assemblers, because a short read contains too little information to assemble repetitive regions. Sequence assembly methods are based on the overlap information between reads. Thus, limited overlap information leads to a poor assembly result with gaps and errors, especially in regions of repetitive DNA sequence.
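To make the role of overlap information concrete, the following minimal sketch lists every suffix-prefix overlap between two reads. It is a toy illustration, not one of the assembly methods discussed in this thesis; the function name all_overlaps, the example reads and the minimum overlap of 3 bases are assumptions chosen for the example.

```python
def all_overlaps(a: str, b: str, min_len: int = 3) -> list:
    """Lengths n (>= min_len) for which the last n bases of a equal the first n bases of b."""
    return [n for n in range(min_len, min(len(a), len(b)) + 1) if a.endswith(b[:n])]

# Two reads from a non-repetitive region overlap in exactly one way.
unique_overlaps = all_overlaps("TTCTGAGG", "GAGGCTGA")

# Two reads taken from inside a tandem repeat of the pattern CTGA
# overlap in more than one way, so their relative position is ambiguous.
repeat_overlaps = all_overlaps("GACTGACTGA", "CTGACTGACT")

print(unique_overlaps, repeat_overlaps)
```

With reads from a repeat, several overlap lengths are equally valid, which is the kind of ambiguity that leads to gaps and errors when reads are short.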

1.3 Motivation and Research issues

To date, the human reference genome has reached its 37th version. However, there are still gaps in this reference [7]. This means we still cannot assemble a complete genome sequence. From the sequence assembly point of view, repetitive DNA sequence puts assembly algorithms in a dilemma. For NGS reads in particular, the short read length cannot cover a complete long repeat sequence. What makes it worse is that transposable elements are widely distributed in the genome, which makes their locations ambiguous. We list the issues that repetitive DNA sequence poses for assembly methods as follows:


Repetitive DNA sequence makes gaps in the assembly result. Repeats are usually ignored in existing assembly methods. The reason is that most repetitive DNA sequences are longer than a sequence read. To date, sequencing technology can only read 50-150 base pairs at a time, and this length cannot cover most repetitive DNA sequences, whose lengths range from 200 to 500 base pairs [5]. This means that we can hardly assemble repetitive DNA sequences. In de novo assembly methods especially, we have to assemble sequence reads without a reference sequence. In this case, we have no knowledge of which repeats are in the genome sequence. To avoid mistakes in assembling repetitive regions, sequence reads that are potentially related to repetitive regions are masked. As a result, most repetitive regions are not assembled. As illustrated in Figure 1.1, a region with high-frequency reads is masked. Methods will filter out these sequence reads, because this region is regarded as a potential repetitive region. Consequently, an assembly result with gaps will be generated.

Figure 1.1: Repetitive DNA sequence makes gap. (Diagram labels: original sequence, sequence reads, masked region, assemble, assembly result, gap.)
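The masking effect described above can be reproduced on a toy sequence. The sketch below is an illustration under simplifying assumptions, not the procedure of any particular assembler: reads are error-free substrings taken at every position, a read is treated as high-frequency when it occurs more than once, and the names simulate_reads and assemble_with_masking, the example genome and the read length of 4 are all hypothetical.

```python
from collections import Counter

def simulate_reads(seq: str, read_len: int) -> list:
    """Error-free toy reads: one substring of length read_len starting at every position."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]

def assemble_with_masking(seq: str, read_len: int, max_count: int = 1) -> str:
    """Discard reads seen more than max_count times; print uncovered bases as '-'."""
    reads = simulate_reads(seq, read_len)
    counts = Counter(reads)
    covered = [False] * len(seq)
    for start, read in enumerate(reads):
        if counts[read] <= max_count:  # high-frequency reads are masked
            for pos in range(start, start + read_len):
                covered[pos] = True
    return "".join(base if ok else "-" for base, ok in zip(seq, covered))

repeat = "CTGACCATGG"  # hypothetical repeat that occurs twice in the toy genome
genome = "AAGTC" + repeat + "TTCGA" + repeat + "GGATC"
result = assemble_with_masking(genome, read_len=4)
print(genome)
print(result)
```

Every read that lies entirely inside a repeat copy occurs at least twice, so it is filtered out and the interior of each copy is left as a gap, while the unique flanking regions remain covered.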

Repetitive DNA sequence makes the assembly of the whole genome sequence shorter than the reference genome. This was reported by [8]: recent assemblies using NGS data were much more fragmented than references assembled years ago. The reason is that most repetitive DNA sequences are longer than a sequence read and current assembly methods are limited in tackling this issue. For example, de Bruijn graph based methods break a sequence read down into k-mers, which are substrings of the sequence read of length k, and use these k-mers to generate a de Bruijn graph in which k-mers are connected by k-1 overlapping bases. Based on a path through the de Bruijn graph, we can deduce the DNA sequence. However, a repetitive DNA sequence produces branches or loops in the de Bruijn graph. This forces assemblers to adopt strategies for choosing a path in the graph. As illustrated in Figure 1.2, a repetitive sequence is broken down into k-mers and then constructed into a de Bruijn graph. As we can see, it is hard to find a path in this graph that reassembles the original sequence, because there are loops and choices of direction. In Figure 1.2, the sequence TTCTGAGGCTGACCCTGAAA is constructed into a de Bruijn graph with 3-mers. This sequence contains a repeat whose pattern is CTGA. For de Bruijn graph based assembly methods, it is difficult to select the correct path. An erroneous path selection will assemble a shorter sequence, for example TTCTGAAA. For a large genome, repetitive DNA sequence makes the de Bruijn graph even more complicated.

Figure 1.2: Repetitive DNA sequence in de Bruijn graph based methods. (Graph nodes: TTC, TCT, CTG, TGA, GAA, AAA, GCT, GAG, CCT, GAC, CCC, GGC, ACC, AGG. Original repetitive sequence: TTCTGAGGCTGACCCTGAAA.)
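The example of Figure 1.2 can be reproduced with a few lines of code. The sketch below is a minimal illustration rather than the implementation of any existing assembler: following the figure, it treats each 3-mer as a node and links consecutive 3-mers that overlap by two bases, and the function names de_bruijn and spell are assumptions made for this example.

```python
from collections import defaultdict

def de_bruijn(seq: str, k: int) -> dict:
    """Map each k-mer to the set of k-mers that follow it in seq (overlap of k-1 bases)."""
    graph = defaultdict(set)
    for i in range(len(seq) - k):
        graph[seq[i:i + k]].add(seq[i + 1:i + 1 + k])
    return graph

def spell(path: list) -> str:
    """Sequence spelled by a path: the first k-mer plus the last base of each later k-mer."""
    return path[0] + "".join(node[-1] for node in path[1:])

graph = de_bruijn("TTCTGAGGCTGACCCTGAAA", 3)

# Nodes with more than one successor are the points where a path must be chosen.
branching = {node: sorted(nxt) for node, nxt in graph.items() if len(nxt) > 1}
print(branching)

# Taking the GAA branch at the first visit to TGA skips both loops of the repeat.
short_path = ["TTC", "TCT", "CTG", "TGA", "GAA", "AAA"]
print(spell(short_path))
```

The only branching node is TGA, the end of the repeated pattern CTGA. Every step of short_path is a valid edge of the graph, yet the path spells the 8-base sequence TTCTGAAA instead of the original 20 bases.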

Repetitive DNA sequence makes the assembly result erroneous in resequencing projects. A resequencing project assembles a DNA sequence for which a reference sequence is available. This means, for example, that when we assemble the repetitive part of the DNA sequence, we can refer to the reference sequence and “copy and paste” the repetitive part into the assembly result. In this way, the final assembly result will not differ much in length from the reference sequence. However, variation information, such as SNPs (Single Nucleotide Polymorphisms), is lost in this procedure. As illustrated in Figure 1.3, when there are loops in the de Bruijn graph, we can search an available repeat database and then “copy and paste” a suitable repetitive DNA sequence into the assembly result. This compensates for the short assembly, but it generates a false result that carries no real variation information.

Figure 1.3: Repetitive DNA sequence makes issue of losing information (Diagram: the de Bruijn graph of the original repetitive sequence TTCTGAGGCTGACCCTGAAA, with its left flank TTC and right flank, is used to search the repeat reference database entry …CTGAGGCTGACCCTCA...; the assembly result of the repetitive sequence is TTCTGAGGCTGACCCTCAAA, with the SNP position marked.)
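The information lost by this “copy and paste” step can be seen by comparing the two sequences shown in Figure 1.3 base by base. The helper below is only an illustrative sketch; the function name mismatches is an assumption made for this example.

```python
def mismatches(a: str, b: str) -> list:
    """Positions (0-based) and bases at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

original = "TTCTGAGGCTGACCCTGAAA"  # sequence actually present in the sample
pasted = "TTCTGAGGCTGACCCTCAAA"    # result after copying the repeat from the database
differences = mismatches(original, pasted)
print(differences)
```

The two sequences have the same length and differ at a single position, where the G of the sample has been replaced by the C of the database entry. A length comparison alone would therefore not reveal that the variant was overwritten.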

Assembly of repetitive DNA sequence cannot be fully solved by paired end reads. Paired end reads are a pair of reads with a known distance between them. Each pair consists of a head read and a tail read. Some assembly methods use this structural feature to deduce repetitive DNA sequence. Based on the known distance between the head read and the tail read, if one read is located in an unambiguous region and the other read is located in a repeat region, then some nucleotides in the repetitive region can be deduced. If enough paired end reads cover a repetitive region, theoretically this region can be correctly assembled. However, many regions in a genome share significant similarity. In this case, assembly methods cannot tell where the paired end reads actually come from. For example, consider two repetitive regions that are significantly similar but in different locations. It is hard for an assembly method to tell how many regions should be assembled, because the paired reads seem to come from one region. In fact, transposable elements are widely distributed in the genome and are an important part of the genome sequence; for example, transposable elements make up about 85% of the maize genome.
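The difference between an anchored pair and an ambiguous pair can be illustrated on a toy genome that contains two identical copies of a repeat. This is a simplified sketch under stated assumptions: reads are exact substrings, the distance is measured between the start positions of the head read and the tail read, and the names find_all and place_pair as well as the example sequences are hypothetical.

```python
def find_all(text: str, pattern: str) -> list:
    """Start positions of every (possibly overlapping) occurrence of pattern in text."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

def place_pair(genome: str, head: str, tail: str, distance: int) -> list:
    """Head start positions for which the tail starts exactly `distance` bases later."""
    tail_starts = set(find_all(genome, tail))
    return [h for h in find_all(genome, head) if h + distance in tail_starts]

repeat = "CTGACCATGG"  # hypothetical repeat with two identical copies
genome = "AAGTC" + repeat + "TTCGA" + repeat + "GGATC"

# Head read in a unique flank, tail read inside the repeat: one consistent placement.
anchored = place_pair(genome, "AAGTC", "CCATG", 9)

# Both reads inside the repeat: the pair fits equally well in either copy.
ambiguous = place_pair(genome, "CTGAC", "CATGG", 5)

print(anchored, ambiguous)
```

The anchored pair has a single placement, so the bases under its tail read can be assigned to the first copy. The second pair is consistent with both copies, and nothing in the pair itself indicates whether the genome contains one such region or two.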

According to the issues listed above, there is a trade-off between correct assembly and complete assembly. If a method is to assemble the complete sequence of a genome, it must assemble repetitive regions. However, in existing models, assembling repetitive regions always comes with errors. Most existing assembly methods are oriented towards genome-wide assembly. In other words, they try to correctly assemble as much of the genome sequence as possible, so they bypass the issues of assembling repetitive regions. In this thesis, we are motivated to tackle the issues of assembling repetitive DNA sequences.

There are different types of repeat sequence in a genome, and each type poses different problems for assembly methods. Based on the issues discussed above, there are a number of important research questions that need to be addressed for the different types of repeat sequence. We list our research questions as follows:

How to correctly assemble tandem repetitive DNA sequences? A tandem repeat is a sequence in which a pattern is duplicated several times. The repeat pattern can be longer than a sequence read, so one sub-question is how to determine the number of times a pattern is duplicated. We studied reference sequences of assembled tandem repeats. They show that a tandem repetitive sequence does not necessarily duplicate its pattern an integer number of times. In other words, it may stop repeating at any position, including the middle of the pattern. Moreover, there can be variations, mismatches and indels between copies of the pattern. To correctly assemble a tandem repeat, we should also assemble the flanking sequences at both ends. So another sub-question is how to recover the flanking sequences of a tandem repeat.
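The observation that a tandem repeat may stop in the middle of its pattern can be illustrated with a minimal sketch. The pattern and lengths below are hypothetical and are not taken from the reference data; the sketch only shows why the copy number need not be an integer.

```python
def build_tandem_repeat(pattern: str, length: int) -> str:
    """Repeat `pattern` head to tail and cut the result at `length` bases."""
    copies_needed = length // len(pattern) + 1
    return (pattern * copies_needed)[:length]


def copy_number(pattern: str, repeat: str) -> float:
    """Copy number as a fraction; a non-integer value means a partial last copy."""
    return len(repeat) / len(pattern)


pattern = "ACGTT"                        # hypothetical 5 bp repeat unit
repeat = build_tandem_repeat(pattern, 13)
print(repeat)                            # ACGTTACGTTACG
print(copy_number(pattern, repeat))      # 2.6
```

Here the repeat ends after the third base of the third copy, so an assembler that assumes whole copies would report either 2 or 3 copies and miss the true boundary.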

How to correctly assemble DNA transposons? The DNA transposon is one class of transposable element. It follows a “cut-and-paste” mechanism to move within the genome. The sequence of a DNA transposon has a distinctive structural feature: a pair of inverted repeats, one at each end. One sub-question is how to identify the sequence reads that are related to a DNA transposon. Even when these reads have been identified, low quality reads remain a problem, because they lead to poor assembly results. So another sub-question is how to filter out low quality reads and assemble the DNA transposon properly.
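The inverted-repeat feature described above can be checked with a few lines of code. This is only a sketch with an exact-match test on a made-up element; the function names are hypothetical and real terminal inverted repeats usually tolerate some mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def has_terminal_inverted_repeats(seq: str, tir_len: int) -> bool:
    """True if the last `tir_len` bases are the reverse complement of the first ones."""
    if tir_len <= 0 or 2 * tir_len > len(seq):
        return False
    return seq[-tir_len:] == reverse_complement(seq[:tir_len])


head = "CAGGGT"                                   # hypothetical head repeat
element = head + "AAAACCCCTTTT" + reverse_complement(head)
print(element)                                    # CAGGGTAAAACCCCTTTTACCCTG
print(has_terminal_inverted_repeats(element, 6))  # True
```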

How to correctly assemble retrotransposons? The retrotransposon is another class of transposable element. There are two types, i.e. LTR retrotransposons and non-LTR retrotransposons. A retrotransposon follows a “copy-and-paste” mechanism to move within the genome, and it can have hundreds of copies widely distributed in a genome. So one sub-question is how to assemble these highly similar copies in different locations. The two types of retrotransposon differ in their structural features. Specifically, a non-LTR retrotransposon has no long terminal repeat. So another sub-question is how to use structural features to assemble both types of retrotransposon.

1.4 Overview of the work and its contributions

To address the issues and research questions listed above, we propose novel methods of assembling repetitive DNA sequences in this thesis.

Firstly, to address the problem of assembling tandem repetitive sequences, we propose the tandem repeat assembler (TRA), a novel method of assembling tandem repeats.

We pre-process the paired end reads which are possible to merge into a long read

from and . From all processed reads, we select out the candidate

reads which are related to tandem repeats. A greedy algorithm is then applied to the candidate reads to obtain a draft assembly. All candidate reads are aligned to the draft assembly, and a majority vote is applied to update the result. This update is repeated until the result no longer changes. Using an experimentally acquired data set for human chromosome 14, tandem repeats >200 bp were assembled. Alignment of the contigs to the human genome reference (GRCh38) revealed that 84.3% of tandem repetitive regions were correctly covered. For tandem repeats, this method outperformed state-of-the-art assemblers, generating a correct contig N50 of up to 512 bp.
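The majority vote update can be sketched as follows. This is not the TRA implementation: it assumes, for simplicity, that every candidate read has already been placed on the draft at a known offset without indels, and all names and sequences are hypothetical.

```python
from collections import Counter


def majority_vote(draft: str, placed_reads: list[tuple[int, str]]) -> str:
    """Replace each draft base by the base most reads support at that column."""
    columns = [Counter() for _ in draft]
    for offset, read in placed_reads:
        for i, base in enumerate(read):
            pos = offset + i
            if 0 <= pos < len(draft):
                columns[pos][base] += 1
    result = []
    for draft_base, counts in zip(draft, columns):
        if not counts:
            result.append(draft_base)          # no read covers this column
            continue
        best, best_n = counts.most_common(1)[0]
        # Keep the draft base when it is tied with the best-supported base.
        result.append(draft_base if counts[draft_base] == best_n else best)
    return "".join(result)


def refine(draft: str, placed_reads: list[tuple[int, str]], max_rounds: int = 10) -> str:
    """Repeat the vote until the result no longer changes."""
    for _ in range(max_rounds):
        updated = majority_vote(draft, placed_reads)
        if updated == draft:
            break
        draft = updated
    return draft


placed = [(0, "ACGTAC"), (2, "GTACGT"), (3, "TACGT")]   # (offset, read)
result = refine("ACGTACGA", placed)
print(result)                                            # ACGTACGT
```

With fixed placements the loop converges after one update; in a full method the reads would be re-aligned to the updated draft in every round.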

Secondly, to address the problem of assembling DNA transposons, we propose DTA, a novel method of assembling DNA transposons. We studied the structure of DNA transposons and propose a novel method of selecting pairs of head and tail seeds using a dot matrix. We then propose a quality based extension method to connect the head seed to the tail seed. We evaluate our method on two species, i.e. human (chromosome 14) and maize (chromosome 8). In terms of precision, recall and F-value, our method is superior to previous methods. We also investigate the distribution of our assembled DNA transposons.
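For reference, the three evaluation measures named above can be computed as in the sketch below. It assumes the usual definitions, with the F-value taken as the harmonic mean of precision and recall; the counts are invented for illustration and are not results from this thesis.

```python
def precision_recall_f(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F-value) from raw counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


# Hypothetical counts: 8 correct elements, 2 spurious ones, 8 missed ones.
precision, recall, f_value = precision_recall_f(8, 2, 8)
print(precision, recall, round(f_value, 3))
```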

Thirdly, to address the problem of assembling retrotransposons, we propose RA, a novel method of assembling retrotransposons. Based on their structural features, we propose two novel methods to select pairs of seeds for LTR retrotransposons and non-LTR retrotransposons respectively. We also propose a recursive method to fill the gap between the seeds. We evaluate our method on a benchmark dataset, and the results show that it outperforms state-of-the-art methods. In addition, we use RA to assemble a genome-wide dataset which contains reads from nucleolus associated domains of the human genome. We conducted several analyses on the assembled result, which show that RA is able to assemble retrotransposons at genome-wide scale.

1.5 Thesis organization

This section presents the overall organization of this thesis. As the objective of this thesis is to address the problems of assembling repetitive DNA sequences, the content of each chapter is organized as follows:

Chapter 2: This chapter presents a comprehensive survey of sequencing technology, sequencing reads and assembly methods. We review two types of assembly method, i.e. overlap-layout-consensus based methods and de Bruijn graph based methods.

Chapter 3: We propose a method of assembling tandem repetitive sequences. We present how to use paired end reads to correctly assemble tandem repeats. The method follows an iterative process which stops when the assembled result no longer changes.

Chapter 4: We propose a method of assembling DNA transposons. We present the method in two parts, i.e. a dot matrix based DNA transposon feature finding algorithm and a quality based extension algorithm.

Chapter 5: We propose a method of assembling retrotransposons. We present the method in three parts, i.e. an LTR retrotransposon feature finding algorithm, a non-LTR retrotransposon feature finding algorithm and a recursive gap filling algorithm.

Chapter 6: This chapter summarizes the research in this thesis and suggests potential future work.

Chapter 2

2 Literature Review

In this chapter, we review sequencing technology, sequence data and assembly methods. The review of sequencing technology reveals why repetitive DNA sequence causes issues in DNA sequence assembly. Each sequencing technology generates its own type of sequence data, which is referred to as reads. Reads from different technologies differ in length, quality, direction and depth. The evolution of sequence reads drives the development of assembly methods. In this chapter, we first review sequencing technologies, then review the data used in our research, and finally present existing assembly methods.

2.1 Review of sequencing technology

DNA (deoxyribonucleic acid) was identified as the genetic material in 1944, and its double helical structure was discovered in 1953. Features of living beings, such as eye color (phenotype), are determined by specific information in DNA (genotype). To understand the nature of organisms, geneticists are dedicated to revealing this coded information. DNA consists of 4 basic nucleotides: adenine (A), guanine (G), cytosine (C) and thymine (T). Determining the sequence of DNA in an organism is the first step towards downstream biological analysis, such as finding pathogenic genes. Since the first gene, i.e. the bacteriophage MS2 coat protein gene, was sequenced in 1972 [9], biologists have been eagerly developing sequencing technology. To date, next generation sequencing (NGS) technology can sequence a whole human genome in one week.

Sequencing technology provides the order of nucleotides in a DNA sequence. Thanks to Sanger’s method, proposed in 1977 [1], the Homo sapiens genome sequence was released in 2003. Technically, it took 3-4 years and 300 million dollars to sequence the 3 billion base pairs of the human genome with a read length of ~900 bases [2]. Despite the successful application of Sanger’s method, its great cost in time and money limited the development of genetics. With advances in biochemistry incorporated, new methods have been invented since 1977, including whole genome shotgun sequencing [10], hierarchical shotgun sequencing, mixed strategy sequencing, reduced representation sequencing [11], EST sequencing [12], pyrosequencing [13], sequencing-by-synthesis [14], sequencing-by-ligation [15], semiconductor non-optical sequencing [16], single molecule real time sequencing [17] and other technologies still under development.

2.1.1 Description of main sequencing technologies

Relatively long reads and high accuracy allowed Sanger’s method, combined with shotgun sequencing, to dominate sequencing for nearly 20 years until 2004. This technology is regarded as first generation sequencing and still serves as a “gold standard” in genome projects [18]. Massively parallel sequencing technologies emerged from 2004, represented by Roche 454, Illumina GA and ABI SOLiD; their read throughput is greater than that of Sanger’s method by several orders of magnitude [19]. Fast sequencing speed, low cost per base and high read throughput advanced sequencing technology into the next generation. This gives the opportunity to sequence thousands of large genomes of humans, animals and plants [20, 21, 22]. Table 2.1 lists the details of the main sequencing technologies.

From the assembly point of view, these sequencing technologies differ in:

The size of inserts

Whether the sequencing was uni or bi-directional

Library construction

Cloning vector

Selection of clones to be sequenced

Number of DNA molecules sequenced in parallel

Length of read

Quality information for each base

Table 2.1: Description of main sequencing technologies.

Year | Sequencing technology | Description

1977 | Sanger’s method | Sanger’s method is based on the chain termination method. In a sequencing reaction, a primer is combined with a fragment of single-strand DNA. After polymerase is added to the mix, a complementary strand is synthesized from deoxynucleotides (dNTPs) and dideoxynucleotide triphosphates (ddNTPs). When a ddNTP is added to the strand, synthesis stops, and the label on the ddNTP indicates the position where synthesis stopped. After gel electrophoresis and autoradiography, with the fragments run in four lanes, the final order is obtained.

1979 | Shotgun sequencing | Shotgun sequencing randomly shears the amplified DNA into fragments. The average number of reads covering a given nucleotide is predefined, to make sure that all nucleotides in the DNA sequence will be sequenced. After the DNA is sheared into fragments of suitable length, Sanger’s method reads these fragments [23].

2004 | Roche 454’s technology | Roche 454 uses pyrosequencing technology [24]. First, a DNA library with 454-specific adaptors is denatured and combined with beads. After emulsion PCR amplification [25], synthesis begins. A specific dNTP complements the single-strand DNA in the presence of ATP sulfurylase, luciferase, luciferin, polymerase and adenosine 5’ phosphosulfate. During synthesis, pyrophosphate is released and then transformed into ATP. ATP drives the conversion of luciferin into oxyluciferin, which yields visible light [26]. By distinguishing the light signals, the nucleotides can be determined.

2006 | Illumina’s technology | Illumina uses sequencing by synthesis technology [27]. A DNA library with adaptors is attached to the flowcell. Through bridge amplification, the cloned DNA fragments form clusters. After a linearization enzyme is added, four types of nucleotide labelled with fluorescent dye complement the single-strand DNA. The color of the fluorescent dye is then captured by a charge-coupled device to identify the specific nucleotide.

2007 | ABI SOLiD technology | ABI SOLiD uses sequencing by ligation technology [28]. First, a DNA library with adaptors is amplified by PCR and then captured by beads. Afterwards, an 8-base probe moves along the DNA fragment. When the probe is complementary to the fragment, its last 3 nucleotides, which carry the fluorescent dye, are cleaved and the system captures the color of the 4th base. The probe then moves on from the cleavage site and the same routine is repeated. After 5 rounds, the colors of all bases have been captured and the sequence can be deduced.

2009 | PacBio technology | PacBio’s technology binds polymerase to the DNA library. These DNA templates are immobilized in 50 nm-wide wells, which block light from the fluorescent dye but allow energy to penetrate. During synthesis, the energy excites the fluorescent dye on the nucleotide. A specific pulse is captured depending on the type of dye, so that the corresponding base can be deduced.

2011 | Ion Torrent technology | Ion Torrent utilizes the protons released during nucleotide incorporation [29]. First, a DNA library combined with primers immobilized on 3-micron diameter beads is amplified by emulsion PCR. These templates are then put into proton-sensing wells. After sequencing begins, a specific nucleotide is incorporated and protons are released. The signal is subsequently detected to determine the base.
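The predefined coverage mentioned for shotgun sequencing in Table 2.1 is commonly computed as the number of reads times the read length divided by the genome length. The sketch below applies this standard relation; the genome size and target coverage are hypothetical values chosen for illustration.

```python
import math


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of reads covering a base: N * L / G."""
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical example: 3 Mb genome, 900 bp reads, 10-fold coverage.
n = reads_needed(10, 900, 3_000_000)
print(n)                                   # 33334
print(coverage(n, 900, 3_000_000) >= 10)   # True
```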

2.1.2 Technical comparison of sequencing technologies

Next generation sequencing began with Roche 454, which replaces the bacterial cloning step with PCR amplification [30]. Whatever base recognition method is adopted, the analysis processes are massively parallel. These features speed up DNA sequencing beyond what Sanger’s method can achieve. However, different base detection methods have their own inherent limitations. For example, Illumina’s sequencing technology can fail to detect bases near the 3’ and 5’ ends of DNA fragments. In addition, all next generation sequencing technologies generate shorter reads than Sanger’s method. Table 2.2 compares these sequencing technologies in terms of their output.

Table 2.2: Comparison of sequencing technologies

Technology | Read length | Quality | Yield/run | Time/run | Cost/Mb | Paired end

Sanger’s method | 400-900 bp | Mostly >Q35 | 84 Kb | 3 hours | $2400 | Yes

Roche 454 GS FLX | 700 bp | Mostly >Q30 | 0.7 Gb | 24 hours | $10 | Yes

Illumina HiSeq 2000 | 101 bp | Mostly >Q30 | 600 Gb | 3-10 days | $0.07 | Yes

ABI SOLiDv4 | 35-50 bp | Mostly >Q30 | 120 Gb | 7-14 days | $0.1 | Yes

Ion Torrent PGM | 200-400 bp | Mostly Q20 | 1 Gb | 2 hours | $1 | Yes

PacBio RS | Up to 2200 bp | <Q10 | 100 Mb | 2 hours | $2 | No

2.2 Review of data in sequence assembly

From the perspective of repetitive sequence assembly, we are interested in the types of data generated by sequencing technology. We are also interested in the repetitive sequence data that are available for use in sequence assembly. Sequence data are also referred to as sequence reads, which consist of 4 nucleotides (adenine, guanine, cytosine and thymine).

2.2.1 Types of reads

There are three types of reads, i.e. single end reads, paired end reads and mate pair reads. Their generation processes are shown in Figures 2.1-2.3. Depending on the sequencing technology used, these reads differ in whether they are single or paired, in read length and in read orientation.

1. Single end reads: As shown in Figure 2.1, adaptors (A1, A2) are ligated to the fragmented sample DNA. After amplification, clusters of DNA fragments are generated. Nucleotides are read from the primer site SP1. Due to the limitations of the sequencing technology, only the first few nucleotides of a DNA fragment can be sequenced. For example, given a DNA fragment library with an average fragment length of 500 bp, single end sequencing may yield only the first 100 bp of each fragment.

Figure 2.1: Generation of single end reads [31].

2. Paired end reads: Paired end reads are two single end reads coming from the two ends of a DNA fragment, on different strands. As shown in Figure 2.2, adaptors (A1, A2) are ligated to the fragmented sample DNA. After amplification, clusters of DNA fragments are generated. Nucleotides are read from the primer sites SP1 and SP2. The orientation of these two reads is FR (Forward Reverse), where the forward read runs from 5’ to 3’ and the reverse read runs from 3’ to 5’.

Figure 2.2: Generation of paired end reads.

3. Mate pair reads: Mate pair reads are two single end reads coming from the two ends of a DNA fragment, on different strands, but with a longer insert length than paired end reads. In mate pair sequencing, the fragmented sample DNA is end-biotinylated to tag the mate pair section. After self-circularization and re-fragmentation, the DNA fragments carrying the biotin tag are amplified. Following a procedure similar to paired end sequencing, mate pair reads are generated. The orientation of these two reads is RF (Reverse Forward), where the reverse read runs from 3’ to 5’ and the forward read runs from 5’ to 3’.

Figure 2.3: Generation of mate pair reads.
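The relation between a DNA fragment and its paired end reads can be sketched in code. This is a simplified model under our own assumptions: no sequencing errors, the forward read is the first bases of the fragment, and the reverse read is reported as the reverse complement of the last bases. The fragment below is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Forward read from one end, reverse read from the opposite strand."""
    forward = fragment[:read_length]
    reverse = reverse_complement(fragment[-read_length:])
    return forward, reverse


fragment = "ACGTTGCAAGGCTTAACCGG"        # hypothetical 20 bp fragment
forward, reverse = paired_end_reads(fragment, 5)
unsequenced = len(fragment) - 2 * 5      # bases between the two reads
print(forward, reverse, unsequenced)     # ACGTT CCGGT 10
```

The unsequenced inner part is what makes the pair useful for assembly: its approximate length is known from library preparation even though its bases are not read.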

2.2.2 Repeat reference

In this thesis, we address the issue of how to correctly assemble repetitive DNA sequences, so we are interested in the available repeat sequence data. Most available repeat references were identified in assembled DNA sequences. This information helps research on the assembly of repetitive DNA regions, but the available resources are limited. In 1961, the first satellite DNA was discovered [32], and later transposable elements were discovered in the maize genome [33]. At that time, DNA was merely regarded as the macromolecule of heredity. Finally, in 1968, researchers recognized that DNA contains not only genes but also repetitive elements [34]. These discoveries revolutionized our knowledge of the non-coding part of the DNA sequence. To date, it is widely accepted that repetitive sequence is associated with human diseases [35], genome evolution [36], gene evolution [37] and gene expression [38].

In terms of length and mobility, repetitive DNA sequences are classified into two categories, i.e. tandem repeats and interspersed repeats. Table 2.3 describes these repetitive DNA sequences. DNA transposons are not listed in the interspersed repeat category: compared with retrotransposons, DNA transposons differ in the molecular mechanism of formation, but not in mobility. For a similar reason, segmental duplications and other classes are not included in this table either.

Table 2.3: Description of repetitive DNA sequences.

Category | Type | Description

Tandem repeat | Microsatellite | A microsatellite is also known as a simple sequence repeat. It is a repeating sequence of 1-10 base pairs. The pattern is repeated and the repetitions are directly adjacent to each other.

Tandem repeat | Minisatellite | A minisatellite is also known as a variable number tandem repeat. It is a repeating sequence of 10-60 base pairs. The pattern is repeated and the repetitions are directly adjacent to each other.

Tandem repeat | Satellite | A satellite is a repeating sequence of long length. The pattern is repeated and the repetitions are directly adjacent to each other.

Tandem repeat | VNTR | VNTR stands for variable number tandem repeat. The length of the repeat always shows variation between individuals.

Interspersed repeat | SINE | SINE stands for short interspersed nuclear element. SINEs are normally repetitive DNA sequences of 100-300 bp spread throughout the genome.

Interspersed repeat | LINE | LINE stands for long interspersed nuclear element. LINEs are normally repetitive DNA sequences of more than 300 bp spread throughout the genome.

Interspersed repeat | LTR | LTR stands for long terminal repeat. LTRs are duplicated hundreds of thousands of times and are located at the ends of retrotransposons.
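As an illustration of the microsatellite definition in Table 2.3, the sketch below finds perfect repeats of a 1-10 bp unit with a regular expression. The minimum of three adjacent copies and the test sequence are our own assumptions, and the pattern does not tolerate mismatches or indels between copies.

```python
import re

# A unit of 1-10 bases (lazy, so the shortest unit is preferred) followed by
# at least two more copies of itself, i.e. three or more adjacent copies.
MICROSATELLITE = re.compile(r"([ACGT]{1,10}?)\1{2,}")


def find_microsatellites(seq: str) -> list[tuple[int, str, int]]:
    """Return (start position, repeat unit, number of copies) for each hit."""
    hits = []
    for match in MICROSATELLITE.finditer(seq):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


sequence = "TTG" + "AC" * 5 + "GG" + "ATC" * 3 + "TT"   # hypothetical sequence
hits = find_microsatellites(sequence)
print(hits)   # [(3, 'AC', 5), (15, 'ATC', 3)]
```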

Table 2.4: Databases of repetitive DNA sequence.

Database | Organism | Classification | Scale | Availability

TIGR [39] | Plant | Tan. & Inter. | 12 | http://plantrepeats.plantbiology.msu.edu/index.html

PSSRDb [40] | Prokaryotes | Tandem | 85 | http://pssrdb.cdfd.org.in

SSRD [41] | Human | Tandem | 1 | http://www.ccmb.res.in

TRDB [42] | Mix | Tandem | 22 | https://tandem.bu.edu/cgi-bin/trdb/trdb.exe

MIPS [43] | Plant | Tan. & Inter. | 11 | http://mips.helmholtz-muenchen.de/plant/recat/

TREP [44] | Plant | Tan. & Inter. | 2 | http://wheat.pw.usda.gov/ITMI/Repeats/

PlantSat [45] | Plant | Tandem | 23 | http://w3lamc.umbr.cas.cz/PlantSat/

Repbase [46] | Mix | Tan. & Inter. | 31 | http://www.girinst.org/repbase/update/index.html

The significance of repetitive DNA sequence drives the development of repeat finding algorithms. The most widely used program, RepeatMasker, identifies repeats by comparing a query sequence against a reference repeat database [46]. However, the need for available repeat sequences limits the capability of reference based methods. Instead, ab initio algorithms, such as ReAS [47] and RepeatScout [48], try to find repeat sequences without a reference. For more information about repetitive sequence discovery methods, please refer to the review papers [49, 50, 51].

Publicly available repeat databases are listed in Table 2.4. In Table 2.4, Classification indicates whether a database contains tandem repeats or interspersed repeats, and Scale indicates the number of species a database includes. Databases such as GenBank [52] and ENA are not included, as they are not dedicated to repeat sequences.

2.3 Review of assembly methods

Assembly is the process of merging all reads back into a complete sequence. It is the fundamental step before downstream biological analysis, such as polymorphism discovery and genome-wide association studies. Because assembly is necessary, it is bound to sequencing technologies and depends on the specific reads they produce. Moreover, repeat sequences in DNA make assembly complex, and even pose a dilemma when the reads are too short. Several papers dedicated to assembly methods have been published [8, 53, 54, 55, 56, 57]. Besides performance comparisons, one fact they present is that the sequence assembled by these methods is shorter than it is supposed to be.

In this section, assembly methods based on next generation sequencing data are presented, together with the issues in this area. Depending on the availability of a reference genome, assembly algorithms can be classified as either reference-guided assembly or de novo assembly. In this thesis, assembly algorithms are grouped into 3 categories:

Overlap-Layout-Consensus (OLC) model based method

De Bruijn graphs (DBG) model based method

Greedy extension based method

Assembly algorithms are reviewed in Section 2.3.1. Representatives of each category are discussed in Section 2.3.2 (OLC model based methods) and Section 2.3.3 (DBG model based methods). The measurement of assembly algorithms is presented in Section 2.3.4, and the issues of current assembly algorithms are discussed in Section 2.3.5.

2.3.1 Assembly Methods overview

The history of assembly algorithms is profoundly influenced by sequencing technology. Following the birth of Sanger’s sequencing, the model that uses overlap-layout-consensus (OLC) to assemble sequencing reads was first proposed in 1980 [58]. As shotgun sequencing was widely applied later, sophisticated assembly algorithms were developed based on the OLC model, such as ARACHNE [59], Atlas [60], Celera Assembler [61], CAP3 [62], Euler [63], Newbler [64], PCAP [65], Phusion [66], Phrap [67] and RePS [68].

During this time, an alternative method, known as the Eulerian path approach, was proposed in 2001 [69]. However, it was not valued during the age of Sanger’s sequencing. With the advent of next generation sequencing (NGS) in 2004, the OLC model did not perform as well on short reads as it did on long reads. In contrast, the Eulerian path approach takes advantage of short reads, which makes it well suited to NGS assembly. Thus, novel methods were proposed based on the de Bruijn graph (DBG) model, which was first adopted by the Eulerian path approach. These include EULER [70], ALLPATHS [71, 72], ALLPATHS-LG [73], Velvet [74], ABySS [75], FragmentGluer [76], SOAPdenovo [77, 78] and Sparse bitmaps [79]. With the generation of the draft genomes of panda [80] and cucumber [81], genome assembly with NGS reads reached a new stage.

Unlike the previous models, greedy extension based methods do not model sequence reads as a graph. Instead of regarding reads as vertices, greedy extension methods regard reads as strings and try to extend them greedily. These include SSAKE [82], SHARCGS [83], VCAKE [84], QSRA [85] and SGA [86]. Other methods try to tackle assembly from a different point of view. For example, Ray [87] combines sequencing reads from Illumina and Roche 454 to reduce the number of contigs and sequence errors, while Taipan [88] combines greedy extension and graph based methods to optimize assembly.
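The idea of greedy extension can be sketched as follows. This is a minimal illustration, not the algorithm of any of the tools cited above: it extends a seed only to the right, uses exact suffix-prefix matches, and always takes the read with the longest overlap. The reads and function names are hypothetical.

```python
def suffix_prefix_overlap(contig: str, read: str, min_len: int) -> int:
    """Longest suffix of `contig` equal to a proper prefix of `read` (0 if < min_len)."""
    for n in range(min(len(contig), len(read) - 1), min_len - 1, -1):
        if contig.endswith(read[:n]):
            return n
    return 0


def greedy_extend(seed: str, pool: list[str], min_overlap: int) -> str:
    """Extend `seed` to the right with the best-overlapping unused read."""
    contig = seed
    unused = list(pool)
    while True:
        best_read, best_len = None, 0
        for read in unused:
            n = suffix_prefix_overlap(contig, read, min_overlap)
            if n > best_len:
                best_read, best_len = read, n
        if best_read is None:
            return contig                      # no read extends the contig
        contig += best_read[best_len:]         # append the non-overlapping part
        unused.remove(best_read)


contig = greedy_extend("ACGTAC", ["GGTTAA", "ACGGTT", "GTACGG"], min_overlap=3)
print(contig)   # ACGTACGGTTAA
```

Because each step commits to the locally best read, a repeat longer than the overlap can send the extension down the wrong copy, which is one reason greedy methods struggle with repetitive regions.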

Table 2.5 gives an overview of assembly methods. It summarizes these methods by 7 properties, i.e. type, QC, PE, PE-C, reference, repeat and parallel. QC indicates whether a method adopts quality control. PE indicates whether a method depends on paired end reads. PE-C indicates whether a method uses paired end reads to construct contigs. Reference indicates whether a method is based on a reference. Repeat indicates whether a method has a dedicated module to tackle repeats. Parallel indicates whether a method supports parallel computing.

Table 2.5: Overview of assembly methods.

Type QC PE PE-C Reference Repeat Parallel

ARACHNE OLC Yes Yes

Atlas OLC Yes Yes

Celera OLC Yes

CAP3 OLC Yes

Newbler OLC Yes

PCAP OLC

Phusion OLC Yes

Phrap OLC Yes

RePS OLC Yes

EULER DBG

ALLPATH DBG Yes Yes Yes

Velvet DBG Yes Yes

ABySS DBG Yes Yes

SOAPdenovo DBG Yes Yes

SSAKE DBG

SHARCGS DBG

VCAKE DBG

QSRA DBG Yes

SGA DBG Yes

Ray Other Yes Yes

Taipan Other Yes

Regardless of data preprocessing, repeat assembly and other concerns, whole genome sequence assembly follows 3 common steps:

1. Contig construction: A contig is a continuous DNA sequence built from overlapping reads. It always refers to a long unambiguous sequence.

2. Scaffolding: Scaffolding refers to determining the orientation and order of contigs [89, 90, 91]. This process can be seen as the linkage of contigs, and the resulting linked structure is called a scaffold.

3. Gap closure: Gap closure tries to fill the gaps between contigs in the scaffold. This process attempts to complete the whole DNA sequence [92, 93].

2.3.2 Overlap-Layout-Consensus Based methods

Normally, Overlap-Layout-Consensus based methods follow 3 steps.

1. Overlap step: an all-against-all, pairwise comparison of reads is performed. This process depends on the minimum number of overlapping bases and on the similarity between the overlapping bases. For similarity, the numbers of mismatches and indels are taken into consideration. When long reads are assembled, reads related to repeat sequences are always masked during this procedure.

2. Layout step: based on the overlap information from step 1, reads, regarded as vertices, are connected. Sequencing errors are usually corrected at this stage. After this, an overlap graph is drawn.

3. Consensus step: based on the graph generated in step 2, multiple sequence alignment is applied and a Hamiltonian path of the layout graph is determined. A Hamiltonian path visits each node in the graph exactly once. Merging all nodes along this path gives a continuous sequence.

In Figure 2.4, we demonstrate the process of an OLC based method. In this case, a 32 bp DNA sequence is sequenced. The read length is 10 bp, and the minimum overlap is set to 5 bp. All 6 reads are aligned under the original DNA sequence, and the OLC graph is drawn based on the overlap information. The next step is to find a path that visits every node exactly once. In this scenario, the result is easily deduced: R1>R2>R3>R4>R5>R6.

Figure 2.4: Illustration of OLC based method.

We review related work on OLC based methods as follows:

ARACHNE

ARACHNE begins with a trimming procedure. Based on Phred quality, bases with extremely low scores are trimmed, while reads consisting mainly of low quality bases are eliminated. Reads sequenced from the bacterial host and the cloning vector are also discarded. After this, ARACHNE starts overlap detection between pairs of reads. Instead of an all-against-all comparison of reads (a number of comparisons that grows quadratically with the number of reads), ARACHNE compares k-mers, which scales linearly. At this stage, k is set to 24, and 24-mers with extremely high frequency are excluded, as they are potential repeats. Based on the overlap information of the k-mers, the program identifies their corresponding reads. In this way, the overlaps between reads are deduced. Then, the layout stage follows a 3-step alignment module, known as merge-extend-refine. It is followed by an error correction step in which a majority rule is applied. Between overlapping reads, a penalty score is calculated based on each discrepant base, its quality and its flanking bases. These scores are combined into an overall penalty score for the alignment. With the penalty scores, the final layout is generated and contigs can be easily merged. The scaffolding procedure straightforwardly links two contigs by plasmid reads. For a gap in the scaffold, ARACHNE tries to find a repeat contig which overlaps pairwise with the contigs in the scaffold.
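The k-mer based overlap detection described above can be sketched with a simple index. This is an illustration of the general idea rather than ARACHNE's implementation: it uses a small k and a frequency cut-off counted in reads, both chosen by us for the toy data.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads: list[str], k: int, max_reads_per_kmer: int) -> set[tuple[int, int]]:
    """Pairs of read indices that share a k-mer which is not over-represented."""
    index = defaultdict(set)                     # k-mer -> ids of reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        if len(read_ids) > max_reads_per_kmer:
            continue                             # likely repeat, skip this k-mer
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ACGTTGCA", "TTGCAGGA", "CCCCTTTT", "GGACCCCA", "ACCCCGTA"]
pairs = candidate_pairs(reads, k=4, max_reads_per_kmer=2)
print(sorted(pairs))   # [(0, 1), (3, 4)]
```

The 4-mer CCCC occurs in three reads and is therefore ignored, so reads 2 and 3 are never compared even though they share it. Only the candidate pairs need a full overlap computation, which avoids the all-against-all comparison.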

Experiments: Simulated reads were generated, providing 10-fold coverage of the genomes of H. influenzae, S. cerevisiae and D. melanogaster and of human chromosomes 21 and 22. The assembly of D. melanogaster achieved ~98% coverage with an N50 contig length of 324 Kb and an N50 scaffold length of 5143 Kb. Small errors were detected at a rate of almost 1 per 1 Mb.
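The N50 values quoted in these experiments follow the common definition: the length of the contig at which, when contigs are sorted from longest to shortest, the running total first reaches half of the total assembled length. A minimal sketch with invented contig lengths:

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


contig_lengths = [80, 70, 50, 40, 30, 20]   # hypothetical contig lengths in bp
print(n50(contig_lengths))                  # 70
```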

Atlas

Atlas combines reads generated from BACs and whole-genome shotgun (WGS) sequencing. Reads from BACs allow localized assembly, which reduces the computational power required, while reads from whole-genome shotgun sequencing provide different insert sizes, which benefits scaffolding. The assembly process begins with trimming. It is notable that trimmed reads are only used for comparison but not for assembly. Potential repeat reads are discarded during pairwise overlap detection, by setting the maximum frequency to 12. The k-mer size is set to 32, and the maximum mismatch between any 32-mers is set to 3%. All BAC reads and WGS reads are compared, and only 6 WGS reads are kept, based on a computed weight. After this, the overlap graph is generated. In the gap closure step, WGS reads with different insert lengths are utilized to link scaffolds.

2.3.3 De Bruijn Graph based methods

The core of a de Bruijn graph based method follows 4 steps.

1. All substrings of length k are derived from the sequencing reads. These are known as k-mers. Based on the frequency of the k-mers, reads with high frequency k-mers can be masked during this procedure.

2. A de Bruijn graph is generated from all k-mers. Each k-mer is regarded as a vertex. Vertices i and j are connected by a directed edge (i, j) if there is a (k+1)-mer in the DNA sequence that contains i as prefix and j as suffix.

3. Based on the de Bruijn graph, each unique path is merged into an unambiguous sequence. The unambiguous sequences, combined with the remaining part of the de Bruijn graph, form a repeat graph.

4. Based on the repeat graph, a traversal algorithm is applied to find an Euler path. An Euler path visits each edge in the graph exactly once. Merging the nodes along this path then forms a long continuous sequence.

Figure 2.5: Demonstration of de Bruijn graph generation.

In Figure 2.5, we present the process of de Bruijn graph generation. First, 3-mers are generated from a DNA sequence. Then two 3-mers are connected if the suffix of one 3-mer is the same as the prefix of the other. After connecting, a 3-mer graph is obtained. Finally, nodes on a single path are merged into an unambiguous sequence, forming a de Bruijn graph [94].
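The graph construction can be sketched in a few lines. The sketch assumes the node-centric formulation in which every k-mer is a vertex and two vertices are joined when they are the prefix and suffix of a (k+1)-mer observed in a read. The reads are hypothetical and do not correspond to the sequence in Figure 2.5, and merging of single paths is left out.

```python
def de_bruijn_graph(reads: list[str], k: int) -> dict[str, list[str]]:
    """Map each k-mer to the sorted list of k-mers that follow it in some read."""
    graph: dict[str, set[str]] = {}
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]              # prefix of the (k+1)-mer
            right = read[i + 1:i + k + 1]     # suffix of the (k+1)-mer
            graph.setdefault(left, set()).add(right)
            graph.setdefault(right, set())    # make sure the vertex exists
    return {vertex: sorted(successors) for vertex, successors in graph.items()}


reads = ["ATGGCGT", "GCGTGCA"]                # hypothetical overlapping reads
graph = de_bruijn_graph(reads, k=3)
for vertex, successors in graph.items():
    print(vertex, "->", successors)
```

Both reads contain the 4-mer GCGT, so they contribute the same edge GCG -> CGT and the two reads are joined in the graph without any explicit overlap computation.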

We review related work on de Bruijn graph based methods as follows:

ABySS

ABySS, short for Assembly By Short Sequences, is a de novo genome assembly method based on the de Bruijn graph. Its innovation lies in computing efficiency. It adopts parallel computing techniques to construct the de Bruijn graph and uses a hash function to store sequence information. The method contains 2 stages: in the first, the sequences are read and the data structure is constructed to generate contigs; in the second, paired-end information is exploited to merge contigs. Before stage 2, read errors, which appear as “dead-end” branches in the graph, are eliminated. These branches are shorter than a threshold, measured from the ambiguity point by tracing backwards from their end. In stage 2, two contigs are deemed linked only if more than 5 paired-end reads join them. Linked contigs are then merged by searching the de Bruijn graph for a unique path. The contigs are generated with a k-mer size of 36.

Experiments: The genome of an African male publicly released by Illumina

were sequenced in 3.5 billion paired-end reads. By parallel assembling using

ABySS, 2.76 million contigs with more than 100 base pairs were generated,

which represents 68% of the reference human genome. N50 size is 1499 bp.

Velvet

Velvet begins by storing the relations between reads and k-mers in a hash table. The value of k is allowed to range from 21 to 25; a lower value of k increases the complexity of the de Bruijn graph. After the de Bruijn graph is generated, two steps, “simplification” and error removal, are applied. In simplification, each unambiguous path in the de Bruijn graph is merged into one node, which saves memory space and calculation time. In error removal, 3 modules are used:

1. Removing tips: a tip is a chain of nodes that is disconnected at one end. It arises when k-mers inherit sequencing errors from the reads. Simply removing these tips does not affect the connectivity of the graph.

2. Removing bubbles: the Tour Bus algorithm is adopted. Two paths are considered redundant if they start and end at the same nodes. Such a structure is called a bubble and is created by errors or biological variants.

3. Removing erroneous connections: erroneous connections are removed based on a user-defined threshold. These connections do not create any loop and are covered by few reads.
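The first module, tip removal, can be sketched as follows. This is an illustrative simplification and not Velvet's implementation: the graph is a dictionary from node to successor set, only dead ends without successors are handled, and the cutoff `max_tip_nodes` is a hypothetical parameter counted in nodes.

```python
def remove_tips(graph, max_tip_nodes):
    """Return a copy of graph without short dead-end chains (tips)."""
    g = {u: set(vs) for u, vs in graph.items()}
    for vs in graph.values():
        for v in vs:
            g.setdefault(v, set())
    preds = {u: set() for u in g}
    for u, vs in g.items():
        for v in vs:
            preds[v].add(u)

    dead_ends = [u for u in g if not g[u]]       # nodes without successor
    for end in dead_ends:
        chain, cur = [end], end
        # Walk backwards until a branching ancestor is reached.
        while len(preds[cur]) == 1:
            parent = next(iter(preds[cur]))
            if len(g[parent]) > 1:               # branch point found
                if len(chain) <= max_tip_nodes:
                    g[parent].discard(cur)       # detach the tip
                    for node in chain:
                        del g[node]
                break
            chain.append(parent)
            cur = parent
    return g

graph = {"A": {"B"}, "B": {"C", "X"}, "C": {"D"}}
print(remove_tips(graph, max_tip_nodes=1))
```

Here the one-node branch X is removed, while the longer path through C and D is kept, so the connectivity of the main path is unchanged.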

After these processes, the “Breadcrumb” algorithm is adopted to merge contigs through repeat areas, which is achieved by mapping paired-end reads.

Experiments: Velvet was tested on short-read data sets with read lengths ranging from 25 to 50 bp. With paired reads only, an N50 length of up to 50 kb was obtained on simulated prokaryotic data. On real data, Velvet generated contigs of almost 8 kb for a prokaryote and 2 kb for a mammalian BAC.

2.3.4 Measurement

No gold standard is available for measuring the quality of a sequence assembler. Depending on the project, researchers define metrics from their own perspectives. From the biological point of view, a sequence assembly method is expected to generate complete sequences, whereas from the computational point of view it is crucial to consider memory cost, parallelism and computation time.

The absence of a gold-standard test dataset is another factor that makes measurement difficult. When evaluating an assembly method, one cannot infer the quality of assembling one species from the result of assembling another species [55]. Moreover, the fast development of sequencing technology yields novel types of sequence reads, which in turn drives the evolution of sequence assembly and makes it impossible to propose a gold-standard dataset. However, some common measurements are still widely adopted. Table 2.6 presents existing measurements of assembly methods.

Table 2.6: Measurement of assembly methods [95].

Metric name | Units | Description
N50 | - | A weighted median of the lengths of items. N50 is calculated by ordering all contigs from the longest to the shortest, then adding the lengths from the longest until reaching 50% of the total length of all contigs. The last added length is the N50 size. With regard to assemblies, the items are typically contigs or scaffolds.
NG50 | - | Whereas N50 sets the median in relation to the total length of all items in the set, NG50 is normalized by the average length of the a1 and a2 haplotypes instead of the total length of all sequences as in N50; it is thus more reliable than N50 for comparison between assemblies.
CPNG50 | bp | Contig path NG50. The weighted median of the lengths of contig paths.
SPNG50 | bp | Scaffold path NG50. The weighted median of the lengths of scaffold paths. Scaffold paths represent maximal concatenations of contig paths and scaffold breaks that maintain correct order and orientation.
Structural errors | Counts | An error within a contig or scaffold. Errors include intra- and inter-chromosomal joins, insertions, deletions, simultaneous insertions and deletions, and insertions at the ends of assembled sequences.
CC50 | bp | Correct contiguity 50. The empirically sampled distance between two points in an assembly, where the two points are as likely to be correctly aligned as not.
Substitution errors | Counts per correct bits | Number of substitution errors per correct bit. Substitution errors are columns in the alignment where the a1 and a2 haplotypes contain either the same base (homozygous) or different bases (heterozygous) and the alignment contains a base (or IUPAC symbol) different from either a1 or a2. The metric uses a bit score to allow for IUPAC symbols.
Copy number errors | Proportions | For a given haplotype column in the MSA, the copy number of the simulated genome can be described as an interval (min, max). Assemblies with a copy number outside of this interval are classified either as an excess, for being above the interval, or a deficiency, for being below the interval.
Coverage | Percent | The coverage is the percent of columns in the MSA of the target (the whole genome, regions of a specific annotation type, etc.) that contain positions of the assembly.
Genic correctness | Percent | The genic correctness is the percentage of base pairs in spliced transcripts from the haplotype sequences that align to the assembly with 95% coverage using WU-BLAST.
E-sizes | - | The expected size of the contig or scaffold containing a given location, $E = \sum_{C} L_C^2 / G$, where $L_C$ is the length of contig C and G is the genome length estimated by the sum of all contig lengths.
Others | - | Other assessments involve the investigation of genome features; for example, [96] examined the fraction of certain genes that are present in all genomes.
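As a worked illustration of the E-size entry, the sketch below assumes the common definition E = Σ L_C² / G, with L_C the length of contig C and G the genome length. When no genome length is supplied, G is estimated by the sum of all contig lengths, as in the table. The function name `e_size` is hypothetical.

```python
def e_size(contig_lengths, genome_length=None):
    """Expected length of the contig covering a random genome position."""
    total = sum(contig_lengths)
    g = total if genome_length is None else genome_length
    return sum(length * length for length in contig_lengths) / g

# Three contigs of 4, 3 and 3 bp: (16 + 9 + 9) / 10
print(e_size([4, 3, 3]))
```

Unlike N50, this value changes whenever any contig changes length, which makes it less sensitive to where the 50% cut happens to fall.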

2.3.5 Issues of Assembly methods

1. Limited applicability to different sequencing reads. DNA sequencing technology directly affects assembly methods through read length, sequencing coverage and base quality. Sanger’s long reads are more suitable for Overlap-Layout-Consensus based methods than for de Bruijn graph based methods. Conversely, the short reads of next generation sequencing are more suitable for de Bruijn graph based methods. A Sanger sequencing project rarely reaches a 33× sequencing depth, as this would generate redundant information for assembly methods. However, NGS technology needs a 33× sequencing depth to assure complete sequencing [97]. No model adapts to changes in sequencing technology, though some attempts have combined the advantages of existing models [88].

2. Noisy data always exist. Especially in next generation sequencing, the base calling procedure depends on the color emitted by fluorescent dyes; the location and shape of the color signal determine the base call. However, this information is not always consistent because of the dense layout of the flowcell. Wrong base calls make sequence reads erroneous.

3. Base quality information is not fully utilized. To measure the correctness of base calling, a quality score is statistically calculated. However, this information is not fully utilized in sequence assembly methods. Several methods apply a trimming process before assembly, while others assemble raw reads directly without preprocessing. Apart from data preprocessing, quality information is discarded, although it could be used as a weight during overlap comparison.

4. Paired-end reads and mate-pair reads are only used for scaffolding. Some assembly methods only accept paired reads [75] but, as in other methods, the pairs are only used for contig linkage and gap closure. [98] used paired-end reads to construct contigs, but only on a small genome. The features of paired-end information are not fully exploited.


5. Inherent limitations of existing models. In Overlap-Layout-Consensus methods, a Hamiltonian path needs to be found after the layout graph is generated. However, finding a Hamiltonian path is NP-hard, which makes the approach unsuitable for complicated graphs [99]. For de Bruijn graph based methods, the need to generate k-mers leads to loss of information from the original sequence reads. As a result, a path in the de Bruijn graph may not be supported by the underlying reads [100].

6. Repetitive DNA sequence problem. Because of repeats, sequence assembly methods tend to bypass repetitive regions. In Overlap-Layout-Consensus methods, k-mer frequency is counted to mask repeat-related sequences. The masked reads are not included in the layout graph, so the subsequent contigs lose repeat information. In de Bruijn graph based methods, the traversal algorithm always stops at a loop path, where it faces a choice of direction, and repeat information is lost. Though NGS tries to compensate for short read length with high sequencing depth, the fraction of problems caused by repeats depends only on read length [97].

7. No standard measurement and benchmark data set. Published assembly methods use either newly simulated data or their own real sequencing data, so it is hard to compare their performance. Different protocols make these methods data-set sensitive. The length of the longest contig alone cannot distinguish superior performance.


Chapter 3

3 Innovative method of tandem repetitive sequence assembly

3.1 Introduction

Next generation sequencing (NGS) technology, which provides high-throughput sequence data at low cost, is a major development towards the understanding of genomic information at the population level, enabling genomic investigation of diverse living species [101]. These genomic data can have a prominent impact on medicine, agriculture and environmental conservation. NGS-derived data are presented as numerous short sequence reads, and the daunting task in interpreting NGS data involves de novo assembly of these short sequences into contigs, scaffolds and then genome sequences, or comparative alignment of these short reads to known reference genomes [102, 103]. In this process, one challenge to sequence assemblers is erroneous reads, which require filtering and correction. A second and more daunting challenge is the large number of repetitive sequences in the reads, which complicate graph assembly by converging or collapsing in the path and are difficult to separate [5, 54, 95].

Eukaryote genomes are inflated by non-coding sequences in the form of intervening introns and intergenic sequences and, in early genetic studies, these genomic regions were largely grouped as ‘junk’ or ‘selfish’ DNA [104, 104, 106]. The human genome consists of approx. 25,000 genes, which collectively occupy merely approx. 1% of the entire genome [107], although recent studies have shown low-level transcription of many non-coding genomic regions as regulatory long non-coding RNAs (lncRNAs) in the nuclei [108, 109]. An important feature of these non-coding regions is the abundance of repetitive elements of various lengths, which form more than 55% of the human genome [110]. Among these repetitive sequences, the transposable interspersed repetitive elements, which are either DNA transposons or retroelements, the latter including long terminal repeats (LTRs), long interspersed elements (LINEs) and short interspersed elements (SINEs), contribute hugely to genome size, whereas tandem repeats (i.e. satellite DNA, microsatellites and low complexity sequences) contribute to a lesser extent [110, 111]. Nevertheless, tandem repeats account for 3% of the human genome [112].

Unlike most transposable elements, which are interspersed genome-wide, tandem repeats are found in 10-20% of eukaryote genes and promoters [113]. In the human genome, tandem repeats are found in 6% of the coding regions [114, 115]. Tandem repeats are highly unstable during replication, giving rise to high frequencies of repeat length polymorphism. This intimately affects gene expression and the structure of gene products, with some polymorphisms causing variation in cellular phenotype, evolutionary progression or pathological development [113, 114, 116]. For example, Huntington’s disease is caused by variation in tandem repeats [117]. Expansion of a CAG repeat in the androgen receptor gene causes X-linked spinal and bulbar muscular atrophy [118]. Besides monogenic diseases, tandem repeat polymorphisms are also associated with polygenic diseases [119]. Somatic variations of tandem repeats have been used as biomarkers for cancers [120]. However, the difficulty of accurately detecting these repetitive sequences in the genome hampers investigation of their disease-associated variations. Firstly, while NGS genomic data can be obtained in high throughput, the sequence reads are short and often fall inside repetitive regions during assembly, causing short, error-prone scaffolds [121]. Omitting these reads from the analysis leads to poor coverage of these genomic regions. Secondly, PCR detection of tandem repeat regions suffers from interference by stutter products [122].

The de Bruijn method is widely used in graph construction. It works by breaking NGS sequence reads into k-mers, which serve as vertices or nodes, and the assembly of these k-mers is directed by edges defined by overlaps of k-1 bases [102, 103]. Traversal methods are used to generate contigs from these complex graphs. When k-mers are derived from repeat-containing reads, especially when the corresponding repetitive regions are longer than the k-mers, loop structures or bubbles form in the graph, confounding subsequent contig generation. For example, the Tour Bus algorithm is applied in Velvet to merge these bubbles by consensus, but it embeds uncertainty and error information into the consensus path [74]. In ABySS, bubbles are removed based on read coverage by selecting the path in the graph with maximum support [75]. The core consideration in these strategies is to align paired-end reads back to the draft de Bruijn graphs in search of alignment support. Similar strategies are adopted in other methods [73, 94, 123]. A common challenge to these methods is the over-simplification of repetitive regions, leading to uncertainty and errors. Assemblers emphasizing tandem repeats have been developed, but the improvements in assembly outcome were partial or conditional [124, 125, 126].

In this chapter we report TRA (Tandem Repeat Assembler), a new method focusing on the assembly of tandem repeats, and evaluate it using human chromosome 14 as a benchmark dataset [54]. With regard to tandem repeat assembly, this new method outperformed other state-of-the-art assemblers.

The rest of this chapter is organized as follows. First, we define tandem repeats and give an overview of the TRA algorithm. Second, we present the details of our method. Third, we evaluate our method by comparing it with state-of-the-art methods on a benchmark dataset. Finally, we conclude this chapter.


3.2 Methods

3.2.1 Tandem repeat definition

If a repeat can be aligned to multiple locations in the genome, the alignment scores (bit scores) at these locations will be similarly high, and we consider the repeat a transposon repeat. If a repeat can be aligned with a high score to only one location, we consider it a tandem repeat. We therefore aligned all tandem repetitive regions >202 bp (downloaded from TRDB) against chromosome 14 of the human reference (GRCh38). For each repetitive region, we calculate Δbs, which reflects the difference between its best alignment at one location and its second best alignment at another location:

$\Delta bs = (\text{Bit score}_{\text{best\_align}} - \text{Bit score}_{\text{second\_best\_align}}) / \text{Bit score}_{\text{best\_align}}$

The lower the Δbs value, the more locations a repetitive region can align to. In the extreme case, if the admission condition for tandem repeats were relaxed to Δbs = 0, all repetitive regions would be considered tandem repeats, even though regions with identical copies at different locations should actually be considered transposon repeats. In this thesis, a repetitive region with Δbs > 0.65 is defined as a tandem repeat.
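The Δbs criterion can be written as a short sketch. It assumes the bit scores of one repetitive region at its distinct alignment locations are already available as a list (for example, parsed from aligner output); treating a region with a single alignment as having a second-best score of 0 is an assumption made here, and the function names are hypothetical.

```python
def delta_bs(bit_scores):
    """Relative drop from the best to the second best bit score."""
    ranked = sorted(bit_scores, reverse=True)
    best = ranked[0]
    # Assumption: a region with one alignment has second best score 0.
    second = ranked[1] if len(ranked) > 1 else 0.0
    return (best - second) / best

def is_tandem_repeat(bit_scores, threshold=0.65):
    """Apply the thesis criterion: Δbs above the threshold."""
    return delta_bs(bit_scores) > threshold

print(is_tandem_repeat([1000, 300]))   # one dominant location
print(is_tandem_repeat([1000, 990]))   # two similar locations
```

A region whose second location scores almost as well as the first yields a Δbs near 0 and is treated as a transposon-like repeat.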

The objective of this work is to assemble tandem repeat regions, in which a repeat motif, also known as a repeat pattern, is duplicated adjacently. We downloaded the tandem repeats found in human chromosome 14 from the Tandem Repeat Database [41]. From a total of 27552 repeat regions, 1302 repetitive regions with motifs longer than 101 bp were selected and aligned against chromosome 14 (GRCh38). For each repetitive region, the bit scores of its alignments were examined and, if the bit score decreased by > 65% between the two best alignments at different locations, the region was defined as a tandem repeat. Figure 3.1 illustrates how tandem repeats are distinguished from all other repetitive regions. Based on these criteria, 169 repetitive regions are defined as tandem repeats, accounting for ~13% of the 1302 shortlisted repetitive regions (Figure 3.1).


Figure 3.1: Tandem repeats in human chromosome 14.

3.2.2 Algorithm overview

The TRA algorithm exploits paired-end reads with short insert size, together with repetitive motifs, to assemble tandem repeat genomic regions. The algorithm includes 4 steps. In the first step, each pair of reads is merged into one read (Figure 3.2a). In the second step, all processed reads are aligned to a list of known repeat motifs extracted from the genome, and a read is selected as a candidate read for further processing if any of its substrings has a k-bp overlap with any repeat motif (Figure 3.2b). In the third step, these candidate reads are used to generate draft contigs through overlap-layout processing: reads that share an overlap of at least the minimum threshold length n are merged, and the longest merged sequence is used as the draft contig (Figure 3.2c). In the last step, all candidate reads are aligned simultaneously to this draft contig, and a new draft contig is obtained by voting on each base from the aligned reads. The contig update continues until the voting result is stable (Figure 3.2d).

Figure 3.2 demonstrates an example of tandem repeat assembly. (A) Merging of Illumina paired-end reads. In this case, the read length is 100 bp and the pair covers a span of 170 bp. There is a 30 bp overlap between Read_forward and Read_reverse, so they are merged into one read of 170 bp in length. (B) Candidate selection. Every read is compared with a given repeat motif. In the figure, reads a, b and c are selected as candidates, since they have n bp overlap with the repeat motif. (C) Overlap layout of candidate reads and draft contig generation. Candidate reads are merged into one if their overlap exceeds a threshold. In this case, reads a and b are merged into one. (D) Contig generation. All candidate reads are aligned under the longest draft contig, contig_1. Afterwards, a new contig, contig_2, is generated by voting on each base from the aligned reads. Then contig_2 serves as the draft contig against which all candidate reads are aligned again to obtain a new contig_3. This process stops when the generated contig_n is the same as contig_n-1.


Figure 3.2: Workflow of tandem repeat assembler.


3.2.3 Merge of paired-end reads

Illumina’s paired-end reads provide 5’-3’ DNA sequences from both strands of each DNA fragment. Each pair of reads can therefore be expressed as “Read_forward-gap-Read_reverse”, where Read_forward is the sequence read from the forward strand and Read_reverse is the sequence read from the reverse strand. When a long DNA fragment is sequenced, sequences are generated only from both ends, leaving a gap that is not sequenced. If the paired-end reads span sufficiently far from both ends to cover a repetitive region, the repeats can be correctly assembled after the gap is filled. If the insert size of the paired-end reads is less than twice the read length, there is no gap between the two reads, and sufficient overlap may be found between the two opposite reads. Insert size here is counted from the 5’ end of Read_forward to the 5’ end of Read_reverse. TRA exploits the overlap in each pair of reads with short insert size by merging “Read_forward-gap-Read_reverse” into Read_merged. Specifically, TRA first converts Read_reverse into its reverse complement and then aligns it with its paired Read_forward. If the tail region of Read_forward overlaps with the head region of the reverse-complemented Read_reverse, TRA merges the two reads into Read_merged. Otherwise, the “Read_forward-gap-Read_reverse” structure remains unchanged. After the paired-end reads have been processed in this way, TRA selects candidate reads for tandem repeat assembly.
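The merging step can be sketched as below. This is a simplified illustration rather than the TRA implementation: it assumes ungapped overlaps, tries the longest overlap first, and keeps the forward read's bases inside the overlap. The parameter `min_overlap` and the function names are hypothetical; the default mismatch cutoff of 20% follows the value reported in the results of this chapter.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def merge_pair(read_forward, read_reverse, min_overlap=10,
               max_mismatch_rate=0.2):
    """Merge a read pair if the tail of the forward read overlaps the
    head of the reverse-complemented reverse read; otherwise None."""
    rc = reverse_complement(read_reverse)
    longest = min(len(read_forward), len(rc))
    for olen in range(longest, min_overlap - 1, -1):
        tail, head = read_forward[-olen:], rc[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches < max_mismatch_rate * olen:
            # Assumption: forward-read bases are kept in the overlap.
            return read_forward + rc[olen:]
    return None          # pair stays as Read_forward-gap-Read_reverse

fragment = "ACGTTGCATGCCGATA"                 # 16 bp toy fragment
fwd = fragment[:10]                           # 10 bp forward read
rev = reverse_complement(fragment[6:])        # 10 bp reverse read
print(merge_pair(fwd, rev, min_overlap=4) == fragment)
```

With 10 bp reads from a 16 bp fragment the two reads share 4 bp, so the merged read restores the whole fragment.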

3.2.4 Candidate reads selection

In TRA, repeat motifs are used to select candidate reads for the subsequent assembly of specific tandem repeats. In this study, tandem repeats on human chromosome 14 were downloaded from the Tandem Repeat Database (TRDB). Using the YASS method at its default setting [127], all tandem repeats were aligned to the human reference. Tandem repeats that aligned with similar bit scores to multiple locations of the human reference, and were therefore likely to be transposons, were excluded. The motifs of the remaining tandem repeats were used to select candidate reads. We only used motifs longer than 101 bp, so that these motifs cannot be covered by any single sequence read and are problematic for most available assemblers.


Candidate reads are those that cover repeat regions. Given the sequence of a repeat motif, the motif was circularized by linking its head to its tail and then cut into 20-mers, i.e. all substrings of length 20 bp. All sequence reads were examined and those containing any 20-mer of the motif were selected. The threshold was set at 20-mers rather than longer k-mers for four reasons. Firstly, it retains sequence reads that contain genuine variations; with a higher threshold, these reads risk being omitted. Secondly, it helps retain reads that contain only one or two erroneous bases. Thirdly, some sequence reads cover only a small number of bases at either end of a repeat region, but they are also valid candidates conveying valuable information. Finally, repeat motifs were obtained from assembled genomes and their quality can vary. Setting a lower threshold of 20-mers is expected to allow candidate reads to be selected even when the motif sequence has errors.
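The selection rule can be sketched as follows, assuming exact k-mer matching on the given strand only and a motif at least k bases long. The function names are hypothetical, and the example uses k = 4 on a toy motif purely for readability, whereas the method above uses 20-mers.

```python
def circular_kmers(motif, k=20):
    """All k-mers of the motif after joining its tail to its head."""
    circular = motif + motif[:k - 1]
    return {circular[i:i + k] for i in range(len(motif))}

def select_candidates(reads, motif, k=20):
    """Keep the reads that contain at least one k-mer of the motif."""
    seeds = circular_kmers(motif, k)
    return [read for read in reads
            if any(read[i:i + k] in seeds
                   for i in range(len(read) - k + 1))]

motif = "ACGTTGCA"
reads = ["GGGCAACGGG", "GGGGGGGG"]
print(select_candidates(reads, motif, k=4))
```

The first read is selected through the 4-mer GCAA, which exists only because the motif is circularized: it spans the junction between two adjacent motif copies.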

3.2.5 Contig generation method

After the selection of candidate reads, TRA assembles these reads back into complete tandem repeats. Each set of selected candidate reads was first arranged in descending order of the number of bases that overlapped with the repeat motif. From the most to the least overlapped read, two reads are merged if they exhibit > 20 bp overlap. This step is performed 5 times, allowing from 0 to 4 indels in the overlap. The longest merged sequence serves as a draft contig under which all candidate reads are simultaneously aligned. As the threshold overlap for alignment is set at 20 bp, candidate reads displaying < 20 bp overlap with the draft contig are ignored. Different alignment strategies were applied to the two types of reads, Read_merged and Read_forward-gap-Read_reverse. Each Read_merged is aligned under the draft contig at the position where it exhibits the most matching bases with the draft contig. For a Read_forward-gap-Read_reverse, Read_forward and Read_reverse are aligned separately under the draft contig following the same strategy as for Read_merged. After alignment is completed, all locations on the contig that are aligned with candidate reads are examined, and candidates are omitted under three conditions. First, the coverage by Read_forward and Read_reverse spans more than 230 bp. Second, the aligning location of a Read_reverse is in front of the aligning location of its Read_forward. Third, only one read in a pair can be aligned to the draft contig; this occurs when only Read_forward aligns to the head of the draft contig or only Read_reverse aligns to the tail of the draft contig.

After the candidate reads are aligned under the draft contig, a new contig is obtained by voting on each of its bases from the aligned reads. This new contig then serves as the draft contig for aligning all candidate reads again. With the updated contig, candidate reads that did not qualify in the previous alignment are re-examined based on the voting result. This dynamic updating of the contig stops when the voting result no longer changes any base.
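The iterative voting update can be sketched as below. This is a strongly simplified illustration: reads are placed without gaps at the offset with the most matching bases, they must lie entirely inside the draft contig, and the pair constraints described above are not modelled. The names `best_offset`, `vote_once`, `polish` and the `max_rounds` safeguard are hypothetical.

```python
from collections import Counter

def best_offset(read, contig):
    """Offset in contig where the read has the most matching bases."""
    best, best_matches = None, -1
    for off in range(len(contig) - len(read) + 1):
        window = contig[off:off + len(read)]
        matches = sum(a == b for a, b in zip(read, window))
        if matches > best_matches:
            best, best_matches = off, matches
    return best, best_matches

def vote_once(contig, reads, min_matches=20):
    """Rebuild the contig by majority vote over the aligned reads."""
    columns = [Counter() for _ in contig]
    for read in reads:
        off, matches = best_offset(read, contig)
        if off is None or matches < min_matches:
            continue                       # read does not qualify
        for i, base in enumerate(read):
            columns[off + i][base] += 1
    # Uncovered positions keep the base of the current draft.
    return "".join(col.most_common(1)[0][0] if col else old
                   for col, old in zip(columns, contig))

def polish(draft, reads, min_matches=20, max_rounds=50):
    """Repeat the vote until the contig stops changing."""
    contig = draft
    for _ in range(max_rounds):
        updated = vote_once(contig, reads, min_matches)
        if updated == contig:
            break
        contig = updated
    return contig

reads = ["ACGTTG", "GTTGCA", "TGCATG"]
print(polish("ACGTAGCATG", reads, min_matches=4))
```

In the toy run the draft carries one wrong base at position 4, which all three reads outvote in the first round; the second round changes nothing, so the update stops.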

3.3 Result

3.3.1 Contig generation

An experimentally generated data set of human chromosome 14, previously used to evaluate the performance of assemblers [54], was used to evaluate the performance of the new TRA method. Briefly, paired-end reads of 101 bp in length with an average insert size of 155 bp were used in the assembly. Firstly, the two reads of each pair were merged if their overlap exhibited less than 20% mismatch (see Methods) and, as a result, 90.4% of all reads were merged. The remaining read pairs lacked overlap, mainly due to the error-prone 3’ sequences associated with the Illumina platform, and were left unprocessed. The 169 tandem repeats identified on chromosome 14 (> 101 bp in length) were used to select reads for constructing contigs. This led to 169 contigs, among which 167 were > 200 bp, with an N50 size of 562 bp (Table 3.1). N50 was calculated by ordering all contigs from the longest to the shortest, then adding the lengths from the longest until reaching 50% of the total length of all contigs. The last added length is the N50 size. N90 is defined similarly.
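The N50 and N90 calculation described above can be expressed as a short sketch; the function name `nx` and the toy contig lengths are illustrative only.

```python
def nx(lengths, fraction=0.5):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches the given fraction of the
    total length of all contigs."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return None                              # only for an empty list

contigs = [80, 70, 50, 40, 30, 20]           # total length 290
print(nx(contigs, 0.5), nx(contigs, 0.9))
```

For these lengths, half of the total (145) is reached after the two longest contigs, and 90% (261) after the five longest.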

Table 3.1: Assembly of tandem repeats in human Chromosome 14.

Contigs total | Contigs > 200 bp | N50 (bp) | N90 (bp) | Total length (bp) | Coverage
169 | 167 | 562 | 378 | 92399 | 99.5%


3.3.2 Correctness of assembly

To determine whether the assembled contigs recovered the original tandem repetitive regions, each contig was aligned to its corresponding original region, “Flank_left-tandem repeat-Flank_right”. The flank was 5 bp in length at each end and was downloaded along with the tandem repeats from TRDB [41]. A contig is considered to have recovered its corresponding tandem repeat if it covers the “Flank_left-tandem repeat-Flank_right”. Among the 169 tandem repeat regions used to assemble the contigs, 141 were recovered after aligning with the contigs. This recovery step did not ensure that the recovered regions were correctly assembled. The accuracy of each base in the contigs was therefore validated by aligning the contigs to the human reference assembly (GRCh38), and a mismatch rate was generated for each contig. As allelic variations exist in the genome, some mismatches do not necessarily represent errors. An examination of the SNP database (dbSNP) at NCBI identified 1925546 active SNPs on the 107043718-bp human chromosome 14 (~1.8%). Therefore, a contig is considered correctly assembled if it contains < 2% mismatches relative to the reference assembly. Contigs are otherwise considered erroneous.

Table 3.2: Summary of contigs assembled by different assemblers for tandem repeats.

Assembler | Correct Num | Error Num Total | Weak Num | Bad Num | Missing Num
TRA | 123 | 46 | 20 | 24 | 2
Velvet | 37 | 132 | 56 | 74 | 2
SOAPdenovo | 38 | 131 | 35 | 94 | 2
SGA | 28 | 141 | 55 | 84 | 2
ABySS2 | 64 | 105 | 25 | 78 | 2
Allpaths-LG | 10 | 159 | 17 | 140 | 2
Bambus2 | 58 | 101 | 22 | 87 | 2
CABOG | 58 | 101 | 15 | 94 | 2
MSR-CA | 58 | 101 | 19 | 90 | 2

Table 3.2 summarizes the contigs assembled by different assemblers for tandem repeats. The match rate threshold is set at 98%. A correct contig has > 98% match with the reference. Erroneous contigs fall into 3 categories: weak, bad and missing contigs. Weak contigs reach a 98% match with the reference after trimming either the head or the tail. Bad contigs have a mismatch rate > 2%. Missing contigs are those shorter than 200 bp in length.
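These categories can be captured in a small decision function. The sketch assumes that the match rates before and after end-trimming have already been computed by an aligner, that the length check is applied first, and that a trimmed contig must exceed the threshold to count as weak; the function name and argument names are hypothetical.

```python
def classify_contig(length, match_rate, trimmed_match_rate=None,
                    min_length=200, threshold=0.98):
    """Label a contig as correct, weak, bad or missing."""
    if length < min_length:
        return "missing"                     # too short to assess
    if match_rate > threshold:
        return "correct"
    if trimmed_match_rate is not None and trimmed_match_rate > threshold:
        return "weak"                        # correct only after trimming
    return "bad"

print(classify_contig(562, 0.991))
print(classify_contig(480, 0.95, trimmed_match_rate=0.985))
print(classify_contig(150, 0.99))
```

Counting the labels over all contigs of an assembler gives one row of the summary table.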

By these criteria, 123 of the 169 TRA-assembled contigs (72.3%) were correct (Table 3.2). Twenty of the remaining 46 contigs reached more than 98% match with the reference after end-trimming (weak contigs), and 24 contigs still exhibited less than 98% match with the reference sequence (bad contigs). Two contigs were shorter than 200 bp and were considered failed assemblies (missing contigs). After excluding the bad and missing contigs, the TRA method covered 84.3% of all tandem repeats on chromosome 14 with a correct N50 of 512 bp (Table 3.3).

Table 3.3 summarizes the correct assemblies of the assemblers for tandem repeats. The total length of all tandem repeats is 92841 bp. For TRA, only correct contigs and trimmed weak contigs were counted. Correct N50 was calculated by ordering these contigs from the longest to the shortest, then adding the lengths from the longest until reaching 50% of 92841 bp. Correct N40 and correct N30 are defined similarly. For the available assemblers, we obtained the substrings of their scaffolds that aligned to the reference of the corresponding tandem repeats. Again, only correct contigs and trimmed weak contigs were used. “-” means failure in this assessment, caused by the low coverage of contigs correctly assembled for tandem repeats.
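The difference between the correct Nx values and an ordinary Nx is that the target is a fraction of a fixed reference length rather than of the assembled length. A minimal sketch, with a hypothetical function name and toy numbers, shows how a "-" entry arises:

```python
def correct_nx(correct_lengths, reference_total, fraction=0.5):
    """Nx of the correct contigs measured against a fixed total length.
    Returns None (shown as "-") when the correct contigs together do
    not reach the required fraction of the reference total."""
    target = fraction * reference_total
    running = 0
    for length in sorted(correct_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

lengths = [500, 300, 200]                    # 1000 bp assembled correctly
print(correct_nx(lengths, 2000, 0.5))        # covers exactly half
print(correct_nx(lengths, 3000, 0.5))        # covers only a third
print(correct_nx(lengths, 3000, 0.3))
```

An assembler whose correct contigs cover less than 50% of the reference total therefore has no correct N50, although it may still have a correct N40 or N30.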

Table 3.3: Summary of correct assembly of assemblers for tandem repeats.

Assembler | Corr. N50 (bp) | Corr. N40 (bp) | Corr. N30 (bp) | Coverage
TRA | 512 | 570 | 622 | 84.3%
ABySS2 | - | 306 | 491 | 45.3%
Allpaths-LG | - | - | - | 9.5%
Bambus2 | - | 148 | 446 | 40.7%
CABOG | - | - | 426 | 37.5%
MSR-CA | - | 64 | 487 | 40.3%
SGA | - | - | - | 28.9%
SOAPdenovo | - | - | 57 | 30%
Velvet | - | - | 228 | 35.5%


To compare the performance of TRA with other assemblers in the analysis of tandem repeats, the tandem repeat regions specified above were extracted from the GRCh38 reference sequence and compared with scaffolds generated by state-of-the-art assemblers, including Velvet [74], SOAPdenovo [77], SGA [86], ABySS2 [75], Allpaths-LG [73], Bambus2 [128], CABOG [129] and MSR-CA [129]. These scaffolds were assembled for the whole chromosome 14 as detailed by Salzberg et al. [54]. In this study, the substrings of these scaffolds that aligned to the corresponding tandem repeats in the reference were extracted. Examination of the assembly accuracy of the above methods in these tandem repeat regions revealed that the best accuracy rate was 36.1%, obtained by ABySS2 with 64 contigs correctly assembled (Table 3.2).

In TRA, an assembled contig that displays > 98% match with the reference sequence is defined as a correct assembly, taking into consideration the SNP rate on chromosome 14. We then asked whether more contigs could qualify as correct when the stringency is reduced by decreasing the match rate threshold. The percentages of correct contigs were estimated at a range of match rates from 80% to 100% and the results are summarised in Figure 3.3. Match rate refers to the percentage of bases in a contig that match the reference. A match rate threshold of 100% allows no mismatch in an assembled contig. Since a real dataset was assembled in our experiments, it is unreasonable to accept no variation. Conversely, if too many mismatches are allowed, a “correct contig” could be a false positive. When counting correct contigs, the match rate threshold is therefore set at 98% to minimize the number of false negatives and false positives, based on the current percentage of SNPs on chromosome 14 (~1.8%). Nevertheless, TRA outperforms the other assemblers whatever the match threshold is.

The method most sensitive to this change of match rate threshold was SOAPdenovo, whose percentage of correct contigs improved from 23% to 28%. Improvements in the other methods were smaller. TRA remained the single outstanding performer in this specific function. This again shows the limitations of the prevailing assemblers in assembling tandem repeats from short sequence reads, with the best performance being < 40% correct contig assembly. TRA focuses on tandem repeat assembly, and its accuracy rate stayed above 70% at high match rate thresholds.

Figure 3.3: Correctness comparison of contigs assembled by different methods for

tandem repeats.

3.3.3 Contig distribution and SNP verification

Mapping all correct contigs to the reference human chromosome 14 sequence revealed that the TRA-assembled tandem repeats aligned with regions throughout the entire chromosome, except the chromosome head region (Figure 3.5). The lack of alignment with the head region was due to the lack of annotation in this 16-Mb end region, which accounts for 14.9% of the chromosome. In contrast, the TRA-assembled tandem repeats mapped densely to the tail region of chromosome 14. While the relationship between the TRA-assembled tandem repeats and the unannotated chromosome head region cannot be determined without the reference sequence, this stresses the importance of accurately assembling repetitive sequences when constructing continuous chromosome assemblies. The power of TRA in assembling contigs from tandem repeats makes it a valuable tool to incorporate into state-of-the-art assemblers for scaffold construction.

Figure 3.4: SNP verification and potential new SNP loci.

In Figure 3.4, the blue sequence is the human reference, while the black sequence is a contig assembled by TRA. All red mismatches are verified SNPs in the database of SNPs (dbSNP). The two green mismatch locations are potential new SNP loci. The alignment was produced with YASS [127].


Figure 3.5: The distribution of correct contigs.

All correct contigs assembled by TRA are mapped to their locations on the reference of chromosome 14. The left arc represents the whole of chromosome 14, while the labels, which form the right arc, represent all correct contigs. The region without links at the bottom corresponds to the 16 Mb of unknown bases at the beginning of the chromosome. This graph was drawn with the visualisation software Circoletto [133].

By comparing the TRA-assembled correct contigs with the corresponding reference sequence, a significant number of mismatches were recorded with positional information for further analysis. As all bases in the correct TRA-assembled contigs were supported by multiple aligning reads (see Methods), these mismatches can potentially represent genuine single nucleotide polymorphisms (SNPs). When the total of 261 mismatched locations were vetted against the NCBI dbSNP, 80 locations were found to be verified SNPs. No such conclusion can be drawn for the remaining 181 mismatches. An example of SNPs in a TRA-assembled tandem repeat is shown in Figure 3.4. Based on the alignment, this tandem repeat spans nucleotides 18967401 to 18967978 on chromosome 14. It contains 7 mismatches, 5 of which are verified SNPs in dbSNP. The other two mismatches could be potential SNPs identified using TRA. We also noted that this region was not covered by the assemblies derived from other assemblers.

3.3.4 Discussion

Tandem repeats are frequently found in gene promoters and coding regions. Polymorphism in these repeat regions can affect the structure, expression and functions of the gene products and is associated with various diseases. Such polymorphisms have been explored as powerful biomarkers for monogenic and polygenic diseases, but the extent of their application is hindered by the unique challenges of accurately identifying these repeats. The swift generation of massive sequence data has revolutionized de novo and comparative genomic studies, but the sequence reads are short, which makes reconstructing repetitive regions from them a daunting challenge. In this study, we have developed a new sequence assembly tool specifically to circumvent these challenges.

TRA exploits the paired-end nature of NGS sequence reads and validated tandem repeat motifs in reference sequences to assemble repeat-containing reads. The TRA method was evaluated on the GAGE benchmark dataset [54] by assembling tandem repetitive regions on human chromosome 14. With this dataset, the TRA method recovered 84.3% of tandem repeat regions, covered by correct contigs with an N50 of 512 bp. This is considerably higher than the coverage reached by currently available assembly methods, among which the best coverage, achieved with ABySS2, was 45.3%. A significant new feature that we introduced in TRA is the pre-selection of the sequence reads related to specific genomic regions in order to assemble these regions (see Methods). Therefore, in assembling each contig, the identities of the paired-end reads included are known. This enables an evaluation of how read quality affects contig accuracy. In this experiment, paired-end reads of 101 bp were used with an average insert size of 155 bp, so all read pairs are expected to exhibit overlap between the forward and reverse reads. In reality, only 90.4% of the sequence pairs could be merged based on the overlap threshold, and those which were not merged exhibited errors in the 3'-end sequences.

In theory, increased inclusion of TRA-unmerged reads in any specific tandem repeat region is expected to reduce the accuracy of the output contigs. Examining all the contigs assembled with TRA for the percentage of merged reads (good reads) incorporated in the output contigs revealed that merged reads accounted for < 60% of all reads in the majority of bad, weak and missing contigs (Figure 3.6). In Figure 3.6, the good reads rate refers to the percentage of merged reads among the total reads used to assemble each contig, and contig accuracy refers to the percentage of the contig that matches the reference. All contigs assembled by TRA are plotted in the graph. The blue points are correct contigs (>98% match with the reference); the light blue points are weak contigs (>98% match with the reference after trimming); the red points are bad contigs (<98% match with the reference); and the pink points are missing contigs (<200 bp in length). All correct contigs are located at the top of the figure, and most of them lie in the region where good reads account for more than 60%.

For the two tandem repetitive regions for which TRA failed to assemble contigs, i.e. the missing contigs, only 10% of all covering reads were merged reads. This means that a major factor affecting the correctness of TRA-assembled contigs is the availability of good-quality sequence reads covering the contig regions. On the other hand, TRA is robust in processing erroneous reads and is able to assemble quality contigs from sequencing data in which merged reads make up > 60% of all reads covering the contig regions. Improvements in sequencing technology would be expected to raise the efficiency of all assembler tools, including TRA.


Figure 3.6: Influence of reads quality on assembly.

Contig assembly with TRA is based on alignment support from paired-end reads, i.e. every base assembled in a contig represents the majority vote of several reads. After aligning the TRA-assembled contigs to the human reference, the position of each mismatched base was recorded and vetted against valid SNPs in the dbSNP database. This revealed that 30.7% of these mismatched bases were verified SNPs. As there is no evidence that the remaining mismatched bases are erroneous, the possibility that they represent SNPs cannot be excluded. In genetic terms, tandem repeat assembly is particularly prone to losing valuable information on variation. TRA works at the level of comparative assembly, and information on repeat motifs in a reference sequence is an essential input. Repeat motifs for various genomes can be extracted from repeat databases, such as TRDB and RepBase [45].
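The majority vote mentioned at the start of this paragraph can be sketched as a column-wise consensus over aligned reads. This is an illustrative sketch only, not the TRA implementation: it assumes the reads are already placed on a common coordinate system as equal-length strings, with '.' marking positions a read does not cover, and the function name is hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads: list) -> str:
    """Call each contig base as the most frequent base among covering reads.

    Assumption: every read is padded with '.' where it does not cover a
    column. Columns without any covering read are reported as 'N'.
    """
    length = len(aligned_reads[0])
    consensus = []
    for col in range(length):
        votes = Counter(read[col] for read in aligned_reads if read[col] != ".")
        consensus.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(consensus)
```

A base that disagrees with the reference but wins the vote across several reads is the kind of mismatch that was subsequently checked against dbSNP.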

A repeat motif also refers to a sequence pattern or sequence consensus in these repeat databases. These motifs are identified from draft genomes or long Sanger sequencing reads. For example, Tandem Repeats Finder identifies tandem repeats on an assembled genome [130]. Similarly, RECON identifies repeat motifs from a reference genome [131], and RepeatScout uses long Sanger sequence reads to obtain repeat consensus sequences [48]. The question remains how tandem repeats may be assembled de novo from short NGS reads generated from the genomes of new species. In this regard, de novo creation of repeat motifs from NGS data has recently been proposed [132]. When repeat motif creation from NGS reads becomes possible, the TRA platform can be applied to the discovery of tandem repeats in new species without relying on assembled genomes.

3.4 Summary

This chapter proposed an innovative method for assembling tandem repeats. We used repeat motifs from a tandem repeat database to select all candidate reads, and then assembled these reads back into complete tandem repeats. Contig generation is a dynamic process that keeps updating until a static result is reached.

We compared our TRA method with state-of-the-art methods. The comparison showed that TRA outperformed the other methods on a benchmark dataset of human chromosome 14.


Chapter 4

4 Innovative assembly method of DNA transposon

4.1 Introduction

Repeat sequences are major components of eukaryotic genomes. Some repetitive elements are able to move within the genome through biological mechanisms; we call these repetitive elements transposable elements. They make up a large fraction of many eukaryotic genomes. For example, transposable elements make up 45% of the human genome and up to 80% of the maize genome. Transposable elements are normally categorized into two classes (Figure 4.1). Class I elements are referred to as retrotransposons, as they function via reverse transcription of an RNA intermediate; they move from one genomic location to another by a copy-and-paste mechanism. Class II elements are referred to as DNA transposons; they move from one genomic location to another by a cut-and-paste mechanism. In this chapter, we propose an assembly method for DNA transposons, and in Chapter 5 we present an assembly method for retrotransposons.

Among existing assembly methods, those based on the de Bruijn graph are the most widely used. Velvet, SOAPdenovo and ABySS2 are typical representatives of de Bruijn graph based assembly methods. However, generating a de Bruijn graph requires breaking all NGS reads down into k-mers, so the structural features of DNA transposons are entirely ignored. We argue that these structural features are unique to DNA transposons and therefore should not be discarded during assembly. Read from 5' to 3', a DNA transposon starts with a direct repeat followed by an inverted terminal repeat, and ends with an inverted terminal repeat followed by a direct repeat. This feature can help an assembler to find a pair consisting of the head part and the tail part of a DNA transposon. What is then required is a method to fill the gap between the head part and the tail part.

[Figure 4.1 panels. DNA transposon: DR, ITR, transposase, ITR, DR. LTR retrotransposon: TSD, LTR, Gag, Prt, Pol, Env, LTR, TSD. Non-LTR retrotransposon: TSD, 5' UTR, ORF 1, ORF 2, 3' UTR, poly-A, TSD.]

Figure 4.1: Types of transposable elements and their structures.

The primary contributions of this work are summarized as follows:

• We proposed a dot matrix based algorithm for finding DNA transposon features. This algorithm helps us to select the pairs of seeds which are the head and tail parts of DNA transposons. Furthermore, the use of structural information provides a new direction in sequence assembly.

• We proposed a read quality based matching algorithm. This method extends the head seed and fills the gap between the head seed and the tail seed. Furthermore, it reduces the adverse effect of poor-quality reads on the assembly result.

• We conducted a number of experiments to evaluate our DNA transposon assembly (DTA) method. The results show that our method is superior to the state-of-the-art methods for assembling DNA transposons.

The rest of this chapter is organized as follows. First, we define the DNA transposon from the perspective of its structural features. Second, we propose two algorithms for assembling DNA transposons, i.e. the DNA transposon feature finding algorithm and the extension algorithm. Third, we evaluate our method by comparing it with state-of-the-art methods on benchmark datasets. Finally, we conclude this chapter.

4.2 Method

In this section, we present our algorithm for DNA transposon assembly. The algorithm uses the structural features of DNA transposons to select pairs of seeds from NGS reads. Each pair of seeds contains the head part and the tail part of a DNA transposon. The assembly process then applies a sequence quality based method to gradually extend the head seed until it meets the tail seed. Since this assembly process relies on two methods, seed selection and quality based extension, we first present the seed selection method and then the quality based seed extension method. Finally, we combine these two methods to form our DNA transposon assembly algorithm.

4.2.1 Overview and basic terms

We formally define basic terms and overview the algorithm.

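Definition 4.1 ($DR$). A short direct repeat sequence $DR$ is a sequence of nucleotides $s_1 s_2 \ldots s_n$ from 5' to 3', where each nucleotide $s_i \in \{A, T, C, G\}$. If $DR_1$ and $DR_2$ are the same, then the length of $DR_1$ is equal to that of $DR_2$ and $DR_1[i]$ is the same as $DR_2[i]$, where $DR[i]$ is the $i$th nucleotide of $DR$. A DNA transposon has two identical $DR$s, one located at the 5' end of the sequence and the other at the 3' end.

Definition 4.2 ($ITR$). An inverted terminal repeat $ITR$ is a pair of sequences of nucleotides, where each nucleotide $s_i \in \{A, T, C, G\}$. If two sequences $S_1$ and $S_2$ are inverted terminal repeats, then the length of $S_1$ is equal to that of $S_2$, and $S_1[i]$ is the same as $S_2[n - i + 1]$, where $1 \le i \le n$ and $n$ is the length of $S_1$. For an $ITR$, we denote by $ITR_{5'}$ the part located at the 5' end of the sequence, and by $ITR_{3'}$ the other part located at the 3' end. A DNA transposon has a pair of $ITR_{5'}$ and $ITR_{3'}$.

Definition 4.3 (DNA transposon feature). A sequence of nucleotides has a DNA transposon feature if it has two parts, where the part at the 5' end has the structure $DR \oplus ITR_{5'}$ and the other part at the 3' end has the structure $ITR_{3'} \oplus DR$. Here $\oplus$ is the concatenation operator: if $X = x_1 \ldots x_p$ and $Y = y_1 \ldots y_q$, then $X \oplus Y = x_1 \ldots x_p y_1 \ldots y_q$, and $|X \oplus Y| = p + q$.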

DNA transposon assembly proceeds in two phases (Figure 4.2). Phase I uses NGS data to select pairs of seeds with the feature finding algorithm, while Phase II uses the seed cluster to assemble the DNA transposon with the extension algorithm. Specifically, the assembly takes four steps.

1. The feature finding algorithm scans the NGS reads to select pairs of seeds. If one read has the DR-ITR feature (a direct repeat followed by an inverted terminal repeat) and another read has the ITR-DR feature (an inverted terminal repeat followed by a direct repeat), we select these two reads as a pair of seeds, where the read with the DR-ITR feature is the head seed and the read with the ITR-DR feature is the tail seed.

2. For the head seed of each pair of seeds, we collect a cluster of NGS reads that overlap the head seed by no less than 30 base pairs.

3. A quality based method chooses the best-matching read from the cluster to extend the head seed.

4. If the extended sequence overlaps the tail seed, we output this sequence as a DNA transposon. Otherwise, we collect a cluster of NGS reads that overlap the extended sequence by no less than 30 base pairs and go back to step 3, using the quality based extension method to obtain a further extended sequence.


It can be seen from these steps that the assembly process relies on two methods, i.e. the feature finding method to select pairs of seeds and the quality based method to extend the seed.

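[Figure 4.2 flow chart. Phase I: NGS data, feature finding algorithm, seeds. Phase II: seed cluster, extension algorithm, extended sequence, condition check. If the condition is met, the DNA transposon is output; if the condition is not met, the process returns to the seed cluster.]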

Figure 4.2: Overview of DNA transposon assembly method.

4.2.2 Seed selection

The seed selection method uses the structural feature of DNA transposons, i.e. the DR-ITR and ITR-DR structures. To determine whether two sequence reads have the DNA transposon feature, we align them using a dot matrix. A dot matrix is constructed by writing the two sequence reads along the top row and the leftmost column of a two-dimensional matrix. A dot is placed at every position where the nucleotide of the row matches that of the column. If two identical sequences are used to construct a dot matrix, the dots plotted on the main diagonal appear as a line. If two similar sequences are plotted in a dot matrix, the region that shares significant similarity appears as a line parallel to the main diagonal (Figure 4.3). Visually, a mismatch in a significantly similar region appears as a blank spot in the line, whereas an indel in such a region dislocates the line into two parallel lines. If the number of mismatches and indels is under a threshold that depends on the number of matched bases, we say the two sequences are similar. Table 4.1 presents how to use a dot matrix to align two sequences and determine whether they are similar.

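  A T G C C G A T G A A T G T C
A • . . . . . • . . • • . . . .
A • . . . . . • . . • • . . . .
T . • . . . . . • . . . • . • .
G . . • . . • . . • . . . • . .
C . . . • • . . . . . . . . . •
C . . . • • . . . . . . . . . •
G . . • . . • . . • . . . • . .
A • . . . . . • . . • • . . . .
T . • . . . . . • . . . • . • .
G . . • . . • . . • . . . • . .
T . • . . . . . • . . . • . • .
A • . . . . . • . . • • . . . .
T . • . . . . . • . . . • . • .
G . . • . . • . . • . . . • . .
T . • . . . . . • . . . • . • .

(• marks a matching pair of nucleotides; . marks an empty cell.)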

Figure 4.3: Dot matrix of two sequences.


Table 4.1: Dot matrix based alignment method.

[Algorithm listing: 25 numbered pseudocode lines; the content is not recoverable from the source text.]
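Since the body of this listing is not available, the following is a minimal sketch of a dot matrix based similarity check, written under stated assumptions rather than as a reconstruction of Table 4.1. It builds the dot matrix and finds the longest segment on any line parallel to the main diagonal that contains at most a given number of blank cells (mismatches). Indels are not handled, and all function and parameter names are hypothetical.

```python
def dot_matrix(row_seq: str, col_seq: str) -> list:
    """Cell [i][j] is True where row nucleotide i equals column nucleotide j."""
    return [[r == c for c in col_seq] for r in row_seq]


def best_diagonal_segment(row_seq: str, col_seq: str, max_mismatches: int = 0):
    """Longest diagonal segment with at most max_mismatches blank cells.

    Returns (length, start_row, start_col), using 0-based indices.
    Each diagonal is scanned with a sliding window; indels are ignored
    in this sketch.
    """
    best = (0, -1, -1)
    n, m = len(row_seq), len(col_seq)
    for offset in range(-(n - 1), m):            # offset = column - row
        i0, j0 = max(0, -offset), max(0, offset)  # first cell of this diagonal
        left = mismatches = 0
        for right in range(min(n - i0, m - j0)):
            if row_seq[i0 + right] != col_seq[j0 + right]:
                mismatches += 1
            while mismatches > max_mismatches:    # shrink window from the left
                if row_seq[i0 + left] != col_seq[j0 + left]:
                    mismatches -= 1
                left += 1
            length = right - left + 1
            if length > best[0]:
                best = (length, i0 + left, j0 + left)
    return best
```

Applied to the two sequences of Figure 4.3, the exact-match call finds the 9-nucleotide line that starts at the second row and first column, and allowing one mismatch extends it to 14 nucleotides on the same diagonal.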

The DNA transposon feature finding algorithm is based on the dot matrix alignment. Given a pairwise alignment plotted in a dot matrix, if one sequence has the DR-ITR structure and the other has the ITR-DR structure, two regions that share significant similarity appear. Since DR is a direct repeat from 5' to 3', one region appears as a line parallel to the main diagonal; since ITR is an inverted repeat, the other region appears as a line parallel to the anti-diagonal (Figure 4.4). Let $l_{DR}$ be the plotted line parallel to the main diagonal and $l_{ITR}$ the plotted line parallel to the anti-diagonal; then $l_{ITR}$ lies on the perpendicular bisector of $l_{DR}$. Among all pairwise alignments, we select the pairs of sequences that have DNA transposon features.

We present the algorithm for finding the DNA transposon feature in a dot matrix in Table 4.2. The direct repeat marks the head and tail of a DNA transposon, and it appears in the dot matrix as a plotted line parallel to the main diagonal. If the number of mismatches and indels in the direct repeat is under the threshold, we go further and check whether there is a plotted line parallel to the anti-diagonal. We use the row of the last nucleotide in DR and the column of the first nucleotide in DR to locate the starting spot of ITR. If, in the direction of the anti-diagonal, there is a plotted line in which the mismatches and indels are under the threshold, we select this pair of sequences as seeds.

Table 4.2: DNA transposon feature finding method.

[Algorithm listing: 53 numbered pseudocode lines; the content is not recoverable from the source text.]
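As the listing itself is not available, the sketch below illustrates the feature search described in the text under simplifying assumptions; it is not the author's algorithm. It accepts exact matches only (no mismatches or indels), takes each maximal diagonal run as a candidate DR, and then walks along the anti-diagonal starting just below the last row and just left of the first column of that run. Following the dot matrix convention of matching identical nucleotides, the inverted repeat is treated as a reversed copy. The function name and the default thresholds `dr_min` and `itr_min` are hypothetical.

```python
def find_transposon_feature(head: str, tail: str, dr_min: int = 7, itr_min: int = 5):
    """Look for a DR line followed by an ITR line in the head/tail dot matrix.

    head is written down the rows and tail across the columns.
    Returns a dict with the DR and both ITR halves, or None if not found.
    """
    n, m = len(head), len(tail)
    for i in range(n):
        for j in range(m):
            # Extend an exact run parallel to the main diagonal (candidate DR).
            k = 0
            while i + k < n and j + k < m and head[i + k] == tail[j + k]:
                k += 1
            if k < dr_min:
                continue
            # Walk the anti-diagonal: rows increase while columns decrease.
            r, c, length = i + k, j - 1, 0
            while r < n and c >= 0 and head[r] == tail[c]:
                length += 1
                r += 1
                c -= 1
            if length >= itr_min:
                return {
                    "dr": head[i:i + k],
                    "itr_head": head[i + k:i + k + length],
                    "itr_tail": tail[j - length:j],
                }
    return None
```

With the two sequences of Figure 4.4, this returns the direct repeat ATGCCGA together with the inverted pair TGTAT and TATGT.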

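  T A T G T A T G C C G A C C A
A . • . . . • . . . . . • . . •
A . • . . . • . . . . . • . . •
T • . • . • . • . . . . . . . .
G . . . • . . . • . . • . . . .
C . . . . . . . . • • . . • • .
C . . . . . . . . • • . . • • .
G . . . • . . . • . . • . . . .
A . • . . . • . . . . . • . . •
T • . • . • . • . . . . . . . .
G . . . • . . . • . . • . . . .
T • . • . • . • . . . . . . . .
A . • . . . • . . . . . • . . •
T • . • . • . • . . . . . . . .
C . . . . . . . . • • . . • • .
A . • . . . • . . . . . • . . •

(• marks a matching pair of nucleotides; . marks an empty cell.)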

Figure 4.4: Dot matrix of two sequences which have DNA transposon features.

4.2.3 Computing sequence matching score

Every base in a sequence read is called from the color of the fluorescent dye attached to it. During the sequencing process, a charge-coupled device (CCD) records time series data, namely the colors of each base. Sometimes the DNA fragments do not form a perfect cluster after amplification, which causes density issues. Consequently, the shape and location of the color signal become inconsistent and ambiguous. This is why some bases are called incorrectly and why a quality measurement is necessary.

Base-specific quality scores were proposed to enhance the accuracy of consensus sequences during the assembly process [134]. They were soon replaced by the Phred base calling method, which was adopted by the Human Genome Project [135, 136]. The Phred base calling method was widely applied in the majority of genome sequencing projects because it generates 40%-50% fewer errors than other methods [137]. The Phred quality score is now the standard scheme for measuring base quality. The Phred quality score $Q$ is defined in terms of the base-calling error probability $P_{err}$ by Equation 4.1:

$Q = -10 \log_{10} P_{err}$    (4.1)

Table 4.3 explains the meaning of the Phred quality score and Figure 4.5 shows the encoding schemes used by different versions of sequencing technologies.

Table 4.3: The meaning of Phred quality score

Phred quality score    Probability of incorrect base call    Base call accuracy
10                     1 in 10                               90%
20                     1 in 100                              99%
30                     1 in 1000                             99.9%
40                     1 in 10000                            99.99%
50                     1 in 100000                           99.999%

[Figure 4.5 content. The quality characters span ASCII 33 (!) to 104 (h), with marks at 33 (!), 59 (;), 64 (@), 73 (I) and 104 (h).]

Encoding         Quality score range    ASCII range
Sanger           0 to 40                33 to 73
Solexa           -5 to 40               59 to 104
Illumina 1.3+    0 to 40                64 to 104
Illumina 1.5+    3 to 40                67 to 104
Illumina 1.8+    0 to 40                33 to 73

Figure 4.5: Encoding scheme of different sequencing technologies.
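The relation between a Phred quality score and the base-calling error probability, and the ASCII encodings of Figure 4.5, can be checked with a short sketch. The function names are illustrative; the offsets 33 (Sanger, Illumina 1.8+) and 64 (Illumina 1.3+ and 1.5+) are the ASCII values at which each encoding's score 0 sits. Solexa scores use a different score definition and are not covered here.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(quality_string: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into integer Phred scores.

    offset is the ASCII value that encodes score 0 (33 or 64).
    """
    return [ord(ch) - offset for ch in quality_string]
```

For example, the character `I` decodes to 40 with offset 33, which corresponds to a 1 in 10000 chance of an incorrect base call, as listed in Table 4.3.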

Suppose a sequence $X$ consists of nucleotides $\{x_1, x_2, \ldots, x_n\}$, where $x_i \in \{A, T, C, G\}$, and another sequence $Y$ consists of nucleotides $\{y_1, y_2, \ldots, y_n\}$. We denote the matching score between two nucleotides $x_i$ and $y_i$ as $m(x_i, y_i)$. The nucleotide matching score is defined piecewise from the Phred quality scores of the two nucleotides [equation not recoverable from the source], where $Q(\cdot)$ is the Phred quality score of a nucleotide, $Q_{max}$ is a constant equal to the maximum value of the Phred quality score, and $P$ is a constant that serves as the penalty for an indel. The range of the Phred quality score varies between sequencing technologies. In sequence alignment, an indel is normally considered worse than a mismatch, so in our experiments we set the value of $P$ to -1 or less, which makes it a higher penalty than a mismatch.

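Based on the nucleotide matching score, the sequence matching score is calculated as the average matching score over all nucleotides. The sequence matching score is defined as:

$M(X, Y) = \frac{1}{n} \sum_{i=1}^{n} m(x_i, y_i)$

where $m(x_i, y_i)$ is the matching score of the $i$th pair of aligned nucleotides and $n$ is the number of aligned nucleotides.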
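The averaging step can be illustrated with a small sketch. The per-nucleotide formula used here is an assumption made only for illustration, because the exact definition is not available in the text: a match scores the mean of the two Phred scores divided by the maximum score, a mismatch scores the negative of that value, and an indel scores a constant penalty of -1. All names are hypothetical.

```python
GAP = "-"


def nucleotide_score(x, y, qx, qy, q_max=40, indel_penalty=-1.0):
    """Assumed per-nucleotide score (illustrative, not the thesis formula).

    Match: +(qx + qy) / (2 * q_max); mismatch: the negative of that value;
    indel (either side is a gap): indel_penalty.
    """
    if x == GAP or y == GAP:
        return indel_penalty
    weight = (qx + qy) / (2 * q_max)
    return weight if x == y else -weight


def sequence_score(xs, ys, qxs, qys, q_max=40, indel_penalty=-1.0):
    """Sequence matching score: arithmetic mean of the nucleotide scores."""
    scores = [
        nucleotide_score(x, y, qx, qy, q_max, indel_penalty)
        for x, y, qx, qy in zip(xs, ys, qxs, qys)
    ]
    return sum(scores) / len(scores)
```

Under this assumed scoring, two identical reads with maximum quality at every base score 1.0, and every low-quality, mismatched or gapped position pulls the average down.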

4.2.4 Extension algorithm

Using the feature finding algorithm, we obtain pairs of seeds which are the heads and tails of DNA transposons. To connect two seeds into a contig sequence, we propose an extension algorithm. The main idea is to extend the head seed with the best-matching sequence read, and to stop when the extended part can be aligned with the tail seed.

The extension algorithm starts by finding all related sequence reads that overlap the head seed by at least 30 bp. The matching score is then calculated for every related sequence read, and the read with the highest matching score is selected to extend the head seed. We cut the selected read at the position of the last overlapping base pair and connect the remaining tail of the read to the head seed. We now have a longer sequence, for which we decide whether to extend further or to output it as a result. If the tail seed does not overlap this new sequence, we continue the extension process. Otherwise, we merge the tail seed onto the end and output the result as the final sequence.
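The extension loop described above can be sketched as follows. This is a simplified illustration, not the DTA implementation: overlaps are exact suffix-prefix matches, and the read score is assumed to be the mean Phred quality of the overlapping bases, standing in for the sequence matching score of Section 4.2.3. The function names, the `(sequence, qualities)` read format and `max_steps` are hypothetical.

```python
def suffix_prefix_overlap(left: str, right: str, min_overlap: int) -> int:
    """Longest k >= min_overlap with left[-k:] == right[:k], or 0 if none."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def extend(head_seed, tail_seed, reads, min_overlap=30, max_steps=1000):
    """Extend head_seed read by read until it overlaps tail_seed.

    reads is a list of (sequence, qualities) pairs. Returns the assembled
    sequence, or None when no read can extend the contig any further.
    """
    contig = head_seed
    for _ in range(max_steps):
        k = suffix_prefix_overlap(contig, tail_seed, min_overlap)
        if k:
            return contig + tail_seed[k:]      # merge the tail seed and stop
        candidates = []
        for seq, quals in reads:
            k = suffix_prefix_overlap(contig, seq, min_overlap)
            if k and k < len(seq):             # the read must add new bases
                score = sum(quals[:k]) / k     # assumed quality based score
                candidates.append((score, seq[k:]))
        if not candidates:
            return None
        contig += max(candidates)[1]           # attach only the read's tail
    return None
```

As in the demonstration of Figure 4.6, only the part of the selected read beyond the last overlapping base is attached to the growing sequence.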

We present the extension process in Figure 4.6. For demonstration purposes, we align all related sequence reads that overlap the head seed by at least 3 bp. After calculating the sequence matching scores, AGAAGTTTCTGAGAATGCTTCT is the best-matching sequence, with a score of 0.946. We then cut off the overlapping part and attach only the tail, CTTCT, to the head seed.


Figure 4.6: Sequence extension demonstration.

[Figure 4.6 labels: head seed; the most matched sequence read, with a score of 0.946; cutting position.]

4.3 Results

We evaluate our DNA transposon assembly method on a benchmark dataset, human chromosome 14. Since the method focuses on DNA transposons only, we constructed 8 datasets of different sizes from the original benchmark dataset. These datasets contain the NGS reads that overlap DNA transposons by no less than 20 base pairs; such reads carry information about the DNA transposons and are actually useful for assembly. In increasing order of scale, the datasets contain the NGS reads related to 10, 30, 60, 100, 150, 200, 300 and 350 DNA transposons respectively.

To further validate the performance on a different species, we also evaluate the DNA transposon assembly method on a real maize dataset downloaded from NCBI. We also downloaded the reference sequence, B73 RefGen_v4 chromosome 8, from NCBI. This latest reference contains 181.12 Mb of sequence in total, with 47% GC content. We downloaded repetitive sequences, which include DNA transposons, from the Maize Repeat Database. We constructed 8 datasets of different sizes from the downloaded dataset. These datasets contain the NGS reads related to 10, 20, 50, 75, 100, 150, 200 and 300 DNA transposons respectively.

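The performance of an assembly method was measured in terms of precision, recall and f-value, which are defined as follows. Suppose the sequence of the DNA transposon to assemble is $S$. Let $N_o$ be the number of nucleotides originally in $S$ and $N_a$ be the number of assembled nucleotides for $S$. Let $N_c$ be the number of overlapping nucleotides between the original sequence and the assembled sequence. Then precision and recall can be defined as:

$precision = \frac{N_c}{N_a}, \qquad recall = \frac{N_c}{N_o}$

Based on the precision and recall, we can define the f-value as:

$f\text{-}value = \frac{2 \times precision \times recall}{precision + recall}$

For $n$ sequences to be assembled, the precision, recall and f-value are the arithmetic means over all sequences:

$precision = \frac{1}{n} \sum_{i=1}^{n} precision_i, \qquad recall = \frac{1}{n} \sum_{i=1}^{n} recall_i, \qquad f\text{-}value = \frac{1}{n} \sum_{i=1}^{n} f\text{-}value_i$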
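These definitions can be made concrete with a short sketch. It assumes the three nucleotide counts for each transposon are already known (original, assembled and overlapping); the function names are illustrative.

```python
def precision_recall_f(n_original: int, n_assembled: int, n_overlap: int):
    """Per-sequence precision, recall and f-value from nucleotide counts."""
    precision = n_overlap / n_assembled if n_assembled else 0.0
    recall = n_overlap / n_original if n_original else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


def mean_metrics(counts):
    """Arithmetic mean of each metric over several assembled sequences.

    counts is a list of (n_original, n_assembled, n_overlap) tuples.
    """
    rows = [precision_recall_f(*c) for c in counts]
    return tuple(sum(row[i] for row in rows) / len(rows) for i in range(3))
```

For instance, a transposon of 100 nucleotides assembled as 80 nucleotides, of which 60 overlap the original, has a precision of 0.75, a recall of 0.6 and an f-value of about 0.67.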

4.3.1 Parameter setting

The DNA transposon assembly algorithm has six parameters: the threshold of matched nucleotides in DR, the threshold of matched nucleotides in ITR, the number of mismatches allowed in DR, the number of mismatches allowed in ITR, the number of indels allowed in DR, and the number of indels allowed in ITR. The values of the two match thresholds generally depend on the assembly expectation. For example, if the assembly is expected to return more candidates, the two thresholds are set to small values accordingly. The values of the mismatch and indel parameters generally depend on the mutation rate. For example, if one variation is expected in a repeat, the number of mismatches allowed in that repeat is set to 1 accordingly.

In our experiments, we tested the six parameters using a dataset which contains the NGS reads related to 50 DNA transposons. We varied one parameter at a time while holding the rest fixed as control variables (Table 4.4 - Table 4.9). We use the f-value to measure the performance as each parameter changes, and keep the parameter value at which the f-value peaks (Figure 4.7 - 4.12). In this case, the threshold of matched nucleotides is set to 7 in DR and 5 in ITR, the number of mismatches allowed is set to 1 in both DR and ITR, and the number of indels allowed is set to 0 in both DR and ITR.


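Table 4.4: Parameter setting on the threshold of matched nucleotides in DR.

DR match threshold:         4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
ITR match threshold:        5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed in DR:   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Mismatches allowed in ITR:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Indels allowed in DR:       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed in ITR:      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Figure 4.7: Assembly performance on different thresholds of matched nucleotides in DR.
[Plot: f-value (axis 0 to 0.8) against the threshold of matched nucleotides in DR (4 to 20).]

We varied the threshold of matched nucleotides in DR from 4 to 20 while keeping the other parameters fixed. The f-value reaches its peak when the threshold equals 7.

Table 4.5: Parameter setting on the threshold of matched nucleotides in ITR.

DR match threshold:         7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
ITR match threshold:        4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Mismatches allowed in DR:   1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Mismatches allowed in ITR:  1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Indels allowed in DR:       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed in ITR:      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Figure 4.8: Assembly performance on different thresholds of matched nucleotides in ITR.
[Plot: f-value (axis up to 0.8) against the threshold of matched nucleotides in ITR (4 to 20).]

Table 4.6: Parameter setting on the number of mismatches allowed in DR.

DR match threshold:         7 7 7 7 7 7 7 7 7 7 7
ITR match threshold:        5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed in DR:   0 1 2 3 4 5 6 7 8 9 10
Mismatches allowed in ITR:  1 1 1 1 1 1 1 1 1 1 1
Indels allowed in DR:       0 0 0 0 0 0 0 0 0 0 0
Indels allowed in ITR:      0 0 0 0 0 0 0 0 0 0 0

Figure 4.9: Assembly performance on different numbers of mismatches allowed in DR.
[Plot: f-value (axis 0 to 0.8) against the number of mismatches allowed in DR (0 to 10).]

Table 4.7: Parameter setting on the number of mismatches allowed in ITR.

DR match threshold:         7 7 7 7 7 7 7 7 7 7 7
ITR match threshold:        5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed in DR:   1 1 1 1 1 1 1 1 1 1 1
Mismatches allowed in ITR:  0 1 2 3 4 5 6 7 8 9 10
Indels allowed in DR:       0 0 0 0 0 0 0 0 0 0 0
Indels allowed in ITR:      0 0 0 0 0 0 0 0 0 0 0

Figure 4.10: Assembly performance on different numbers of mismatches allowed in ITR.
[Plot: f-value (axis 0 to 0.8) against the number of mismatches allowed in ITR (0 to 10).]

Table 4.8: Parameter setting on the number of indels allowed in DR.

DR match threshold:         7 7 7 7 7 7 7 7 7 7 7
ITR match threshold:        5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed in DR:   1 1 1 1 1 1 1 1 1 1 1
Mismatches allowed in ITR:  1 1 1 1 1 1 1 1 1 1 1
Indels allowed in DR:       0 1 2 3 4 5 6 7 8 9 10
Indels allowed in ITR:      0 0 0 0 0 0 0 0 0 0 0

Figure 4.11: Assembly performance on different numbers of indels allowed in DR.
[Plot: f-value (axis 0 to 0.8) against the number of indels allowed in DR (0 to 10).]

Table 4.9: Parameter setting on the number of indels allowed in ITR.

DR match threshold:         7 7 7 7 7 7 7 7 7 7 7
ITR match threshold:        5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed in DR:   1 1 1 1 1 1 1 1 1 1 1
Mismatches allowed in ITR:  1 1 1 1 1 1 1 1 1 1 1
Indels allowed in DR:       0 0 0 0 0 0 0 0 0 0 0
Indels allowed in ITR:      0 1 2 3 4 5 6 7 8 9 10

Figure 4.12: Assembly performance on different numbers of indels allowed in ITR.
[Plot: f-value (axis 0 to 0.8) against the number of indels allowed in ITR (0 to 10).]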





SOAPdenovo

We compared the assembly performance of our DTA method with three existing assembly methods that also use NGS reads but do not target DNA transposon assembly: SOAPdenovo, Velvet and ABySS2. All three use a de Bruijn graph to assemble DNA sequences, but they apply different strategies when generating the graph.

None of them uses the structural features of DNA transposons. SOAPdenovo removes erroneous connections from the graph through the steps of clipping tips, removing low-coverage links, resolving tiny repeats and merging bubbles. Velvet adopts the Tour Bus method to correct errors, including the steps of removing tips, removing bubbles and removing erroneous connections. ABySS2 uses parallel computing and otherwise follows steps similar to those of other de Bruijn graph based methods. These three methods assemble DNA transposons in the same way as they assemble other regions, i.e. they ignore the structural features of DNA transposons. In contrast, our DTA method examines the structural features of DNA transposons and assembles them accordingly.

None of these three existing methods uses the quality information of NGS reads during assembly. Even though some bad reads are removed during de Bruijn graph generation, a lot of noise remains during contig generation. In our DTA method, we use quality to weight the NGS reads and select the best one to assemble the DNA transposon.

The comparison of these four methods in terms of precision, recall and f-value over the eight datasets is shown in Figures 4.13 to 4.15 respectively. These results show that the method using structural features (i.e. DTA) outperformed the methods that ignore them (i.e. ABySS2, Velvet and SOAPdenovo). Furthermore, DTA clearly outperformed the other methods across all evaluation datasets, even on the small-scale datasets. This suggests that poor-quality reads pull down the performance of de Bruijn graph based methods, whereas filtering out the bad NGS reads and using only the good ones, as DTA does, improves assembly performance.


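Figure 4.13: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.
[Plot: precision (axis 0.2 to 0.9) for data sets 1 to 8; series: DTA, ABySS2, Velvet, SOAPdenovo.]

Figure 4.14: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.
[Plot: recall (axis 0.2 to 0.9) for data sets 1 to 8; series: DTA, ABySS2, Velvet, SOAPdenovo.]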

Figure 4.15: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Human chromosome 14.
[Plot: f-value (axis 0.2 to 0.9) for data sets 1 to 8; series: DTA, ABySS2, Velvet, SOAPdenovo.]

To evaluate the overall effectiveness and performance of the four methods, we drew precision-recall curves from their assembly results. These curves are shown in Figure 4.16, which demonstrates that the overall performance of the DTA method is better than that of ABySS2, Velvet and SOAPdenovo.

To observe the performance comparison in detail, we chose the eighth dataset, which contains 350 DNA transposons. We compared the DTA method with Velvet, SOAPdenovo and ABySS2 to see how many DNA transposons can be correctly assembled. The comparison result is shown in Table 4.10. Correct means that >98% of nucleotides are assembled; weak means that >98% of nucleotides are assembled but with an extra erroneous head or tail; bad means that <98% of nucleotides are assembled; missing means that the assembly failed. Weak, bad and missing assemblies are collectively called errors. The comparison shows that our DTA method produced better assemblies than the other methods.

Figure 4.16: The precision-recall curves of DTA, ABySS2, Velvet, and SOAPdenovo method on Maize chromosome 8.

Table 4.10: Summary of DNA transposons assembled by different assemblers

Assembler     Correct   Error   Weak   Bad   Missing
DTA           162       138     85     48    5
Velvet        98        202     134    58    10
SOAPdenovo    91        209     140    59    10
ABySS2        134       166     102    56    8

[Plot for Figure 4.16: precision (y-axis, 0.1 to 1) against recall (x-axis, 0.2 to 0.9).]

Figure 4.17: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Maize chromosome 8.

To validate that the performance is consistent across species, we also compared our DTA method with ABySS2, Velvet and SOAPdenovo on Maize datasets, which contain many DNA transposons. The comparisons in terms of precision, recall, and f-value over eight datasets are shown in Figures 4.17 to 4.19 respectively. The results show that de Bruijn graph based methods performed worse when DNA transposons account for a larger share of a genome. Our DTA method outperformed the three existing methods across all datasets. Moreover, DTA still performed consistently well in transposon-rich regions. This is because DTA assembles each DNA transposon independently of other regions, so the assembly result is not affected by the environment, i.e. a transposon-rich region.

[Plot for Figure 4.17: precision (y-axis, 0.2 to 1) against data set (x-axis, 1 to 8).]


Figure 4.18: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Maize chromosome 8.

[Plot: recall (y-axis, 0.1 to 0.8) against data set (x-axis, 1 to 8) for DTA, ABySS2, Velvet, and SOAPdenovo.]

Figure 4.19: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo method on Maize chromosome 8.

[Plot: f-value (y-axis, 0.1 to 0.9) against data set (x-axis, 1 to 8) for DTA, ABySS2, Velvet, and SOAPdenovo.]

4.3.3 Distribution and variation

Figure 4.20: Assembly result of MamRep434 mapping to reference.


Figure 4.21: Assembly result of 28 DNA transposons mapping to reference.


To observe the distribution and correctness of the assembled DNA transposons, we assembled one DNA transposon, MamRep434, using DTA and mapped the assembled result to the reference genome of Human chromosome 14. Figure 4.20 demonstrates that DTA correctly assembled MamRep434, whose assembled result can be uniquely mapped to chromosome 14. We also randomly chose 28 DNA transposons across Human chromosome 14 and mapped the assembly results in Figure 4.21. For some DNA transposons (e.g. tigger3b), the assembled results map to many locations across the reference genome. On further investigation, such a result has a best hit at one spot, while parts of it can be mapped to many other places. The blue arcs in the figure represent partial mappings. This implies that repeat motifs are common in DNA transposons, and these repeat motifs confuse most sequence assemblers and decrease their performance. To gain a better understanding, we also assembled the DNA transposons at the beginning part of Human chromosome 14 in Figure 4.22 and Figure 4.23. One dataset contains 103 DNA transposons and the other contains 163 DNA transposons. The figures demonstrate that our DTA method is able to assemble DNA transposons in a dense area.

To determine the variation in DNA transposons, we aligned all our assembled results to the reference genome, GRCh38, using Blastn. The Blastn command was run with the parameters -F F -e 1e-10 -E -1 -V 500 -b 500. With these settings, an assembled result is very likely to have at least one hit on the reference, even if it is a bad assembly. We then programmatically filtered out the alignments that are obviously out of scope. After filtering, we annotated insertions and deletions for all valid alignments. In our analysis, we define a deletion as a break where more than 10 bp of the reference genome are not matched by the assembled sequence while the regions upstream and downstream of the break are fully matched. We define an insertion as a break where more than 10 bp of the assembled sequence are not matched by the reference genome while the regions upstream and downstream of the break are fully matched.

In the alignments, we also found that some segments of an assembled sequence are fully matched with the reference genome whereas the remaining parts are not. We therefore define a highly diverged region as a region where more than 20 bp of both the assembled sequence and the reference genome are not aligned to each other (Table 4.12). In the experiments on the length distribution of insertions and deletions, we found that over 7 kb of reference sequence was lost, whereas over 5 kb of new sequence appears in the assembled sequences (Table 4.11). Such a result is expected, as transposable elements tend to have variants. Similar findings were presented in previous experiments on assembling Arabidopsis thaliana genomes [138].
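The three variant definitions above can be illustrated with a minimal sketch. It assumes that an alignment has already been reduced to a sorted list of fully matched blocks; the `Block` tuple, the function name `classify_breaks` and the label strings are hypothetical and are not part of Blastn or of the DTA implementation.

```python
from typing import List, Tuple

# Hypothetical representation of one fully matched block of an alignment:
# (ref_start, ref_end, asm_start, asm_end), half-open coordinates.
Block = Tuple[int, int, int, int]

MIN_INDEL = 10  # "more than 10 bp" for insertions and deletions
MIN_HDR = 20    # "more than 20 bp" for highly diverged regions


def classify_breaks(blocks: List[Block]) -> List[Tuple[str, int, int]]:
    """Label every break between two consecutive matched blocks.

    Returns (kind, unmatched reference bp, unmatched assembled bp) per break.
    """
    calls = []
    for left, right in zip(blocks, blocks[1:]):
        ref_gap = right[0] - left[1]  # reference bases skipped by the assembly
        asm_gap = right[2] - left[3]  # assembled bases absent from the reference
        if ref_gap > MIN_HDR and asm_gap > MIN_HDR:
            kind = "highly_diverged"
        elif ref_gap > MIN_INDEL and asm_gap == 0:
            kind = "deletion"
        elif asm_gap > MIN_INDEL and ref_gap == 0:
            kind = "insertion"
        else:
            kind = "other"  # break does not meet any of the three definitions
        calls.append((kind, ref_gap, asm_gap))
    return calls
```

Breaks that satisfy none of the three definitions are labelled "other" in this sketch; how such breaks were treated in the actual analysis is not stated in the text.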

Table 4.11: Deletion and insertion of different length in chromosome 14.

                       Deletion              Insertion
Variant length (bp)    No.   Length (bp)*    No.   Length (bp)*
1                      709   709             687   687
2                      199   398             203   406
3-4                    168   566             161   544
5-8                    118   738             115   717
9-16                   77    875             72    810
17-32                  38    833             26    558
33-64                  15    608             13    530
65-128                 8     525             8     598
129-256                6     818             4     435
>256                   6     1,482           3     454

* Cumulative length of all variants of the class in that row

Table 4.12: Highly diverged regions of different length in chromosome 14.

Highly diverged regions (HDR)
Variant length (bp)    No.   Length (bp)*
20-38                  4     96
39-76                  9     361
77-152                 17    952
153-304                18    3216
304-606                11    4562
607-1212               5     5623
>1213                  1     2564

* Cumulative length of all variants of the class in that row

In particular, we investigated the frequency of indels whose length is a multiple of 3 base pairs. The aim is to verify whether such indels occur more frequently than indels of other lengths [139]. We aligned all gene sequences of chromosome 14 to our assembled results. Then, we retrieved the fraction of indels of each length for coding regions, intergenic regions and complete coding sequences respectively. Indels of 1 base pair occur most frequently (Figure 4.24). It is also worth noting that indels of 3 base pairs occur more often than indels of the other lengths. Thus, our results are consistent with indels occurring preferentially in multiples of 3 base pairs.

Figure 4.24: Frequency of indel lengths and variation of coding sequence lengths in chromosome 14.

[Plot: fraction (y-axis, 0 to 0.5) against length in bp (x-axis, 1 to 20) for complete coding sequences, indels (coding regions), and indels (intergenic regions).]

4.4 Summary

This chapter proposed an innovative method of assembling DNA transposons (DTA). We studied the structure of the DNA transposon and recognized that its structural feature can be utilized for assembly. Two novel algorithms are proposed: the DNA transposon feature finding algorithm and the extension algorithm. The feature finding algorithm is based on the dot matrix. We use the dot matrix to check pairwise alignments, from which we select the seeds that have the head and tail structures of a DNA transposon. The extension algorithm is based on a sequence matching score computed according to NGS read quality. We extend the head seed using the best matched NGS read until it reaches the tail seed.

We compared our DTA method with three existing methods, i.e. Velvet, ABySS2 and SOAPdenovo. The comparison showed that our DTA method outperformed the other three methods not only on the Human datasets but also on the Maize datasets. In addition, we investigated the distribution of the assembled results. Moreover, we discussed the variation in our assembled results, which showed that transposable elements tend to have variations.

Chapter 5

5 Innovative assembly method of retrotransposon

5.1 Introduction

In Chapter 4, we classified transposable elements into two classes: DNA transposons and retrotransposons. In this chapter, we propose a method for retrotransposon assembly (RA). There are two types of retrotransposon, i.e. the LTR type and the non-LTR type, and their structures differ. Both types have two target site duplications, one at each end of the sequence. An LTR type retrotransposon additionally has two long terminal repeats in the sequence, whereas a non-LTR type retrotransposon has none.

Based on these structural features, we propose two methods to find pairs of seeds, one for each type of retrotransposon. A pair of seeds consists of the sequences which represent the head part and the end part of a retrotransposon. To assemble a retrotransposon, we need a method to fill up the gap between a pair of seeds. The gap is a sequence that we can find in the de Bruijn graph generated from NGS reads. Recently, simple path generation was proposed to optimize the de Bruijn graph, and our method of filling the gap between seeds is based on the simple paths generated from the de Bruijn graph.

The rest of the chapter is organized as follows. First, we define the two types of retrotransposon from the perspective of their structural features. Second, we propose two methods of finding pairs of seeds of a retrotransposon and one method of filling the gap between seeds. Third, we evaluate our method by comparing it to state-of-the-art methods on a benchmark dataset, and we also test our method by assembling nucleolus-associated domains (NAD). Finally, we conclude this chapter.

5.2 Method

In this section, we present our innovative algorithm for retrotransposon assembly. The algorithm uses NGS reads to generate simple paths, which are obtained from the de Bruijn graph. Unlike other de Bruijn graph based methods, we also study the structural features of the retrotransposon, i.e. the LTR type and the non-LTR type (Figure 4.1). We use these structural features to select pairs of NGS reads as seeds which represent both ends of the retrotransposon. A pair of seeds has one head seed, which is the start of a retrotransposon, and one tail seed, which is the end of a retrotransposon. We then map the seeds onto the simple paths and fill up the gap between the head seed and the tail seed using a recursive method. Our retrotransposon assembly algorithm consists of three methods. We first present the method of simple path generation, then the method of seed selection, and lastly the recursive method of gap filling.

5.2.1 Overview and basic terms

We formally define the basic terms and give an overview of the algorithm.

Definition 5.1 (Target site duplication, TSD). A target site duplication is a sequence of nucleotides $n_1 n_2 \ldots n_l$ from 5’ to 3’, where each nucleotide $n_i \in \{A, T, C, G\}$. A target site duplication always comes as a pair of identical sequences. The length of a target site duplication ranges from 5 base pairs to 20 base pairs.

Definition 5.2 (Long terminal repeat, LTR). A long terminal repeat is a sequence of nucleotides $n_1 n_2 \ldots n_l$ from 5’ to 3’, where each nucleotide $n_i \in \{A, T, C, G\}$. A long terminal repeat appears only in the LTR type of retrotransposon. Compared to a TSD, an LTR can be hundreds of base pairs long.

Definition 5.3 (PolyA). A PolyA is a sequence of nucleotides $n_1 n_2 \ldots n_l$ from 5’ to 3’, where each nucleotide $n_i = A$. If the length of a PolyA is 3, then the sequence of the PolyA is AAA.

Definition 5.4 (LTR type retrotransposon feature). A sequence of nucleotides has an LTR type retrotransposon feature if it has two parts, where the part at 5’ has the structure TSD + LTR and the other part at 3’ has the structure LTR + TSD. Here + is the concatenation operator: if $a = a_1 \ldots a_m$ and $b = b_1 \ldots b_n$, then $a + b = a_1 \ldots a_m b_1 \ldots b_n$.

Definition 5.5 (Non-LTR type retrotransposon feature). A sequence of nucleotides has a non-LTR type retrotransposon feature if it has two parts, where the part at 5’ has the structure TSD and the other part at 3’ has the structure PolyA + TSD. Here + is the concatenation operator: if $a = a_1 \ldots a_m$ and $b = b_1 \ldots b_n$, then $a + b = a_1 \ldots a_m b_1 \ldots b_n$.

The retrotransposon assembly method has two phases (Figure 5.1). Phase I uses NGS reads to generate the de Bruijn graph and also selects pairs of seeds using the feature finding algorithm. Phase II uses the de Bruijn graph to generate simple paths, on which each pair of seeds can find the right path to fill up the gap. Specifically, the method takes 3 steps to finish the assembly.

1. A de Bruijn graph is generated from the NGS reads. Simple paths are then obtained from the de Bruijn graph.

2. The feature finding algorithm scans the NGS reads to select pairs of seeds which have either the LTR type retrotransposon feature or the non-LTR type retrotransposon feature. There are two cases in which we select two NGS reads as a pair of seeds. The first case is that one read has the structure TSD + LTR and the other read has the structure LTR + TSD. The second case is that one read has the structure TSD and the other read has the structure PolyA + TSD.

3. Pairs of seeds are applied to the simple paths, where a recursive method is used to find a path between the head seed and the tail seed. If a valid path is found, we output the path as the assembled sequence.

It can be seen from these steps that our retrotransposon assembly method relies on three methods, i.e. the simple path generation method, the retrotransposon feature finding algorithm, and a gap filling algorithm that connects the head seed to the tail seed.

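[Diagram: in Phase I, NGS data pass through de Bruijn graph generation to give simple paths, and through the feature finding algorithm (LTR feature, non-LTR feature) to give seeds; in Phase II, the gap filling algorithm combines simple paths and seeds to produce the retrotransposon.]

Figure 5.1: Overview of retrotransposon assembly method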

5.2.2 De Bruijn graph generation

NGS reads overlap with other reads because of deep sequencing. For example, the tail of read A (ATGCCGGAA) overlaps with the head of read B (CCGGAAGAC). Sequence assemblers use this information to merge short reads into long sequences. From the data perspective, however, the same sequence appearing in different reads is redundant information that wastes memory. The de Bruijn graph was introduced to bound memory usage. In a de Bruijn graph $G$, a vertex represents a $k$-mer, a sequence of $k$ letters, and an edge is created when two $k$-mers overlap by $k-1$ letters. In DNA sequence assembly, we denote by $\Sigma = \{A, C, G, T\}$ the set of all possible nucleotides in sequence reads. For a sequence $s$, $|s|$ denotes its length and $s[i]$ is the nucleotide at position $i$. We denote by $\bar{s}$ the reverse-complement sequence of $s$, where the complement is defined as

$$\bar{A} = T, \quad \bar{T} = A, \quad \bar{C} = G, \quad \bar{G} = C.$$

So we have

$$\bar{s} = \overline{s[|s|]} \; \overline{s[|s|-1]} \cdots \overline{s[1]}.$$

For a sequence $s$ and a sequence $t$, $s \cdot t$ denotes their concatenation. A de Bruijn graph of DNA sequences is denoted as $G_k = (V, E)$, where $k$ is an integer controlling the length of the $k$-mers. The graph contains the vertices $V$, which is the set of all possible $k$-mers, and the edges $E$, which is the set of pairs $(u, v) \in V \times V$ such that $u[2 \ldots k] = v[1 \ldots k-1]$, where $s[i \ldots j]$ is the substring of $s$ from position $i$ to $j$.

Figure 5.2: De Bruijn graph with k = 24 of sequence ATCGAACGTTTTACCAGCGATTCA.

Therefore, for a particular Illumina paired-end read, two sequences, the read itself and its reverse complement, are used to construct the de Bruijn graph. As a simple demonstration, the sequence read ATCGAACGTTTTACCAGCGATTCA is turned into the de Bruijn graph shown in Figure 5.2.

So far, we have introduced the definition of the de Bruijn graph of DNA sequences. A vertex of the graph represents a k-mer obtained from the sequence reads. Because of the high coverage during the sequencing process, a k-mer can be contained in many sequence reads. In particular, the frequency of a k-mer from a repetitive region will be higher than the average. However, the frequency of some k-mers is lower than the average. This can be caused by low coverage during sequencing or by erroneous base calling. To generate a quality de Bruijn graph, we exclude the k-mers whose frequency is lower than a cutoff and keep only those which are qualified. Given a family of reads $R$, we denote by $f(u)$ the frequency of a k-mer $u$, namely the number of times it occurs in $R$. For a cutoff $c$, a k-mer $u$ is qualified to generate the de Bruijn graph when $f(u) \geq c$.
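The construction described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, k-mers are counted on both the read and its reverse complement, and a k-mer is kept when its count is at least the cutoff.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every nucleotide, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(reads: Iterable[str], k: int) -> Counter:
    """Count each k-mer over every read and its reverse complement."""
    counts: Counter = Counter()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            for i in range(len(strand) - k + 1):
                counts[strand[i:i + k]] += 1
    return counts


def qualified_kmers(counts: Dict[str, int], cutoff: int) -> Set[str]:
    """Keep the k-mers whose frequency reaches the cutoff (assumed >=)."""
    return {kmer for kmer, n in counts.items() if n >= cutoff}


def de_bruijn_edges(kmers: Set[str]) -> List[Tuple[str, str]]:
    """Edge (u, v) exists when the (k-1)-suffix of u is the (k-1)-prefix of v."""
    by_prefix: Dict[str, List[str]] = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:-1], []).append(kmer)
    edges = []
    for u in kmers:
        for v in by_prefix.get(u[1:], []):
            edges.append((u, v))
    return sorted(edges)
```

Indexing the k-mers by their (k-1)-prefix avoids comparing every pair of vertices when the edges are built.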

5.2.3 Simple path algorithm

Now we have a quality set of k-mers ready to generate the de Bruijn graph. We next consider how to simplify the structure of the de Bruijn graph. Specifically, we want to decrease the number of vertices in the graph. Given a set $V$ of k-mers, we write $u \to v$ if the $(k-1)$-suffix of $u$ equals the $(k-1)$-prefix of $v$. Two k-mers $u, v \in V$ are compactable if $u \to v$ and, for every $w \in V$, $w \to v$ implies $w = u$ and $u \to w$ implies $w = v$. The compaction process is conducted on a pair of compactable k-mers. If $u$ and $v$ are compactable, compaction replaces them by $w = u \cdot v[k \ldots |v|]$, where $\cdot$ is the string concatenation operator and $v[i]$ is the $i$-th letter of $v$. To execute the compaction process, we use the concept of the minimizer, which was first introduced by [140]. The $\ell$-minimizer of a k-mer $u$ is a substring of $u$ of length $\ell$, chosen as the smallest such substring under a fixed ordering. We define $Lmin(u)$ to be the $\ell$-minimizer of the $(k-1)$-prefix of $u$ and $Rmin(u)$ to be the $\ell$-minimizer of the $(k-1)$-suffix of $u$. For two k-mers $u$ and $v$, if $u \to v$, then $Rmin(u) = Lmin(v)$. Two compactable k-mers $u$ and $v$ are $m$-compactable if $m = Rmin(u) = Lmin(v)$. Given a set of k-mers and the parameter $\ell$, we first obtain all $\ell$-minimizers and their frequencies. Based on the frequency, all $\ell$-minimizers are sorted in ascending order. Afterwards, from the least frequent $\ell$-minimizer to the most frequent, every k-mer is assigned to an associated $\ell$-minimizer. Each $\ell$-minimizer $m$ has a container that holds its associated k-mers and compacted sequences. The compaction process starts with the k-mers associated with the least frequent $\ell$-minimizer. Following a greedy process, the k-mers within the container of $m$ are $m$-compacted and the results are temporarily put in a buffer. If a sequence in the buffer is $m'$-compactable for a more frequent $\ell$-minimizer $m'$, it is allocated to the container of $m'$ for further compaction. Finally, the compaction process stops after all sequences in the container of the most frequent $\ell$-minimizer have been compacted.

Table 5.1: Simple path generation method.

[Listing: 36-line pseudocode of the simple path generation method.]

In Figure 5.3, we break down the process of simple path generation. In phase A, we have four 3-mers, {GGA, GAG, AGT, AGG}. In phase B, we obtain all 2-minimizers from the 3-mers and sort them in ascending order of frequency. In phase C, we start the compaction process from the container associated with GT, which is the least frequent 2-minimizer. Since AGT is the only 3-mer in this container, it is put directly in the buffer and then allocated to the container which holds the next higher 2-compactable 3-mers. Following the order, the 3-mers in the container of GG, {GGA, AGG}, are compacted into AGGA, which is then allocated to the next container. In the end, all sequences in the last container are compacted and output as the result, {AGT, AGGAG}.

[Diagram with three panels. A: the de Bruijn graph of the 3-mers AGT, GGA, AGG and GAG. B: the 2-minimizers GT, GG, GA and AG with frequencies 1, 2, 2 and 3. C: compaction steps 1 to 4 over the containers of GT, GG, GA and AG, ending with AGT and AGGAG.]

Figure 5.3: Demonstration of simple path generation process.
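The effect of compaction can be reproduced with a short sketch. Note the substitution: instead of the minimizer-ordered containers described above, this sketch compacts by directly walking the graph, which yields the same simple paths on a small input but does not bound memory in the same way. The function name `simple_paths` is hypothetical, and reverse complements are ignored.

```python
from typing import Dict, List, Set


def simple_paths(kmers: Set[str]) -> List[str]:
    """Merge every maximal chain of compactable k-mers into one sequence."""
    if not kmers:
        return []
    k1 = len(next(iter(kmers))) - 1  # overlap length k-1
    succ: Dict[str, List[str]] = {u: [] for u in kmers}
    pred: Dict[str, List[str]] = {u: [] for u in kmers}
    by_prefix: Dict[str, List[str]] = {}
    for v in kmers:
        by_prefix.setdefault(v[:k1], []).append(v)
    for u in kmers:
        for v in by_prefix.get(u[-k1:], []):
            succ[u].append(v)
            pred[v].append(u)

    def compactable(u: str, v: str) -> bool:
        # v is the only successor of u and u is the only predecessor of v
        return u != v and succ[u] == [v] and pred[v] == [u]

    def extend(start: str):
        seq, used, cur = start, {start}, start
        while len(succ[cur]) == 1:
            nxt = succ[cur][0]
            if nxt in used or not compactable(cur, nxt):
                break
            seq += nxt[k1:]  # append only the non-overlapping letters
            used.add(nxt)
            cur = nxt
        return seq, used

    paths: List[str] = []
    seen: Set[str] = set()
    # Pass 1: start at k-mers that cannot be merged into a predecessor.
    for u in sorted(kmers):
        if len(pred[u]) == 1 and compactable(pred[u][0], u):
            continue
        seq, used = extend(u)
        paths.append(seq)
        seen |= used
    # Pass 2: any k-mer left over lies on an isolated cycle.
    for u in sorted(kmers):
        if u not in seen:
            seq, used = extend(u)
            paths.append(seq)
            seen |= used
    return sorted(paths)
```

On the four 3-mers of Figure 5.3, GAG has two successors (AGT and AGG), so the chain AGG, GGA, GAG is merged into AGGAG while AGT stays on its own.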

5.2.4 Select LTR feature seeds

We use the structural feature of the LTR type retrotransposon to select LTR seeds, i.e. the TSD + LTR structure and the LTR + TSD structure. Every two NGS reads are plotted in a dot matrix, which was introduced in Chapter 4. If one read has the structure TSD + LTR and the other read has the structure LTR + TSD, they are selected as a pair of seeds, where the read with the TSD + LTR structure is the head seed and the read with the LTR + TSD structure is the tail seed.

For a pairwise alignment in a dot matrix, if one read has the structure TSD + LTR and the other read has the structure LTR + TSD, two regions that share significant similarity will appear (Figure 5.4). Visually, these two regions appear as two lines parallel to the main diagonal. Ideally, assuming no mismatch and no indel, the two plotted parallel lines share one vertical bisector. The target site duplication is the head and tail of a retrotransposon, and it appears as a plotted line parallel to the main diagonal in the dot matrix. The longer a line is, the more matches the two reads have. A mismatch shows as a blank spot on the line, while an indel dislocates the line by moving it up or down one position. If the plotted line is long enough to be a TSD, and the mismatches and indels are under the thresholds, we go further to check whether there is another line parallel to the TSD line.

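  T T C G A G C A A T C C T A C
A . . . . • . . • • . . . . • .
A . . . . • . . • • . . . . • .
T • • . . . . . . . • . . • . .
C . . • . . . • . . . • • . . •
T • • . . . . . . . • . . • . .
C . . • . . . • . . . • • . . •
G . . . • . • . . . . . . . . .
A . . . . • . . • • . . . . • .
G . . . • . • . . . . . . . . .
C . . • . . . • . . . • • . . •
A . . . . • . . • • . . . . • .
G . . . • . • . . . . . . . . .
C . . • . . . • . . . • • . . •
G . . . • . • . . . . . . . . .
T • • . . . . . . . • . . • . .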

Figure 5.4: Dot matrix of two sequences which have LTR features.

We present the method of finding the LTR feature in a dot matrix in Table 5.2. In the demonstration, we assume that the read with the TSD + LTR structure is written in the leftmost column of the dot matrix, while the read with the LTR + TSD structure is written in the top row. The algorithm starts by scanning the dot matrix to find a plotted line which represents the TSD. If the thresholds of match, mismatch and indel for the TSD are satisfied, we go further and continue scanning for another plotted line from the row after the one where the TSD ends. If the thresholds of match, mismatch and indel for the LTR are satisfied, we select these two reads as a pair of seeds.

Table 5.2: Long terminal repeat feature finding method

[Listing: 39-line pseudocode of the long terminal repeat feature finding method.]
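As an illustration of the scan described above, the following sketch looks for two exact diagonal runs in the dot matrix of a head read (leftmost column) and a tail read (top row). It is a simplification under stated assumptions: mismatches and indels are not tolerated, the default thresholds 8 and 24 are the values chosen later in the parameter setting, and the function names are hypothetical.

```python
from typing import List, Tuple

Run = Tuple[int, int, int]  # (row, col, length) of an exact diagonal run


def diagonal_runs(left: str, top: str, min_len: int) -> List[Run]:
    """Maximal runs of dots parallel to the main diagonal of the dot matrix."""
    runs = []
    for r in range(len(left)):
        for c in range(len(top)):
            if left[r] != top[c]:
                continue
            # Skip cells that continue a run started further up-left.
            if r > 0 and c > 0 and left[r - 1] == top[c - 1]:
                continue
            n = 0
            while (r + n < len(left) and c + n < len(top)
                   and left[r + n] == top[c + n]):
                n += 1
            if n >= min_len:
                runs.append((r, c, n))
    return runs


def has_ltr_feature(head: str, tail: str, t_tsd: int = 8, t_ltr: int = 24) -> bool:
    """True if a TSD line is followed, in later rows, by an LTR line."""
    tsd_runs = diagonal_runs(head, tail, t_tsd)
    ltr_runs = diagonal_runs(head, tail, t_ltr)
    for t_row, t_col, t_len in tsd_runs:
        for l_row, l_col, l_len in ltr_runs:
            starts_after_tsd = l_row >= t_row + t_len   # head is TSD + LTR
            ends_before_tsd = l_col + l_len <= t_col    # tail is LTR + TSD
            if starts_after_tsd and ends_before_tsd:
                return True
    return False
```

For example, with a hypothetical 8 bp TSD and a 24 bp LTR fragment, `has_ltr_feature(tsd + ltr, ltr + tsd)` holds, while a pair of reads that shares only the TSD is rejected.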

5.2.5 Select Non-LTR feature seeds

There is no long terminal repeat in the non-LTR type retrotransposon, which is why this type of retrotransposon is referred to as non-LTR. At the 3’ end of a non-LTR retrotransposon, there is a PolyA track in front of the TSD. For a pairwise alignment in a dot matrix, if one read has the structure TSD and the other read has the structure PolyA + TSD, one region that shares significant similarity will appear. Since the TSD runs from 5’ to 3’, this region appears as a line parallel to the main diagonal. Visually, this plotted line starts near a PolyA track (Figure 5.5).

To find a non-LTR feature in a dot matrix, we first need a method to find the PolyA track. We present a simple repeat finding method in Table 5.3. Simple repeat finding takes a DNA sequence and outputs the length and start location of each PolyA track. The method implements a greedy algorithm. Specifically, given a DNA sequence, for example Chromosome 1 of the Human Genome, the simple repeat finding method traverses it one nucleotide at a time, from the first nucleotide to the last. Whenever there is a PolyA track whose length is longer than a threshold, it outputs the start location and length.
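A single left-to-right pass of this kind can be sketched as below. The function name is hypothetical, positions are 0-based, and a track is reported when its length is at least `min_length`, which is an assumption about how the threshold is applied.

```python
from typing import List, Optional, Tuple


def find_polya_tracks(seq: str, min_length: int) -> List[Tuple[int, int]]:
    """Return (start, length) of every run of A that reaches min_length."""
    tracks = []
    start: Optional[int] = None  # start of the run of A currently open
    for i, base in enumerate(seq):
        if base == "A":
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                tracks.append((start, i - start))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(seq) - start >= min_length:
        tracks.append((start, len(seq) - start))
    return tracks
```

Applied to the read CGTCAAAATGCGCAC with a minimum length of 3, the sketch reports one track starting at position 4 with length 4.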

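  C G T C A A A A T G C G C A C
G . • . . . . . . . • . • . . .
T . . • . . . . . • . . . . . .
G . • . . . . . . . • . • . . .
C • . . • . . . . . . • . • . •
G . • . . . . . . . • . • . . .
C • . . • . . . . . . • . • . •
A . . . . • • • • . . . . . • .
C • . . • . . . . . . • . • . •
T . . • . . . . . • . . . . . .
A . . . . • • • • . . . . . • .
T . . • . . . . . • . . . . . .
G . • . . . . . . . • . • . . .
C • . . • . . . . . . • . • . •
G . • . . . . . . . • . • . . .
T . . • . . . . . • . . . . . .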

Figure 5.5: Dot matrix of two sequences which have Non-LTR features.

For the non-LTR type retrotransposon, we are only interested in the pairs of seeds which contain a PolyA track. So for a pairwise alignment in a dot matrix, we use the simple repeat finding method to locate the position of the PolyA track. If a PolyA track is found, we scan the nearby region to find a plotted line which is parallel to the main diagonal. We present the non-LTR feature finding method in Table 5.4. In the demonstration, we assume that the read with the PolyA + TSD structure is written in the top row and the read with the TSD structure is written in the leftmost column. We use the last column of the PolyA track to locate the beginning of the TSD. If a long plotted line is found and the thresholds of match, mismatch and indel for the TSD are satisfied, we select these two reads as a pair of seeds.

Table 5.3: Simple repeat finding method

[Listing: 19-line pseudocode of the simple repeat finding method.]

Table 5.4: Non – LTR feature finding method

[Listing: 27-line pseudocode of the non-LTR feature finding method.]
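The non-LTR check can be sketched in the same spirit: locate a PolyA track in the top read, then look for a diagonal run that begins in the column right after the track. This is an illustrative simplification with exact matching only; the function name is hypothetical, and the defaults t = 9 and l = 5 are the values chosen later in the parameter setting. A TSD that itself begins with A would be absorbed into the PolyA match, which the sketch does not handle.

```python
import re


def has_non_ltr_feature(left: str, top: str, t: int = 9, l: int = 5) -> bool:
    """left: read with the TSD structure; top: read with the PolyA + TSD structure."""
    for track in re.finditer("A{%d,}" % l, top):
        col = track.end()  # first column after the last column of the PolyA track
        for row in range(len(left)):
            n = 0
            while (row + n < len(left) and col + n < len(top)
                   and left[row + n] == top[col + n]):
                n += 1
            if n >= t:  # plotted line long enough to be a TSD
                return True
    return False
```

Anchoring the scan at the end of the PolyA track keeps the search to one column per track instead of the whole matrix.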

5.2.6 Gap filling method

Using the LTR feature finding method and the non-LTR feature finding method, we obtain pairs of seeds which are the heads and tails of retrotransposons. To fill up the gap between the head seed and the tail seed, we apply a recursive algorithm to find a valid path on the simple paths generated from the de Bruijn graph. The idea of assembling a retrotransposon is presented in Figure 5.6. The head part and tail part of the retrotransposon are identified by the seeds, and the middle regions of the retrotransposon are represented by simple paths. For example, for a non-LTR type retrotransposon, we have a pair of seeds identified by the feature finding method, one being the head seed and the other the tail seed. The middle regions of the non-LTR type retrotransposon are the 5’ UTR, ORF1, ORF2 and 3’ UTR, which are simple paths generated from the de Bruijn graph. We simultaneously traverse the first level of right nodes of the head seed and the first level of left nodes of the tail seed. If two of the nodes found are cascaded, the method stops traversing and merges the path as a retrotransposon; otherwise the method continues on the second level of nodes, and so on, until a valid path is found. The method is presented in Table 5.5.

[Diagram: an LTR retrotransposon (TSD, LTR, Gag, Prt, Pol, Env, LTR, TSD) and a non-LTR retrotransposon (TSD, 5' UTR, ORF 1, ORF 2, 3' UTR, Poly-A, TSD), each mapped onto simple path sequences labelled A, B, C and A', B', C'.]

Figure 5.6: Demonstration of contig generation of retrotransposon


Table 5.5: Contig generation method

[Listing: 18-line pseudocode of the contig generation method.]
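A path search of this kind can be sketched as follows. Note the substitution: the text describes a recursive traversal from both the head seed and the tail seed, whereas this sketch runs a single breadth-first search from the head seed, which visits the same levels of right nodes but only from one end. It also assumes that two sequences are linked when they share a fixed-length suffix-prefix overlap; the function name and the `max_depth` limit are hypothetical.

```python
from collections import deque
from typing import List, Optional


def fill_gap(head: str, tail: str, paths: List[str],
             overlap: int, max_depth: int = 10) -> Optional[str]:
    """Search level by level from the head seed until the tail seed is reached."""
    nodes = [head, tail] + [p for p in paths if p not in (head, tail)]

    def right_nodes(suffix: str) -> List[str]:
        # Nodes whose prefix equals the suffix of the current node.
        return [n for n in nodes if n[:overlap] == suffix]

    queue = deque([(head, head, 0)])  # (current node, merged sequence, level)
    visited = {head}
    while queue:
        node, merged, depth = queue.popleft()
        if node == tail:
            return merged  # valid path found: output the merged sequence
        if depth == max_depth:
            continue
        for nxt in right_nodes(node[-overlap:]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, merged + nxt[overlap:], depth + 1))
    return None  # no valid path between the two seeds
```

Because the search proceeds level by level, the first path that reaches the tail seed uses the fewest simple paths.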

5.3 Result

We evaluate our retrotransposon assembly method on a benchmark dataset, Human chromosome 14. We constructed 8 datasets of different sizes from the original benchmark dataset. These 8 datasets contain the NGS reads which overlap retrotransposons by no less than 20 base pairs. These reads carry the information of retrotransposons and are actually useful for assembly. At increasing scale, the datasets contain the NGS reads related to 10, 20, 30, 40, 50, 60, 70, and 100 retrotransposons.

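Suppose the retrotransposon treated as unknown is $r$. Let $N_{ref}$ be the number of retrotransposons originally on the reference genome and $N_{asm}$ be the number of assembled sequences for $r$. Let $N_{ovl}$ be the overlap between the original reference set and the assembled set. Then precision and recall can be defined as:

$$\mathrm{precision} = \frac{N_{ovl}}{N_{asm}}, \qquad \mathrm{recall} = \frac{N_{ovl}}{N_{ref}}$$

Based on the precision and recall, we can define the f-value as:

$$\text{f-value} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

For $n$ sequences to be assembled, the precision, recall and f-value are the arithmetic means over the sequences:

$$\mathrm{precision} = \frac{1}{n} \sum_{j=1}^{n} \mathrm{precision}_j, \qquad \mathrm{recall} = \frac{1}{n} \sum_{j=1}^{n} \mathrm{recall}_j, \qquad \text{f-value} = \frac{1}{n} \sum_{j=1}^{n} \text{f-value}_j$$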
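The evaluation measures can be computed with a few lines of code. The sketch below assumes the usual definitions, precision as overlap over assembled, recall as overlap over reference, and the f-value as their harmonic mean, and it averages the per-sequence scores arithmetically; the function names are hypothetical.

```python
from typing import List, Tuple


def precision_recall_f(n_overlap: int, n_assembled: int,
                       n_reference: int) -> Tuple[float, float, float]:
    """Scores for one retrotransposon; zero denominators give a score of 0."""
    precision = n_overlap / n_assembled if n_assembled else 0.0
    recall = n_overlap / n_reference if n_reference else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value


def mean_scores(counts: List[Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Arithmetic mean of precision, recall and f-value over n sequences."""
    scores = [precision_recall_f(*c) for c in counts]
    n = len(scores)
    return (sum(s[0] for s in scores) / n,
            sum(s[1] for s in scores) / n,
            sum(s[2] for s in scores) / n)
```

Averaging the per-sequence f-values, rather than computing one f-value from the averaged precision and recall, follows the statement that all three measures are arithmetic means.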

5.3.1 Parameter setting

For the LTR type retrotransposon, the assembly method uses six parameters: the threshold of matched nucleotides in the TSD, the threshold of matched nucleotides in the LTR, the number of mismatches allowed in the TSD, the number of mismatches allowed in the LTR, the number of indels allowed in the TSD, and the number of indels allowed in the LTR. In our experiments, we tested these six parameters using a dataset which contains the NGS reads related to 50 LTR type retrotransposons. We test one parameter at a time while holding the rest fixed as control variables (Table 5.6 - Table 5.11). We use the f-value to measure the performance as the parameter changes, and keep the value of the parameter at which the f-value peaks (Figure 5.7 - 5.12). In this case, the thresholds of matched nucleotides are set to 8 for the TSD and 24 for the LTR, the numbers of mismatches allowed are set to 0 for the TSD and 2 for the LTR, and the numbers of indels allowed are set to 0 for the TSD and 1 for the LTR.


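Table 5.6: Parameter settings for testing the threshold of matched nucleotides in TSD.

Matched nucleotides in TSD   4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Matched nucleotides in LTR   24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
Mismatches allowed in TSD    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mismatches allowed in LTR    2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Indels allowed in TSD        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed in LTR        1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Figure 5.7: Performance of retrotransposon assembly method on different thresholds of matched nucleotides in TSD.

[Plot: f-value (y-axis, 0 to 0.7) against threshold of matched nucleotides in TSD (x-axis, 4 to 20).]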


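Table 5.7: Parameter settings for testing the threshold of matched nucleotides in LTR.

Matched nucleotides in TSD   8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
Matched nucleotides in LTR   15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Mismatches allowed in TSD    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mismatches allowed in LTR    2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Indels allowed in TSD        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed in LTR        1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Figure 5.8: Performance of retrotransposon assembly method on different thresholds of matched nucleotides in LTR.

[Plot: f-value (y-axis, 0.575 to 0.63) against threshold of matched nucleotides in LTR (x-axis, 15 to 30).]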


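Table 5.8: Parameter settings for testing the number of mismatches allowed in TSD.

Matched nucleotides in TSD   8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
Matched nucleotides in LTR   24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
Mismatches allowed in TSD    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Mismatches allowed in LTR    2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Indels allowed in TSD        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed in LTR        1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Figure 5.9: Performance of retrotransposon assembly method on different numbers of mismatches allowed in TSD.

[Plot: f-value (y-axis, 0 to 0.7) against number of mismatches allowed in TSD (x-axis, 0 to 15).]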


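Table 5.9: Parameter settings for testing the number of mismatches allowed in LTR.

Matched nucleotides in TSD   8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
Matched nucleotides in LTR   24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
Mismatches allowed in TSD    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Mismatches allowed in LTR    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Indels allowed in TSD        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed in LTR        1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Figure 5.10: Performance of retrotransposon assembly method on different numbers of mismatches allowed in LTR.

[Plot: f-value (y-axis, 0 to 0.7) against number of mismatches allowed in LTR (x-axis, 0 to 15).]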


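Table 5.10: Parameter settings for testing the number of indels allowed in TSD.

Matched nucleotides in TSD   8 8 8 8 8 8 8 8 8 8 8
Matched nucleotides in LTR   24 24 24 24 24 24 24 24 24 24 24
Mismatches allowed in TSD    0 0 0 0 0 0 0 0 0 0 0
Mismatches allowed in LTR    2 2 2 2 2 2 2 2 2 2 2
Indels allowed in TSD        0 1 2 3 4 5 6 7 8 9 10
Indels allowed in LTR        1 1 1 1 1 1 1 1 1 1 1

Table 5.11: Parameter settings for testing the number of indels allowed in LTR.

Matched nucleotides in TSD   8 8 8 8 8 8 8 8 8 8 8
Matched nucleotides in LTR   24 24 24 24 24 24 24 24 24 24 24
Mismatches allowed in TSD    0 0 0 0 0 0 0 0 0 0 0
Mismatches allowed in LTR    2 2 2 2 2 2 2 2 2 2 2
Indels allowed in TSD        0 0 0 0 0 0 0 0 0 0 0
Indels allowed in LTR        0 1 2 3 4 5 6 7 8 9 10

Figure 5.11: Performance of retrotransposon assembly method on different numbers of indels allowed in TSD.

[Plot: f-value (y-axis, 0 to 0.7) against number of indels allowed in TSD (x-axis, 0 to 10).]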




For the non-LTR type retrotransposon, the assembly method uses four parameters: $t$, the threshold of matched nucleotides in the TSD; $l$, the length threshold of the PolyA track; $m$, the number of mismatches allowed in the TSD; and $i$, the number of indels allowed in the TSD. We tested $t$, $l$, $m$ and $i$ using a dataset which contains the NGS reads related to 50 non-LTR retrotransposons. We test one parameter at a time while holding the rest fixed as control variables (Table 5.12 - Table 5.15). We use the f-value to measure the performance as the parameter changes, and keep the value of the parameter at which the f-value peaks (Figure 5.13 - 5.16). In this case, $t$ is set to 9, $l$ is set to 5, $m$ is set to 0 and $i$ is set to 0.


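Figure 5.12: Performance of retrotransposon assembly method on different numbers of indels allowed in LTR.

[Plot: f-value (y-axis, 0 to 0.7) against number of indels allowed in LTR (x-axis, 0 to 10).]

Table 5.12: Parameter settings for testing the threshold of matched nucleotides, t.

Matched nucleotides in TSD, t   3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Length of PolyA track, l        5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed, m           0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed, i               0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0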





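Table 5.13: Parameter settings for testing the length of the PolyA track, l.

Matched nucleotides in TSD, t   9 9 9 9 9 9 9 9 9 9 9 9 9
Length of PolyA track, l        2 3 4 5 6 7 8 9 10 11 12 14 15
Mismatches allowed, m           0 0 0 0 0 0 0 0 0 0 0 0 0
Indels allowed, i               0 0 0 0 0 0 0 0 0 0 0 0 0

Figure 5.13: Performance of retrotransposon assembly method on different thresholds of matched nucleotides, t.

[Plot: f-value (y-axis, 0 to 0.9) against threshold of matched nucleotides, t (x-axis, 3 to 20).]

Figure 5.14: Performance of retrotransposon assembly method on different lengths of PolyA track, l.

[Plot: f-value (y-axis, 0 to 0.9) against length of PolyA track, l (x-axis, 2 to 15).]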




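Table 5.14: Parameter settings for testing the number of mismatches allowed, m.

Matched nucleotides in TSD, t   9 9 9 9 9 9 9 9 9 9 9
Length of PolyA track, l        5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed, m           0 1 2 3 4 5 6 7 8 9 10
Indels allowed, i               0 0 0 0 0 0 0 0 0 0 0

Table 5.15: Parameter settings for testing the number of indels allowed, i.

Matched nucleotides in TSD, t   6 6 6 6 6 6 6 6 6 6 6
Length of PolyA track, l        5 5 5 5 5 5 5 5 5 5 5
Mismatches allowed, m           0 0 0 0 0 0 0 0 0 0 0
Indels allowed, i               0 1 2 3 4 5 6 7 8 9 10

Figure 5.15: Performance of retrotransposon assembly method on different numbers of mismatches allowed, m.

[Plot: f-value (y-axis, 0 to 0.9) against number of mismatches allowed, m (x-axis, 0 to 10).]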





SOAPdenovo

We compared the assembly performance of our RA method with three existing assembly methods that also use NGS reads but do not target retrotransposon assembly: SOAPdenovo, Velvet and ABySS2. All three assemble DNA sequences with a de Bruijn graph. In comparison, our RA method is based on simple paths generated from the de Bruijn graph.
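To make the notion of simple paths concrete, the sketch below builds a small de Bruijn graph and spells its maximal non-branching paths. It is an illustration under simplifying assumptions, not the RA implementation: reverse complements, sequencing errors and k-mer coverage are ignored, isolated cycles are skipped, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = {}
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
            graph.setdefault(kmer[1:], set())  # keep sink nodes as keys
    return graph


def simple_paths(graph: Dict[str, Set[str]]) -> List[str]:
    """Spell every maximal non-branching path as a sequence string."""
    indegree = {node: 0 for node in graph}
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1

    def is_internal(node: str) -> bool:
        # Exactly one way in and one way out: the path cannot branch here.
        return indegree[node] == 1 and len(graph[node]) == 1

    paths = []
    for node in graph:
        if is_internal(node):
            continue  # paths start only at branching, source or sink nodes
        for nxt in graph[node]:
            sequence = node + nxt[-1]
            while is_internal(nxt):
                nxt = next(iter(graph[nxt]))
                sequence += nxt[-1]
            paths.append(sequence)
    return sorted(paths)


reads = ["ACGTT", "ACGTA"]
paths = simple_paths(build_de_bruijn(reads, k=3))
print(paths)  # ['ACGT', 'GTA', 'GTT']
```

The two toy reads share the prefix ACGT and then diverge, so the graph yields one shared simple path and two short branches.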

None of the three existing methods uses the structural features of retrotransposons, whereas our RA method exploits the structural features of two types of retrotransposon, i.e. LTR type retrotransposon and non-LTR type retrotransposon. SOAPdenovo and Velvet use different strategies to obtain consensus contig sequences from the de Bruijn graph. ABySS2 adopts a parallel computing strategy to assemble sequences based on the de Bruijn graph. These three existing methods are genome-wide assemblers that apply the same assembly strategy to any type of sequence.

[Figure: f-value (y-axis, 0 to 0.9) against number of indels allowed, i (x-axis, 0 to 10)]

The comparisons of these four methods in terms of precision, recall and f-value over eight datasets are shown in Figure 5.17, Figure 5.18 and Figure 5.19 respectively. The RA method is superior to the other three methods across all datasets. The large gap in precision, recall and f-value indicates that a method dedicated to retrotransposons and using their structural features (i.e. the RA method) outperforms methods that treat retrotransposons as common sequence and ignore those features (i.e. ABySS2, Velvet, and SOAPdenovo).
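For reference, the three measures can be computed as below. This sketch assumes the standard definitions (the f-value taken as the harmonic mean of precision and recall); how true positives, false positives and false negatives are counted for an assembly is not restated here, so the counts in the example are hypothetical.

```python
from typing import Tuple


def precision_recall_f(true_positive: int, false_positive: int,
                       false_negative: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f-value) from raw counts."""
    predicted = true_positive + false_positive
    actual = true_positive + false_negative
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 8 correct results, 2 spurious, 4 missed.
precision, recall, f_value = precision_recall_f(8, 2, 4)
print(round(precision, 3), round(recall, 3), round(f_value, 3))
```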

Retrotransposons follow a “copy-and-paste” mechanism, which means that one retrotransposon can have many copies across the genome. To take a detailed look at our RA method, we assembled one LTR type retrotransposon, i.e. PABL_A, and two non-LTR type retrotransposons, i.e. AluSx and L1PB. We mapped the assembly results to the reference sequence of Human chromosome 14 (Figure 5.20-5.22). Every arc in the figures represents a mapping between one location on the reference and one assembled result from the RA method. As the mappings in Figure 5.20-5.22 show, for both LTR type and non-LTR type retrotransposons, our RA method is able to assemble different copies that are distributed across the chromosome.

Figure 5.17: Precision comparison of the RA, ABySS2, Velvet, and SOAPdenovo methods on Human chromosome 14.

Figure 5.18: Recall comparison of the RA, ABySS2, Velvet, and SOAPdenovo methods

Figure 5.19: F-value comparison of the RA, ABySS2, Velvet, and SOAPdenovo methods

Figure 5.20: Assembly result of PABL_A (LTR type retrotransposon) mapping to reference.

Figure 5.21: Assembly result of AluSx (SINE, non-LTR type retrotransposon) mapping to reference.

Figure 5.22: Assembly result of L1PB (LINE, non-LTR type retrotransposon) mapping to reference.

5.3.3 Assembling nucleolus-associated domains using the RA method

To test the performance of our RA method on a large dataset, i.e. genome-wide assembly, we ran our method on a nucleolus-associated domain (NAD) dataset that contains NGS reads from across the Human genome. NADs were previously shown to be rich in repetitive sequences. The sequences of these domains have not been determined, and existing methods are not suitable for assembling them because they perform poorly on repetitive sequences. Here, we report the sequences of NADs in the Human genome obtained with our RA method. The NADs were assembled and then mapped against the Human genome, and we present the assembled sequences together with our analysis of the assembled result.

Figure 5.23: Simple paths of nucleolus-associated domains.
[Plot: number of simple paths (y-axis "Amount", 0 to 4000000) against match rate against GRCh38 (x-axis, 75% to 100%). Bar labels as extracted: 3685520, 446919, 51120, 8906, 3341, 1339, 674, 471, 272, 230, 182, 102, 63, 53, 41, 38, 16, 16, 16, 12, 11, 9, 7, 9, 2]

In our experiments on assembling NADs, we use 27-mers to generate the de Bruijn graph from which we obtain simple paths. We map every simple path longer than 101 base pairs to the Human reference genome, i.e. GRCh38. With the default settings of blastn, the 4216652 simple paths produce 4607768 hits in total against GRCh38. 4199369 simple paths have at least one hit with GRCh38, and among the 4607768 hits, 4403006 have a match rate of more than 97% with the reference genome. However, 17283 simple paths have no hit with the reference. Figure 5.23 shows the number of simple paths at each match rate against the reference genome. The majority of the simple paths have a very high match rate, which suggests that the assembled simple paths are of high quality.
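The counts in this paragraph can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the text and the bar labels printed on Figure 5.23. That the bar labels count simple paths with at least one hit is an assumption, which the sum supports.

```python
total_paths = 4216652     # simple paths longer than 101 base pairs
paths_with_hit = 4199369  # simple paths with at least one hit on GRCh38
total_hits = 4607768      # all blastn hits
hits_above_97 = 4403006   # hits with a match rate above 97%

# Bar labels read off Figure 5.23 (order as printed on the figure).
bar_labels = [3685520, 446919, 51120, 8906, 3341, 1339, 674, 471, 272, 230,
              182, 102, 63, 53, 41, 38, 16, 16, 16, 12, 11, 9, 7, 9, 2]

paths_without_hit = total_paths - paths_with_hit
share_above_97 = hits_above_97 / total_hits

print(paths_without_hit)                  # 17283, the figure quoted in the text
print(sum(bar_labels) == paths_with_hit)  # True
print(f"{share_above_97:.1%} of hits exceed a 97% match rate")
```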

Then we applied our retrotransposon assembly method to the simple paths and assembled 203637 sequences that have more than one hit with the reference genome. These assembled sequences have a match rate of more than 97% against the Human genome at multiple locations. Specifically, 156986 sequences have more than 50 hits and 9216 sequences have more than 200 hits. In Table 5.16, for each chromosome of the Human genome, we list the longest NAD assembled, the total length of NADs and the percentage of the chromosome covered by NADs. We found that about one third of the Human genome is associated with NADs.


Table 5.16: Statistics of assembled NAD in Human genome

Chromosome   Longest NAD (bp)   NAD bases (bp)   NAD / Whole (%)
1            3173               90588466         36.3873
2            2285               76462033         31.5706
3            1874               64788486         32.6727
4            1373               49320570         25.9289
5            4251               69184119         38.1099
6            1998               54897016         32.14
7            1641               59016522         37.0367
8            3515               54017426         37.2178
9            2782               45878913         33.1508
10           1841               50006735         37.375
11           1474               44011019         32.5799
12           2485               52561713         39.4384
13           2320               33925276         29.6642
14           6991               30119144         28.1372
15           3677               39021421         38.2596
16           2163               33018295         36.5496
17           3561               29536927         35.4766
18           1611               28772173         35.7982
19           1472               23012482         39.2586
20           1529               24784924         38.4595
21           3257               20144623         43.127
22           1627               15168924         29.8492
X            4035               52184043         33.4425
Y            4035               5514692          9.6365
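The "about one third" figure can be reproduced from Table 5.16 alone. In the sketch below, each chromosome length is back-computed from the table's own NAD bases and percentage columns rather than taken from a GRCh38 release, so the result is approximate; the variable and function names are illustrative.

```python
# (chromosome, NAD bases in bp, NAD / whole chromosome in %), from Table 5.16
TABLE_5_16 = [
    ("1", 90588466, 36.3873), ("2", 76462033, 31.5706),
    ("3", 64788486, 32.6727), ("4", 49320570, 25.9289),
    ("5", 69184119, 38.1099), ("6", 54897016, 32.14),
    ("7", 59016522, 37.0367), ("8", 54017426, 37.2178),
    ("9", 45878913, 33.1508), ("10", 50006735, 37.375),
    ("11", 44011019, 32.5799), ("12", 52561713, 39.4384),
    ("13", 33925276, 29.6642), ("14", 30119144, 28.1372),
    ("15", 39021421, 38.2596), ("16", 33018295, 36.5496),
    ("17", 29536927, 35.4766), ("18", 28772173, 35.7982),
    ("19", 23012482, 39.2586), ("20", 24784924, 38.4595),
    ("21", 20144623, 43.127), ("22", 15168924, 29.8492),
    ("X", 52184043, 33.4425), ("Y", 5514692, 9.6365),
]


def genome_wide_nad_fraction(rows) -> float:
    """Total NAD bases divided by total (back-computed) chromosome length."""
    nad_total = sum(bases for _, bases, _ in rows)
    whole_total = sum(bases / (percent / 100.0) for _, bases, percent in rows)
    return nad_total / whole_total


fraction = genome_wide_nad_fraction(TABLE_5_16)
print(f"NAD bases cover about {fraction:.1%} of the genome")
```

Weighting by chromosome length matters here: a plain average of the percentage column would give chromosome Y the same weight as chromosome 1.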

We investigated which genes are involved in NADs and how these genes are distributed over the 24 chromosomes. We aligned all assembled sequences to the Human genome. For every hit with a match rate of more than 97%, we record the start and end locations. If the region covers a gene sequence, we highlight it with a blue horizontal line in the chromosome ideogram (Figure 5.24). Figure 5.24 shows all genes that are related to NADs. These genes are widely distributed across 23 chromosomes, but only 7 related genes were found in chromosome Y. The number of genes in NADs for each chromosome is listed in Table 5.17.
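The gene lookup described above amounts to an interval test per hit. The sketch below is one possible reading, in which a hit "covers" a gene if their coordinate ranges overlap on the same chromosome; the data structures, names and toy coordinates are all hypothetical, and a real analysis over millions of hits would use an interval index rather than a nested loop.

```python
from typing import Dict, List, NamedTuple, Set


class Hit(NamedTuple):
    chrom: str
    start: int         # inclusive
    end: int           # inclusive
    match_rate: float  # percent identity of the alignment


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def genes_in_nad(hits: List[Hit], genes: List[Gene],
                 min_rate: float = 97.0) -> Dict[str, Set[str]]:
    """Map chromosome -> names of genes overlapped by a confident hit."""
    found: Dict[str, Set[str]] = {}
    for hit in hits:
        if hit.match_rate <= min_rate:
            continue  # keep only hits above the match-rate threshold
        for gene in genes:
            same_chrom = gene.chrom == hit.chrom
            overlaps = hit.start <= gene.end and gene.start <= hit.end
            if same_chrom and overlaps:
                found.setdefault(hit.chrom, set()).add(gene.name)
    return found


# Toy coordinates, invented for illustration only.
genes = [Gene("geneA", "chr1", 100, 500), Gene("geneB", "chr1", 900, 1200),
         Gene("geneC", "chrY", 50, 80)]
hits = [Hit("chr1", 450, 700, 99.1), Hit("chr1", 1000, 1100, 96.0),
        Hit("chrY", 10, 60, 98.5)]

per_chrom = genes_in_nad(hits, genes)
counts = {chrom: len(names) for chrom, names in per_chrom.items()}
print(counts)
```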

To learn what functions NADs play in the Human genome, we annotated these related genes using Gene Ontology (GO), an annotation of gene function. Table 5.18 lists the 15 most significant genes that have functions only in NADs. For example, PARK7 has 34 functions that act only in NADs. Our analysis showed that NAD genes take part in specific biological processes and molecular functions (Table 5.19). For example, translation repressor activity is related only to NADs. However, some gene functions are not performed in NADs (Table 5.20). For example, the vitamin D metabolic process functions only outside of NADs.
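Separating functions that occur only inside NADs from those that occur only outside is a set difference over GO annotations. The sketch below assumes that the frequency of a term is the number of genes annotated with it, which is a guess at how the frequency column is defined; the gene names and term assignments are toy data.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def exclusive_terms(inside: Dict[str, Set[str]],
                    outside: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """GO terms found only on `inside` genes, with the number of genes each."""
    outside_terms = set().union(*outside.values())
    counts = Counter(term
                     for terms in inside.values()
                     for term in terms
                     if term not in outside_terms)
    # Most frequent first; ties broken by GO identifier for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy annotations, invented for illustration only.
nad_genes = {"geneA": {"GO:0030371", "GO:0006412"}, "geneB": {"GO:0030371"}}
other_genes = {"geneC": {"GO:0006412", "GO:0042359"}}

only_in_nad = exclusive_terms(nad_genes, other_genes)
only_outside_nad = exclusive_terms(other_genes, nad_genes)
print(only_in_nad)
print(only_outside_nad)
```

Calling the same function with the arguments swapped gives the complementary list, which mirrors the pairing of the two GO tables.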

Table 5.17: No. of genes in NAD

Chromosome   No. of genes in NAD
Chr1         1049
Chr2         712
Chr3         615
Chr4         372
Chr5         556
Chr6         505
Chr7         531
Chr8         370
Chr9         402
Chr10        447
Chr11        589
Chr12        558
Chr13        211
Chr14        334
Chr15        398
Chr16        460
Chr17        535
Chr18        172
Chr19        694
ChrX         403
ChrY         7


Figure 5.24: NAD gene distribution.


Table 5.18: Genes that have functions only in NAD.

Gene Name   No. of Function In NAD   No. of Total Function
PARK7       34                       120
CTNNB1      16                       191
GATA3       16                       118
AQP1        15                       92
AKR1C3      14                       62
LRRK2       12                       120
XCL1        12                       42
ADIPOQ      11                       87
SIRT1       11                       143
ITGAV       11                       68
SERPINA5    11                       31
SOX9        11                       118
PAOX        11                       20
PTK2B       10                       124
MSH2        10                       58

Table 5.19: 10 gene functions only associated with NAD

Gene Ontology   Frequency  Description                                                              Category
GO:0052697      9          xenobiotic glucuronidation                                               biological_process
GO:0030371      9          translation repressor activity                                           molecular_function
GO:1904224      8          negative regulation of glucuronosyltransferase activity                  biological_process
GO:2001030      8          negative regulation of cellular glucuronidation                          biological_process
GO:0004908      7          interleukin-1 receptor activity                                          molecular_function
GO:0060100      7          positive regulation of phagocytosis, engulfment                          biological_process
GO:0004703      7          G-protein coupled receptor kinase activity                               molecular_function
GO:0010529      7          negative regulation of transposition                                     biological_process
GO:1901018      6          positive regulation of potassium ion transmembrane transporter activity  biological_process
GO:0034351      6          negative regulation of glial cell apoptotic process                      biological_process

Table 5.20: 10 gene functions not associated with NAD

Gene Ontology   Frequency  Description                                             Category
GO:0042359      10         vitamin D metabolic process                             biological_process
GO:0050795      9          regulation of behavior                                  biological_process
GO:0072321      8          chaperone-mediated protein transport                    biological_process
GO:0003073      7          regulation of systemic arterial blood pressure          biological_process
GO:0009629      7          response to gravity                                     biological_process
GO:0046785      7          microtubule polymerization                              biological_process
GO:0031987      7          locomotion involved in locomotory behavior              biological_process
GO:0008064      7          regulation of actin polymerization or depolymerization  biological_process
GO:0006561      6          proline biosynthetic process                            biological_process
GO:0030816      6          positive regulation of cAMP metabolic process           biological_process

5.4 Summary

This chapter proposed an innovative method for assembling retrotransposons (RA). Using dot matrices, we studied the structural features of both types of retrotransposon, i.e. LTR type retrotransposon and non-LTR type retrotransposon. Three novel algorithms were proposed: an LTR feature finding algorithm, a non-LTR feature finding algorithm and a recursive gap filling algorithm. We used the two feature finding algorithms to find pairs of seeds of a retrotransposon, and then applied the recursive gap filling algorithm to assemble the retrotransposon.

We compared our RA method with three existing methods, i.e. Velvet, ABySS2 and SOAPdenovo. The comparison showed that our RA method outperformed the other three methods on a benchmark dataset. We also used our RA method to assemble nucleolus-associated domains from a genome-wide dataset. Downstream analysis of the assembled sequences indicates that we have successfully assembled 203637 retrotransposons.


Chapter 6

6 Conclusion and Future work

This chapter reviews the whole thesis and summarizes its main contributions. Furthermore, we discuss potential research topics for future work.

6.1 Contributions

This thesis proposes methods for assembling repetitive DNA sequences using next generation sequencing reads. Theoretical and experimental results have led to the following conclusions and main contributions:

Repeats play crucial roles in genome evolution and are believed to be associated with diseases. Successfully assembling these regions will help geneticists conduct downstream analysis, and this thesis addresses that requirement. We reviewed existing methods of sequence assembly in chapter 2. Previous methods focus on genome-wide assembly, while our proposed methods focus on assembling repetitive DNA sequences. In this thesis, we proposed three methods to assemble three different types of repetitive DNA sequences, i.e. tandem repetitive DNA sequence, DNA transposon and retrotransposon.

We proposed a novel method to assemble tandem repetitive sequences. Eukaryotic genomes contain high volumes of intronic and intergenic regions in which repetitive sequences are abundant. These repetitive sequences pose challenges for the genomic assignment of short reads generated through next generation sequencing and are often excluded from analysis, losing invaluable genomic information. In chapter 3 we presented a method, known as TRA, for the assembly of tandem repetitive sequences using paired-end reads. Previous methods do not assemble tandem repetitive sequences correctly because their strategies are unsuitable. By selecting all reads related to a tandem repeat, our proposed method can dynamically assemble the sequence until a static result is obtained.

We studied the classification of transposable elements. There are two types of transposable elements, i.e. DNA transposon and retrotransposon. We proposed a novel method to assemble DNA transposons in chapter 4. We investigated the structure of DNA transposons and used structural features to select pairs of seeds that represent the head and tail of a DNA transposon. We also proposed a read-quality-based method which extends the head seed to the tail seed. Performance comparison on a benchmark dataset shows that our DTA method is superior to three existing methods at assembling DNA transposons.

We further proposed a novel method to assemble retrotransposons in chapter 5. There are two types of retrotransposon, i.e. LTR type retrotransposon and non-LTR type retrotransposon. We studied the structures of both types. For each type of retrotransposon, we have methods to select pairs of seeds that represent the head seed and tail seed of a retrotransposon. We then use a recursive algorithm to find a path between the head seed and the tail seed among the simple paths generated from the de Bruijn graph. We tested our RA method on a genome-wide dataset which contains NGS reads of nucleolus-associated domains. The assembled results reveal that our RA method is able to assemble repeat-rich regions.


6.2 Future works

Although the proposed methods have had some success in assembling repetitive DNA sequences, several problems remain to be addressed.

De novo assembly of tandem repeats: Our proposed method of assembling tandem repetitive DNA sequences is based on a repeat pattern that was previously revealed. In other words, we need a reference sequence to start the assembly process. This suits species that have already been assembled and is well suited to resequencing projects. However, for sequencing projects on new species that have no reference, we still need a method to assemble tandem repetitive sequences.

Further investigation of non-LTR type retrotransposons: non-LTR type retrotransposons can be further classified into short interspersed nuclear elements (SINE) and long interspersed nuclear elements (LINE). In terms of length, SINEs are 50-500 base pairs long and LINEs are about 7000 base pairs long. In this thesis, we have not answered the question of what assembly strategy should be applied to SINEs and LINEs. Our work in chapter 5 is still at a very early stage.

Combining repeat assembly modules with whole genome assembly: Whole genome assembly methods fail at assembling repetitive DNA sequences, while our proposed methods target different types of repetitive DNA sequences. The remaining problem is how to combine the two into a model that is able to assemble whole genome sequences with good performance on repeat regions.

Downstream analysis of assembled repetitive DNA sequences: In this thesis, we attempted several studies on the assembled results, for example variation validation in chapter 3 and chapter 4. However, these analyses are still at a very early stage.


Bibliography

[1] Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-

terminating inhibitors, Proceedings of the National Academy of Sciences of the

USA, 74, 12, 5463-5467

[2] Collins, F.S., Morgan, M. and Patrinos, A. (2003) The Human Genome Project:

lessons from large-scale biology, Science, 300, 5617, 286-290

[3] Campagna, D., Romualdi, C., Vitulo, N., Favero, M.D., Lexa, M., Cannata, N.

and Valle, G. (2005) RAP: a new computer program for de novo identification

of repeated sequences in whole genomes, Bioinformatics, 21, 5, 582-588

[4] Feschotte, C., Jiang, N. and Wessler, S.R. (2002) Plant transposable elements:

where genetics meets genomics, Nature Review Genetics, 3, 329-341

[5] Treangen, T.J. and Salzberg, S.L. (2012) Repetitive DNA and next-generation

sequencing: computational challenges and solutions, Nature Review Genetics,

13, 36-46

[6] Lefebvre, A., Lecroq, T., Dauchel, H. and Alexandre, J. (2003) FORRepeats:

detects repeats on entire chromosomes and between genomes, Bioinformatics,

19, 3, 319-326

[7] Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S.,

Goldman, M., Barber, G.P., Clawson, H., Coelho, A., Hillman-Jackson, J., Hsu,

F., Kirkup, V., Kuhn, R.M., Learned, K., Li, C.H., Meyer, L.R., Pohl, A.,

Raney, B.J., Rosenbloom, K.R., Smith, K.E., Haussler, D. and Kent, W.J.

133

(2011) The UCSC Genome Browser database: update 2011, Nucleic Acids Re-

search, 39: D876-D882

[8] Alkan, C., Sajjadian, S. and Eichler, E.E. (2011) Limitation of next-generation

genome sequence assembly, Nature Methods, 8, 1, 61-65

[9] Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-

terminating inhibitors, Proceedings of the National Academy of Sciences of the

USA, 74, 12, 5463-5467

[10] Anderson, S. (1981) Shotgun DNA sequencing using cloned DNase l-generated

fragments, Nucleic Acids Research, 9 (13): 3015-27

[11] Altshuler, D., Pollara, V., Cowles, C., Van, E.W., Baldwin, J., Linton, L. and

Lander, E. (2000), An SNP map of the human genome generated by reduced

representation shotgun sequencing, Nature, 407, 513-516

[12] Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos,

M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., Kerlavage, A.R,

McCombie, W.R. and Venter, J.C. (1991) Complementary DNA sequencing:

expressed sequence tags and human genome project. Science, 252, 1651-1656

[13] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben,

L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L.,

Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Ir-

zyk, G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.B.,

Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J.,

Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., Mckenna, M.P., Myers,

E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T.,

Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., To-

masz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P.,

134

Yu, P., Begley, R.F. and Rothberg J.M. (2005) Genome sequencing in mir-

crofabricated high-density politer reactors, Nature, 437, 376-380

[14] Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J.,

Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R. Boutell, J.M.,

Bryant, J., Carter, R.J., Cheetham, R.K., Cox, A.J., Ellis, D.J., Flatbush, M.R.,

Gormley, N.A., Humphray, S.J., Irving, L.J., Karbelashvili, M.S., Kirk, S.M.,

Li, H., Liu, X., Maisinger, K.S., Murray, L.J., Obradovic, B., Ost, T., Parkinson,

M.L., Pratt, M.R., Rasolonjatovo, I.M.J., Reed, M.T., Rigatti, R., Rodighiero,

C., Ross, M.T., Sabot, A., Sankar, S.V., Scally, A., Schroth, G.P., Smith, M.E.,

Smith, V.P., Spiridu, A., Torrance, P.E., Tzonev, S.S., Vermaas, E.H., Walter,

K., Wu, X., Zhang, L., Alam, M.D., Anastasi, C., Aniebo, I.C. Bailey, D.M.D.,

Bancarz, I.R., Banerjee, S., Barbour, S.G., Baybayan, P.A., Benoit, V.A., Ben-

son, K.F., Bevis, C., Black, P.J., Boodhun, A., Brennan, J.S., Bridgham, J.A.,

Brown, R.C., Brown, A.A., Buermann, D.H., Bundu, A.A., Burrows, J.C.,

Carter, N.P., Castillo, N., Catenazzi, M.C.E., Chang, S., Cooley, R.N., Crake,

N.R., Dada, O.O., Diakoumakos, K.D., Dominguez-Fernandez, B., Earnshaw,

D.J., Egbujor, U.C., Elmore, D.W., Etchin, S.S., Ewan, M.R., Fedurco, M., Fra-

ser, L.J., Fajardo, K.V.F., Furey, W.S., George, D., Gietzen, K.J., Goddard,

C.P., Golda, G.S., Granieri, P.A., Green, D.E., Gustafson, D.L., Hansen, N.F.,

Harnish, K., Haudenschild, C.D., Heyer, N.I., Hims, M.M., Ho, J.T., Horgan,

A.M., Hoschler, K., Hurwitz, S., Ivanov, D.V., Johnson, M.Q., James, T., Jones,

T.A.H., Kang, G.D., Kerelska, T.H., Kersey, A.D., Khrebtukova, I., Kindwall,

A.P., Kingsbury, Z., Kokko-Gonzales, P.I., Kumar, A., Laurent, M.A., Lawley,

C.T., Lee, S.E., Lee, X., Liao, A.K., Loch, J.A., Lok, M., Luo, S., Mammen,

R.M., Martin, J.W., McCauley, P.G., McNitt, P., Mehta, P., Moon, K.W., Mul-

lens, J.W., Newington, T., Ning, Z., Ng, B.L., Novo, S.M., O’Neill, M.J., Os-

borne, M.A., Osnowski, A., Ostadan, O., Paraschos, L.L., Pickering, L., Pike,

A.C., Pike, A.C., Pinkard, D.C., Pliskin, D.P., Podhasky, J., Quijano, V.J.,

Raczy, C., Rae, V.H., Rawlings, S.R., Rodriguez, A.C., Roe, P.M., Rogers, J.,

Bacigalupo, M.C.R., Romanov, N., Romieu, A., Roth, R.K., Rourke, N.J., Rue-

diger, S.T., Rusman, E., Sanches-Kuiper, R.M., Schenker, M.R., Seoane, J.M.,

135

Shaw, R.J., Shiver, M.K., Short, S.W., Sizto, N.L., Sluis, J.P., Smith, M.A.,

Sohna, J.E.S., Spence, E.J., Stevens, K., Sutton, N., Szajkowski, L., Tregidgo,

C.L. Turcatti, G., VandeVondele, S., Verhovsky, Y., Virk, S.M., Wakelin, S.,

Walcott, G.C., Wang, J., Worsley, G.J., Yan, J., Yau, L., Zuerlein, M., Rogers,

J., Mullikin, J.C., Hurles, M.E., McCooke, N.J., West, J.S., Oaks, F.L.,

Lundberg, P.L., Klenerman, D., Durbin, R. and Smith A.J. (2008) Accurate

whole human genome sequencing using reversible terminator chemistry, Na-

ture, 456(7218):53-59

[15] McKernan, K.J., Peckham, H.E., Costa, G.L., McLaughlin, S.F., Fu, Y., Tsung,

E.F., Clouser, C.R., Duncan, C., Ichikawa, J.K., Lee, C.C., Zhang, Z., Ranade,

S.S., Dimalanta, E.T., Hyland, F.C., Sokolsky, T.D., Zhang, L., Sheridan, A.,

Fu, H., Hendrickson, C.L., Li, B., Kotler, L., Stuart, J.R., Malek, J.A., Maning,

J.M., Antipova, A.A., Perez, D.S., Moore, M.P., Hayashibara, K.C., Lyons,

M.R., Beaudoin, R.E., Coleman, B.E., Laptewicz, M.W., Sannicandro, A.E.,

Rhodes, M.D., Gottimukkala, R.K., Yang, S., Bafna, V., Bashir, A., MacBride,

A., Alkan, C., Kidd, J.M., Eichler, E.E., Reese, M.G., Vega, F.M.D.L and

Blanchard, A.P. (2009) Sequence and structural variation in a human genome

uncovered by short-read, massively parallel ligation sequencing using two-

based encoding, Genome Research, 19, 1527-1541

[16] Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, j., Mileski, W., Davey, M.,

Leamon, J.H., Johnson, k., Milgrew, M.J., Edwards, M. Hoon, J., Simons, J.F.,

Marran, D., Myers, J.W., Davidson, J.F., Branting, A., Nobile, J.R., Puc, B.P.,

Light, D., Clark, T.A., Huber, M., Branciforte, J.T., Stoner, I.B., Cawley, S.E.,

Lysons, M., Fu, Y., Homer, N., Sedova, M., Miao, X., Reed, B., Sabina, J.,

Feierstein, E., Schorn, M., Alanjary, M., Dimalanta, E., Dressman, D., Kasin-

skas, R., Sokolsky, T., Fidanza, J.A., Namsaraev, E., McKernan, K.J., Wil-

liamns, A., Roth, G.T. and Bustillo, J. (2011) An integrated semiconductor de-

vice enabling non-optical genome sequencing, Nature, 475(7356): 348-352

136

[17] Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso. P., Rank, D.,

Baybayan, P., Bettman, B., Bibillo, A., Bjornson, K., Chaudhuri, B., Christians,

F., Cicero, R., Clark, S., Dalal, R., DeWinter, A., Dixon, J., Foquet, M.,

Gaertner, A., hardenbol, P., Heiner, C., Hester, K., Holden, D., Kearns, X.K.,

Kuse, R., Lacroix, Y., Lin, S., Lundquist, P., Ma, C., Marks, P., Maxham, M.,

Murphy, D., Park, I., Pham, T., Phillips, M., Roy, J., Sebra, R., Shen, G.,

Sorenson, J., Tomaney, A., Travers, K., Trulson, M., Vieceli, J., Wegener, J.,

Wu, D., Yang, A., Zaccarin, D., Zhao, P., Zhong, F., Korlach, J. and Turner, S.

(2009) Real-time DNA sequencing from single polymerase molecules, Science,

323(5910): 133-138

[18] Harismendy, O., Ng, P.C., Strausberg, R.L., Wang, X., Stock, T.B., Beeson,

K.Y., Schork, N.J., Murray, S.S., Topol, E.J., Levy, S. and Frazer, K.A. (2009)

Evaluation of next generation sequencing platforms for population targeted se-

quencing studies, Genome Biology, 10: R 32

[19] Bonetta, L. (2006) Genome sequencing in the fast lane, Nature Method, 3, 2,

141-147

[20] Weigel, D. and Mott, R. (2009) The 1001 genomes project for Arabidopsis tha-

liana, Genome Biology, 10, 107

[21] The 1000 Genomes Project Consortium, A map of human genome variation

from population-scale sequencing, Nature, 467, 1061-1073

[22] Genome 10K Community of Scientists, Genome 10K: a proposal to obtain

whole-genome sequencer for 10,000 vertebrate species, Journal of Heredity,

100, 659-674

[23] Staden, R. (1979) A strategy of DNA sequencing employing computer pro-

grams, Nucleic Acids Research, 6(7): 2601-10

137

[24] Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L. and Law, M.

(2012) Comparison of Next-Generation Sequencing Systems, Journal of Bio-

medicine and Biotechnology, 251364

[25] Berka, J., Chen, Y.J., Leamon, J.H., Lefkowitz, S., Lohman, K.L., Makhijani,

V.B., Rothberg, J.M., Sarkis, G.J., Srinivasan, M. and Weiner, M.P. (2005)

Bead emulsion nucleic acid amplification, U.S. Patent Application

[26] Foehlich, T., Heindl, D. and Roesler, A. (2010) High-throughput nucleic acid

analysis, U.S. Patent

[27] Ansorge, W.J. (2009) Next-generation DNA sequencing techniques, New Bio-

technology, 25, 4, 195-203

[28] Mardis, E.R. (2008) the impact of next-generation sequencing technology on

genetics, Trends in Genetics, 24, 3, 133-141

[29] Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R.,

Bertoni, A., Swerdlow, H.P. and Gu, Y. (2012) A tale of three next generation

sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illu-

mina MiSeq sequencers, BMC Genomics, 13, 341

[30] Bubnoff, A.V. (2008) Next-Generation Sequencing: The Race Is On, Cell, 132:

721-723

[31] Illumina (2010) Genomic Sequencing, Pub. No. 770-2008-016

[32] Kit, S. (1961) Equilibrium sedimentation in density gradients of DNA prepara-

tion from animal tissues, Journal of Molecular Biology, 3, 711-716

[33] McClintock, B. (1965) Components of action of the regulators Spm and Ac,

Camegie Inst. Wash. Yearb. 64, 527-534

138

[34] Britten, R.J. and Kohne, D.E. (1968) Repeated sequences in DNA, Science, 161,

529-540

[35] La Spada, A.R., Wilson, E.M., Lubahn, D.B., Harding, A.E. and Fischbeck,

K.H. (1991) Androgen receptor gene mutations in X-linked spinal and bulbar

muscular atrophy, Nature, 352, 77-79

[36] Bennetzen, J.L. (2000) Transposable element contributions to plant gene and

genome evolution, Plant Molecular Biology, 42, 251-269

[37] Dooner, H.K. and Weil, C.F. (2007) Give-and-take: interactions between DNA

transposons and their host plant genomes, Curr. Opin. Genet. Dev., 17, 486-492

[38] Lippman, Z., Gendrel, A.V., Black, M., Vaughn, M.W., Dedhia, N., McCombie,

W.R., Lavine, K., Mittal, V., May, B., Kasschau, K.D., Carrignton, J.C., Do-

erge, R.W., Colot, V. and Martienssen, R. (2004) Role of transposable elements

in heterochromatin and epigenetic control, Nature, 430, 471-476 Ouyang,S.

Buell,CR. (2004) The TIGR Plant Repeat Databases: a collective resource for

the identification of repetitive sequences in plants, Nucleic Acids Research, 32,

1: D360-D363

[39] Kumar, P., Chaitanya, P.S. and Nagarajaram, H.A. (2011) PSSRdb: a relational

database of polymorphic simple sequence repeats extracted from prokaryotic

genomes, Nucleic Acids Research, 39: D601-D605

[40] Subramanian, S., Madgula, V.M., George, R., Kumar, S., Pandit, M.W. and

Singh, L. (2003) SSRD: Simple Sequence Repeats Database of the Human Ge-

nome, Comparative and Functional Genomics, 4, 3, 342-345

[41] Gelfand, Y., Rodriguez, A. and Benson, G. (2007) TRDB: The Tandem Repeats

Database, Nucleic Acids Research, 35: D80-D87

139

[42] Nussbaumer, T., Martis, M.M., Roessner, S.K., Pfeifer, M., Bader, K.C., Shar-

ma, S., Gundlach, H. and Spannagl, M. (2013) MIPS PlantsDB: a database

framework for comparative plant genome research, Nucleic Acids Research, 41:

D1144-D1151

[43] Keller, B., Wicker, T. and Matthews, D.E. (2003) TREP: a database for Tritice-

ae repetitive elements, Trends in plant science, 7, 561-562

[44] Macas, J., Meszaros, T. and Nouzova, M. (2002) PlantSat: a specialized data-

base for plant satellite repeats, Bioinformatics, 18: 28-35

[45] Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O. and

Walichiewicz, J. (2005) Repbase Update, a database of eukaryotic repetitive

elements. Cytogenet Genome Res, 110, 462-467.

[46] Smit et al. www.repeatmasker.org

[47] Li, R., Ye, J., Li, S., Wang, J., Han, Y, Ye, C., Wang, J., Yang, H., Yu, J.,

Wong, G.K.S. and Wang, J. (2005) ReAS: recovery of ancestral sequences for

transposable elements from the unassembled reads of a whole genome shotgun.

PLos Computational Biology, doi: 10.1007/s12042-007-9007-5

[48] Price, A.L., Jones, N.C. and Pevzner, P.A. (2005) De novo identification of re-

peat families in large genomes, Bioinformatics, 21, i351-i358

[49] Surya, S., Bridges, S., Magbanua, Z.V. and Peterson, D.G. (2008) Empirical

comparison of ab initio repeat finding programs, Nucleic Acids Research, 36, 7,

2284-2294

140

[50] Sura, S., Bridges, S., Magbanua, Z.V. and Peterson, D.G. (2008) Computational

Approaches and Tools Used in Identification of Dispersed Repetitive DNA Se-

quence, Tropical Plant Biology, 1: 85-96

[51] Bergman, C.M. and Quesneville, H. (2007) Discovering and detecting transpos-

able elements in genome sequences, Briefings in Bioinformatics, 8(6): 382-392

[52] Benson, D.A. et al. (2013) GenBank, Nucleic Acids Research, 41: D36-D42

[53] Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X.,

Liu, B., Yang, B. and Fan, W. (2011) Comparison of the two major classes of

assembly algorithms: overlap-layout-consensus and de-bruijn-graph, Briefings

In Functional Genomics, 11,1, 25-37

[54] Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S.,

Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., Marcais, G., Pop, M.

and Yorke, J.A. (2012) GAGE: A critical evaluation of genome assemblies and

assembly algorithms, Genome Research, 22: 557-567

[55] Paszkiewicz, K. and Studholme, D.J. (2010) De novo assembly of short se-

quence reads, Briefings In Bioinformatics, 11, 5, 457-472

[56] Zhang, W., Chen, J., Yang, Y., Tang, Y., Shang, J. and Shen, B. (2011) A Prac-

tical Comparison of De Novo Genome Assembly Software Tolls for Next-

generation Sequencing Technologies, PLoS ONE, 6(3): e17915.

[57] Scheibye-Alsing, K., Hoffmann, S., Frankel, A., Jensen, P., Stadler, P.F., Mang,

Y., Tommerup, N., Gilchrist, M.J., Nygard, A.B., Jorgensen, C.B., Fredholdm,

M. and Gorodkin, J. (2009) Sequence assembly, Computational Biology and

Chemistry, 33, 121-136

141

[58] Staden, R. (1980) A new computer method for the storage and manipulation of

DNA gel reading data, Nucleic Acid Research, 8(16): 3673-3694

[59] Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Ber-

ger, B., Mesirov, J.P. and Lander, E.S. (2002) ARACHNE: a whole-genome

shotgun assembler, Genome Research, 12: 177-189

[60] Havlak, P., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Weinstock,

G.M. and Gibbs, R.A. (2004) The Atlas genome assembly system, Genome Re-

search, 14: 721-732

[61] Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan,

M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H.J., Remington, K.A., Anson,

E.L., Bolanos, R.A., Chou, H.H., Jordan, C.M., Halpern, A.L., Lonardi, S.,

Beasley, E.M., Brandon, R.C., Chen, L., Dunn, P.J., Lai, Z., Liang, Y., Nusskern,

D.R., Zhan, M., Zhang, Q., Zheng, X., Rubin, G.M., Adams, M.D. and Venter,

J.C. (2000) A whole-genome assembly of Drosophila, Science, 287: 2196-2204

[62] Huang, X. and Madan, A. (1999) CAP3: A DNA sequence assembly program,

Genome Research, 9: 868-877

[63] Pevzner, P.A., Tang, H. and Tesler, G. (2004) De novo repeat classification and

fragment assembly, Genome Research, 14: 1786-1796

[64] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben,

L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L.,

Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H.,

Irzyk, G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.B.,

Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J.,

Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., McKenna, M.P.,

Myers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth,

G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R.,

Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P.,

Yu, P., Begley, R.F. and Rothberg, J.M. (2005) Genome sequencing in open

microfabricated high density picoliter reactors, Nature, 437: 376-380

[65] Huang, X., Wang, J., Aluru, S., Yang, S.P. and Hillier, L. (2003) PCAP: a

whole-genome assembly program, Genome Research, 13: 2164-2170

[66] Mullikin, J.C. and Ning, Z. (2003) The phusion assembler, Genome Research,

13: 81-90

[67] Bastide, M.D.L. and McCombie, W.R. (2007) Assembling genomic DNA

sequences with PHRAP, Curr Protoc Bioinformatics, Chapter 11, Unit 11.4

[68] Wang, J., Wong, G.K.S., Ni, P., Han, Y., Huang, X., Zhang, J., Ye, C., Zhang,

Y., Hu, J., Zhang, K., Xu, X., Cong, L., Lu, H., Ren, X., Ren, X., He, J., Tao,

L., Passey, D.A., Wang, J., Yang, Y., Yu, J. and Li, S. (2002) RePS: a sequence

assembler that masks exact repeats identified from the shotgun data, Genome

Research, 12: 824-831

[69] Pevzner, P.A., Tang, H. and Waterman, M.S. (2001) An Eulerian path approach

to DNA fragment assembly, PNAS, 98, 17, 9748-9753

[70] Pevzner, P.A. and Tang, H. (2001) Fragment assembly with double-barreled data,

Bioinformatics, 17:S225-233

[71] Butler, J. et al. (2008) ALLPATHS: de novo assembly of whole-genome shot-

gun microreads, Genome Research, 18: 810-820

[72] MacCallum, I., Przybylski, D., Gnerre, S., Burton, J., Shlyakhter, I., Gnirke, A.,

Malek, J., McKernan, K., Ranade, S., Shea, T.P., Williams, L., Young, S.,

Nusbaum, C. and Jaffe, D.B. (2009) ALLPATHS 2: small genomes assembled

accurately and with high continuity from short paired reads, Genome Biology, 10:

R103

[73] Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker,

B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., Berlin, A.M., Aird, D., Costello,

M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E.S.

and Jaffe, D.B. (2011) High-quality draft assemblies of mammalian genomes

from massively parallel sequence data. Proc Natl Acad Sci U S A, 108, 1513-

1518.

[74] Zerbino, D.R. and Birney, E. (2008) Velvet: algorithms for de novo short read

assembly using de Bruijn graphs, Genome Research, 18: 821-829

[75] Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J.M. and Birol, I.

(2009) ABySS: A parallel assembler for short read sequence data. Genome Re-

search, 19: 1117-1123

[76] Pevzner, P.A., Tang, H. and Tesler, G. (2004) De novo repeat classification and

fragment assembly, Genome Research, 14: 1786-1796

[77] Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G.,

Kristiansen, K., Li, S., Yang, H., Wang, J. and Wang, J. (2010) De novo

assembly of human genomes with massively parallel short read sequencing.

Genome Res, 20, 265-272.

[78] Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q.,

Liu, Y., Tang, J., Wu, G., Zhang, H., Shi, Y., Liu, Y., Yu, C., Wang, B., Lu, Y.,

Han, C., Cheung, D.W., Yiu, S.M., Peng, S., Zhu, X., Liu, G., Liao, X., Li, Y.,

Yang, H., Wang, J., Lam, T.W. and Wang, J. (2012) SOAPdenovo 2: an empiri-

cally improved memory-efficient short-read de novo assembler, GigaScience, 1:

18

[79] Conway, T.C. and Bromage, A.J. (2011) Succinct data structure for assembling

large genomes, Bioinformatics, 27: 479-486

[80] Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B.,

Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J.,

Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O.A., Leung,

F.C.C., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou,

R., Shen, F., Mu, B., Ni, R., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W.,

Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M.,

Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G.,

Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C.C., Lam,

T.T.T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang,

L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M.W., Li,

Q., Ma, L., Guo, Y., Na, A., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen,

Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X.,

Vinar, T., Wang, Y., Lam, T.W., Yiu, S.M., Liu, S., Zhang, H., Li, D., Huang,

Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, L.,

Kristiansen, K., Wong, G.K.S., Olson, M., Zhang, X., Li, S., Yang, H. and

Wang, J. (2010) The sequence and de novo assembly of the giant panda

genome, Nature, 463: 311-317

[81] Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., Lucas, W.J., Wang, X.,

Xie, B., Ni, P., Ren, Y., Zhu, H., Li, J., Lin, K., Jin, W., Fei, Z., Li, G., Staub, J.,

Kilian, A., Vossen, E.A.G.V.D., Wu, Y., Guo, J., He, J., Jia, Z., Ren, Y., Tian,

G., Lu, Y., Ruan, J., Qian, W., Wang, M., Huang, Q., Li, B., Xuan, Z., Cao, J.,

Asan, Wu, Z., Zhang, J., Cai, Q., Bai, Y., Zhao, B., Han, Y., Li, Y., Li, X.,

Wang, S., Shi, Q., Liu, S., Cho, W.K., Kim, J., Xu, Y., Heller-Uszynska, K.,

Miao, H., Cheng, Z., Zhang, S., Wu, J., Yang, Y., Kang, H., Li, M., Liang, H.,

Ren, X., Shi, Z., Shi, Z., Wen, M., Jian, M., Yang, H., Zhang, G., Yang, Z.,

Chen, R., Liu, S., Li, J., Ma, L., Liu, H., Zhou, Y., Zhao, J., Fang, X., Li, G.,

Fang, L., Li, Y., Liu, D., Zheng, H., Zhang, Y., Qin, N., Li, Z., Zhang, X.,

Yang, H., Wang, J., Sun, R., Zhang, B., Jiang, S., Wang, J., Du, Y. and Li, S.

(2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics, 41:

1275-1281

[82] Warren, R.L., Sutton, G.G., Jones, S.J. and Holt, R.A. (2007) Assembling

millions of short DNA sequences using SSAKE, Bioinformatics, 23: 500-501

[83] Dohm, J.C., Lottaz, C., Borodina, T. and Himmelbauer, H. (2007) SHARCGS, a

fast and highly accurate short-read assembly algorithm for de novo genomic

sequencing, Genome Research, 17: 1697-1706

[84] Jeck, W.R., Reinhardt, J.A., Baltrus, D.A., Hickenbotham, M.T., Magrini, V.,

Mardis, E.R., Dangl, J.L. and Jones, C.D. (2007) Extending assembly of short

DNA sequences to handle error, Bioinformatics, 23:2942-2944

[85] Bryant, D.W., Wong, W.K. and Mockler, T.C. (2009) QSRA: a quality-value

guided de novo short read assembler, BMC Bioinformatics, 10: 69

[86] Simpson, J.T. and Durbin, R. (2012) Efficient de novo assembly of large ge-

nomes using compressed data structures, Genome Research, 22: 549-556

[87] Boisvert, S., Laviolette, F. and Corbeil, J. (2010) Ray: Simultaneous Assembly

of Reads from Mix of High-Throughput Sequencing Technologies, Journal of

Computational Biology, 17, 11, 1519-1533

[88] Schmidt, B., Sinha, R., Smith, B.B. and Puglisi, S.J. (2009) A fast hybrid short

read fragment assembly algorithm, Bioinformatics, 25, 17, 2279-2280

[89] Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D. and Pirovano, W. (2011)

Scaffolding pre-assembled contigs using SSPACE, Bioinformatics, 27: 578-579

[90] Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009) Pebble

and rock band: heuristic resolution of repeats and scaffolding in the velvet

short-read de novo assembler, PLoS One, 4: e8407

[91] Dayarian, A., Michael, T.P. and Sengupta, A.M. (2010) SOPRA: scaffolding

algorithm for paired reads via statistical optimization, BMC Bioinformatics, 11:

345

[92] Koren, S., Miller, J.R., Walenz, B. and Sutton, G. (2010) An algorithm for au-

tomated closure during assembly, BMC Bioinformatics, 11: 457

[93] Tsai, I.J., Otto, T.D. and Berriman, M. (2010) Improving draft assemblies by

iterative mapping and assembly of short reads to eliminate gaps, Genome Biolo-

gy, 11: R41

[94] Chaisson, M.J., Brinza, D. and Pevzner, P.A. (2009) De novo fragment assem-

bly with short mate-paired reads: Does the read length matter? Genome Re-

search, 19: 336-346

[95] Earl, D., Bradnam, K., John, J.S., Darling, A., Lin, D., Fass, J., Yu, H.O.K.,

Buffalo, V., Zerbino, D.R., Diekhans, M., Nguyen, N., Ariyaratne, P.N., Sung,

W.K., Ning, Z., Haimel, M., Simpson, J.T., Fonseca, N.A., Birol, I., Docking,

T.R., Ho, I.Y., Rokhsar, D.S., Chikhi, R., Lavenier, D., Chapuis, G., Naquin, D.,

Maillet, N., Schatz, M.C., Kelley, D.R., Phillippy, A.M., Koren, S., Yang, S.P.,

Wu, W., Chou, W.C., Srivastava, A., Shaw, T.I., Ruby, J.G., Skewes-Cox, P.,

Betegon, M., Dimon, M.T., Solovyev, V., Seledtsov, I., Kosarev, P., Vorobyev,

D., Ramirez-Gonzalez, R., Leggett, R., Maclean, D., Xia, F., Luo, R., Li, Z.,

Xie, Y., Liu, B., Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Yin,

S., Sharpe, T., Hall, G., Kersey, P.J., Durbin, R., Jackman, S.D., Chapman, J.A.,

Huang, X., DeRisi, J.L., Caccamo, M., Li, Y., Jaffe, D.B., Green, R.E.,

Haussler, D., Korf, I. and Paten, B. (2011) Assemblathon 1: A competitive

assessment of de novo short read assembly methods, Genome Research, 21:

2224-2241.

[96] Parra, G., Bradnam, K., Ning, Z., Keane, T. and Korf, I. (2009) Assessing the

gene space in draft genomes, Nucleic Acids Research, 37: 289-297

[97] Schatz, M.C., Delcher, A.L. and Salzberg, S.L. (2010) Assembly of large

genomes using second-generation sequencing, Genome Research, 20: 1165-1173

[98] Bresler, M., Sheehan, S., Chan, A.H. and Song, Y.S. (2012) Telescoper: de novo

assembly of highly repetitive regions, Bioinformatics, 28: i311-i317.

[99] Pop, M. (2009) Genome assembly reborn: recent computational challenges,

Briefings in Bioinformatics, 10, 4, 354-366

[100] Myers, E.W. (2005) The fragment assembly string graph, Bioinformatics, 21:

ii79-ii85

[101] Metzker, M.L. (2010) Sequencing technologies - the next generation. Nat Rev

Genet, 11, 31-46.

[102] Miller, J.R., Koren, S. and Sutton, G. (2010) Assembly algorithms for next-

generation sequencing data. Genomics, 95, 315-327.

[103] El-Metwally, S., Hamza, T., Zakaria, M. and Helmy, M. (2013) Next-

generation sequence assembly: four stages of data processing and

computational challenges. PLoS Comput Biol, 9, e1003345.

[104] Ohno, S. (1972) So much "junk" DNA in our genome. Brookhaven Symp Biol,

23, 366-370.

[105] Doolittle, W.F. and Sapienza, C. (1980) Selfish genes, the phenotype

paradigm and genome evolution. Nature, 284, 601-603.

[106] Orgel, L.E. and Crick, F.H. (1980) Selfish DNA: the ultimate parasite. Nature,

284, 604-607.

[107] Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J.,

Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial

sequencing and analysis of the human genome. Nature, 409, 860-921.

[108] Rinn, J.L. and Chang, H.Y. (2012) Genome regulation by long noncoding

RNAs. Annu Rev Biochem, 81, 145-166.

[109] Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte,

M., Zuk, O., Carey, B.W., Cassady, J.P. et al. (2009) Chromatin signature

reveals over a thousand highly conserved large non-coding RNAs in

mammals. Nature, 458, 223-227.

[110] Rollins, R.A., Haghighi, F., Edwards, J.R., Das, R., Zhang, M.Q., Ju, J. and

Bestor, T.H. (2006) Large-scale structure of genomic methylation patterns.

Genome Res, 16, 157-163.

[111] Smit, A.F. (1996) The origin of interspersed repeats in the human genome.

Curr Opin Genet Dev, 6, 743-748.

[112] Subramanian, S., Mishra, R.K. and Singh, L. (2003) Genome-wide analysis

of microsatellite repeats in humans: their abundance and density in specific

genomic regions. Genome Biol, 4, R13.

[113] Gemayel, R., Vinces, M.D., Legendre, M. and Verstrepen, K.J. (2010)

Variable tandem repeats accelerate evolution of coding and regulatory

sequences. Annu Rev Genet, 44, 445-477.

[114] O'Dushlaine, C.T., Edwards, R.J., Park, S.D. and Shields, D.C. (2005)

Tandem repeat copy-number variation in protein-coding regions of human

genes. Genome Biol, 6, R69.

[115] Mularoni, L., Guigo, R. and Alba, M.M. (2006) Mutation patterns of amino

acid tandem repeats in the human proteome. Genome Biol, 7, R33.

[116] McMurray, C.T. (2010) Mechanisms of trinucleotide repeat instability during

human development. Nat Rev Genet, 11, 786-799.

[117] MacDonald, M.E., et al. (1993) A novel gene containing a trinucleotide

repeat that is expanded and unstable on Huntington's disease chromosomes.

The Huntington's Disease Collaborative Research Group. Cell, 72, 971-983.

[118] La Spada, A.R., Wilson, E.M., Lubahn, D.B., Harding, A.E. and Fischbeck,

K.H. (1991) Androgen receptor gene mutations in X-linked spinal and bulbar

muscular atrophy. Nature, 352, 77-79.

[119] Hannan, A.J. (2010) Tandem repeat polymorphisms: modulators of disease

susceptibility and candidates for 'missing heritability'. Trends Genet, 26, 59-

65.

[120] Boland, C.R., Thibodeau, S.N., Hamilton, S.R., Sidransky, D., Eshleman, J.R.,

Burt, R.W., Meltzer, S.J., Rodriguez-Bigas, M.A., Fodde, R., Ranzani, G.N.

et al. (1998) A National Cancer Institute Workshop on Microsatellite

Instability for cancer detection and familial predisposition: development of

international criteria for the determination of microsatellite instability in

colorectal cancer. Cancer Res, 58, 5248-5257.

[121] Press, M.O., Carlson, K.D. and Queitsch, C. (2014) The overdue promise of

short tandem repeat variation for heritability. Trends Genet, 30, 504-512.

[122] Walsh, P.S., Fildes, N.J. and Reynolds, R. (1996) Sequence analysis and

characterization of stutter products at the tetranucleotide repeat locus vWA.

Nucleic Acids Res, 24, 2807-2812.

[123] Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan,

Q., Liu, Y. et al. (2012) SOAPdenovo2: an empirically improved memory-

efficient short-read de novo assembler. Gigascience, 1, 18.

[124] Gymrek, M., Golan, D., Rosset, S. and Erlich, Y. (2012) lobSTR: A short

tandem repeat profiler for personal genomes. Genome Res, 22, 1154-1162.

[125] Highnam, G., Franck, C., Martin, A., Stephens, C., Puthige, A. and Mittelman,

D. (2013) Accurate human microsatellite genotypes from high-throughput

resequencing data using informed error profiles. Nucleic Acids Res, 41, e32.

[126] Cao, M.D., Tasker, E., Willadsen, K., Imelfort, M., Vishwanathan, S.,

Sureshkumar, S., Balasubramanian, S. and Boden, M. (2014) Inferring short

tandem repeat variation from paired-end short reads. Nucleic Acids Res, 42,

e16.

[127] Noe, L. and Kucherov, G. (2005) YASS: enhancing the sensitivity of DNA

similarity search. Nucleic Acids Res, 33, W540-543.

[128] Koren, S., Treangen, T.J. and Pop, M. (2011) Bambus 2: scaffolding

metagenomes. Bioinformatics, 27, 2964-2971.

[129] Miller, J.R., Delcher, A.L., Koren, S., Venter, E., Walenz, B.P., Brownley, A.,

Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly

of pyrosequencing reads with mates. Bioinformatics, 24, 2818-2824.

[130] Benson, G. (1999) Tandem repeats finder: a program to analyze DNA

sequences. Nucleic Acids Res, 27, 573-580.

[131] Bao, Z. and Eddy, S.R. (2002) Automated de novo identification of repeat

sequence families in sequenced genomes. Genome Res, 12, 1269-1276.

[132] Koch, P., Platzer, M. and Downie, B.R. (2014) RepARK--de novo creation of

repeat libraries from whole-genome NGS reads. Nucleic Acids Res, 42, e80.

[133] Darzentas, N. (2010) Circoletto: visualizing sequence similarity with Circos.

Bioinformatics, 26, 2620-2621.

[134] Dear, S. and Staden, R. (1992) A standard file format for data from DNA

sequencing instruments, DNA Seq., 3(2): 107-10.

[135] Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1998) Base-calling of

automated sequencer traces using phred. I. Accuracy assessment, Genome

Research, 8 (3): 175-185.

[136] Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces

using phred. II. Error probabilities, Genome Research, 8 (3): 186-194.

[137] Richterich, P. (1998) Estimation of errors in raw DNA sequences: a valida-

tion study, Genome Research, 8 (3): 251-259.

[138] Schneeberger, K., Ossowski, S., Ott, F., Klein, J.D., Wang, X., Lanz, C.,

Smith, L.M., Cao, J., Joffrey, F., Warthmann, N., Henz, S.R., Huson, D.H.

and Weigel, D. (2011) Reference-guided assembly of four diverse Arabidopsis

thaliana genomes, PNAS, vol. 108 no. 25.

[139] Pelak, K., Shianna, K.V., Ge, D., Maia, J.M., Zhu, M., Smith, J.P., Cirulli,

E.T., Fellay, J., Dickson, S.P., Gumbs, C.E., Heinzen, E.L., Need, A.C.,

Ruzzo, E.K., Singh, A., Cambell, C.Y., Hong, L.K., Lornsen, K.A., McKenzie,

A.M., Sobreira, N.L.M., Hoover-Fong, J.E., Milner, J.D., Ottman, R., Haynes,

B.F., Goedert, J.J. and Goldstein, D.B. (2010) The characterization of twenty

sequenced human genomes. PLoS Genet 6:e1001111.

[140] Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M., and Yorke, J. A. (2004)

Reducing storage requirements for biological sequence comparison.

Bioinformatics 20, 3363–3369.
