Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Repetitive DNA sequence assembly
by
Yongqing Jiang
Bachelor of IT (Honours)
Submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
Deakin University
November, 2017
IV
Acknowledgements
I would like to thank my supervisor, Professor Wanlei Zhou for his continuing
guidance, encouragement and support. I would like to show my deepest respect for
his personality and I am sincerely gratitude for his patience during my candidature. I
would like to thank my co-supervisor, Dr. Jingyu Hou. I could not be here without
your guidance and encourage during my honours and PhD candidature. I have learnt
how to conduct experiments, write papers and much more from you.
I would like to thank my colleagues and friends who accompany by my side dur-
ing my ups and downs: Ashish Saini, Chao Chen, Sheng Wen, Longxiang Gao, Yu
Wang, Tianqing Zhu, Xiao Chen, Derui Wang, Jiaojiao Jiang, Tianrui Zong,
Youyang Qu, Lichuan Ma and much more.
I am extremely grateful Deakin University for offering DUPR international schol-
arship, which gives me the opportunity to start my PhD without financial issues.
Last but not least, I would like to thank my parents for their endless love, care and
unconditional support. You share my happiness and sadness anytime anywhere. I
own my deepest gratitude to you.
V
Repetitive DNA sequence assembly
Mr. Yongqing Jiang
Submitted for the degree of Doctor of Philosophy
November 2017
Abstract
With the advent of next generation of sequencing, geneticists are able to obtain
high-throughput of sequence data in timely. The length of these sequence reads typi-
cally range from 20 base pairs to 150 base pairs. These short reads generate issues to
existing assembly methods which performed well on long reads from the first gen-
eration sequencing. We reviewed the performance of existing assembly methods to
NGS data. The results show: 1) most tandem repetitive DNA sequences are poorly
assembled; 2) DNA transposons are either partially assembled or incorrectly assem-
bled; 3) most retro-transposons are not assembled. However, eukaryotic genomes
contain high volumes of intronic and intergenic regions in which repetitive sequences
are abundant. These repetitive sequences represent challenges in genome assembly of
short reads and are often excluded in analysis losing invaluable genomic information.
This thesis proposes novel computational methods to assemble repetitive DNA
sequences using next generation sequencing (NGS) data. Previous methods focus on
whole-genome assembly, so they treat repetitive regions and non-repetitive regions
as same when assembling. Our methods are distinct from previous methods since
they dedicate to repetitive sequence assembly. Specifically, our methods solve three
categories of repetitive sequence: 1) tandem repetitive sequence; 2) DNA transposon
and 3) retrotransposon. Our work mainly consists three parts. First, we propose a se-
ries of novel methods to assemble repetitive DNA sequences. Second, we measure
the performance of our methods by comparing to previous methods and also compar-
ing to reference genome. Experiments based on real dataset show our proposed
VI
methods outperform other methods and are more accurate when comparing to refer-
ence genome. Three, we carry out further analysis on our assembled results. The re-
sults show: 1) variations are existing in our assembled sequences; 2) there are abun-
dant repetitive sequences at nucleic associated domain; 3) retrotransposons can be
correctly assembled but still hard to completely assemble due to their wide distribu-
tion.
Keywords: NGS, Sequence assembly, Repetitive DNA sequence, Transposon
VII
Table of Contents
Acknowledgements .................................................................................................... IV
Abstract ....................................................................................................................... V
List of Figures ............................................................................................................. X
List of Tables ........................................................................................................... XIII
Acronyms ................................................................................................................. XV
1 Introduction ........................................................................................................... 1
1.2 Background .................................................................................................. 1
1.3 Motivation and Research issues ................................................................... 2
1.4 Overview the work and its contribution ...................................................... 7
1.5 Thesis organization ...................................................................................... 8
2 Literature Review ................................................................................................ 10
2.1 Review of sequencing technology ............................................................. 10
2.1.1 Description of main sequencing technologies ............................... 11
2.1.2 Technical comparison of sequencing technologies ........................ 14
2.2 Review of data in sequence assembly ........................................................ 15
2.2.1 Types of reads ................................................................................ 15
2.2.2 Repeat reference ............................................................................. 19
2.3 Review of assembly methods .................................................................... 21
2.3.1 Assembly Methods overview ......................................................... 22
2.3.2 Overlap-Layout-Consensus Based methods .................................. 24
2.3.3 De Bruijn Graph based methods .................................................... 26
2.3.4 Measurement .................................................................................. 29
2.3.5 Issues of Assembly methods .......................................................... 32
3 Innovative method of tandem repetitive sequence assembly .............................. 34
3.1 Introduction ................................................................................................ 34
VIII
3.2 Methods ..................................................................................................... 37
3.2.1 Tandem repeat definition ............................................................... 37
3.2.2 Algorithm overview ....................................................................... 38
3.2.3 Merge of paired-end reads ............................................................. 41
3.2.4 Candidate reads selection ............................................................... 41
3.2.5 Contig generation method .............................................................. 42
3.3 Result ......................................................................................................... 43
3.3.1 Contig generation ........................................................................... 43
3.3.2 Correctness of assembly................................................................. 44
3.3.3 Contig distribution and SNP verification ....................................... 47
3.3.4 Discussion ...................................................................................... 50
3.4 Summary .................................................................................................... 53
4 Innovative assembly method of DNA transposon ............................................... 54
4.1 Introduction ................................................................................................ 54
4.2 Method ....................................................................................................... 56
4.2.1 Overview and basic terms .............................................................. 56
4.2.2 Seed selection ................................................................................. 58
4.2.3 Computing sequence matching score ............................................. 64
4.2.4 Extension algorithm ....................................................................... 66
4.3 Results ........................................................................................................ 68
4.3.1 Parameter setting ............................................................................ 69
4.3.2 Performance comparison between ABySS2, Velvet, and
SOAPdenovo ............................................................................... 74
4.3.3 Distribution and variation .............................................................. 81
4.4 Summary .................................................................................................... 88
5 Innovative assembly method of retro transposon ................................................ 89
5.1 Introduction ................................................................................................ 89
5.2 Method ....................................................................................................... 90
5.2.1 Overview and basic terms .............................................................. 90
5.2.2 De Bruijn graph generation ............................................................ 92
5.2.3 Simple path algorithm .................................................................... 95
5.2.4 Select LTR feature seeds ................................................................ 98
IX
5.2.5 Select Non-LTR feature seeds ..................................................... 101
5.2.6 Gap filling method ....................................................................... 104
5.3 Result ....................................................................................................... 106
5.3.1 Parameter setting .......................................................................... 107
5.3.2 Performance comparison between ABySS2, Velvet, and
SOAPdenovo ............................................................................. 116
5.3.3 Assemble nucleolus associated domain using RA method .......... 122
5.4 Summary .................................................................................................. 128
6 Conclusion and Future work ............................................................................. 129
6.1 Contributions ........................................................................................... 129
6.2 Future works ............................................................................................ 131
Bibliography ............................................................................................................. 132
X
List of Figures
Figure 1.1: Repetitive DNA sequence makes gap. ....................................................... 3
Figure 1.2: Repetitive DNA sequence in de Bruijn graph based methods. .................. 4
Figure 1.3: Repetitive DNA sequence makes issue of losing information .................. 5
Figure 2.1: Generation of single end reads. ................................................................ 16
Figure 2.2: Generation of paired end reads. ............................................................... 17
Figure 2.3: Generation of mate pair reads. ................................................................. 19
Figure 2.4: Illustration of OLC based method. .......................................................... 25
Figure 2.5: Demonstration of de Bruijn graph generation. ........................................ 27
Figure 3.1: Tandem repeats in human chromosome 14. ............................................ 38
Figure 3.2: Workflow of tandem repeat assembler. ................................................... 40
Figure 3.3: Correctness comparison of contigs assembled by different methods for
tandem repeats. ........................................................................................ 47
Figure 3.4: SNP verification and potential new SNP loci. ......................................... 48
Figure 3.5: The distribution of correct contigs. .......................................................... 49
Figure 3.6: Influence of reads quality on assembly. ................................................... 52
Figure 4.1: Types of transposable elements and their structures. .............................. 55
Figure 4.2: Overview of DNA transposon assembly method. ................................... 58
Figure 4.3: Dot matrix of two sequences. .................................................................. 59
Figure 4.4: Dot matrix of two sequences which have DNA transposon features. ..... 64
Figure 4.5: Encoding scheme of different sequencing technologies. ......................... 65
Figure 4.6: Sequence extension demonstration. ......................................................... 67
Figure 4.7: Assembly performance on different .................................................... 70
Figure 4.8: Assembly performance on different . .................................................... 71
Figure 4.9: Assembly performance on different . ................................................. 71
Figure 4.10: Assembly performance on different . ................................................ 72
Figure 4.11: Assembly performance on different . ................................................. 73
Figure 4.12: Assembly performance on different . .................................................. 73
XI
Figure 4.13: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo
method on Human chromosome 14. ..................................................... 75
Figure 4.14: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method
on Human chromosome 14. .................................................................. 76
Figure 4.15: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo
method on Human chromosome 14. ..................................................... 77
Figure 4.16: The precision-recall curves of DTA, ABySS2, Velvet, and
SOAPdenovo method on Maize chromosome 8. .................................. 78
Figure 4.17: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo
method on Maize chromosome 8. ......................................................... 79
Figure 4.18: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method
on Maize chromosome 8. ...................................................................... 80
Figure 4.19: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo
method on Maize chromosome 8. ......................................................... 80
Figure 4.20: Assembly result of MamRep434 mapping to reference. ....................... 81
Figure 4.21: Assembly result of 28 DNA transposons mapping to reference. ........... 82
Figure 4.22: Assembly result of 103 DNA transposons mapping to reference. ......... 83
Figure 4.23: Assembly result of 163 DNA transposons mapping to reference. ......... 84
Figure 4.24: Frequency of indel lengths and variation of coding sequence lengths in
chromosome 14. .................................................................................... 87
Figure 5.1: Overview of retrotransposon assembly method ....................................... 92
Figure 5.2: De Bruijn graph with k = 24 of sequence
ATCGAACGTTTTACCAGCGATTCA. ............................................. 94
Figure 5.3: Demonstration of simple path generation process. .................................. 98
Figure 5.4: Dot matrix of two sequences which have LTR features. ......................... 99
Figure 5.5: Dot matrix of two sequences which have Non-LTR features. ............... 102
Figure 5.6: Demonstration of contig generation of retrotransposon ........................ 105
Figure 5.7: Performance of retrotransposon assembly method on different . ....... 108
Figure 5.8: Performance of retrotransposon assembly method on different . ....... 109
Figure 5.9: Performance of retrotransposon assembly method on different . ..... 110
Figure 5.10: Performance of retrotransposon assembly method on different ..... 111
Figure 5.11: Performance of retrotransposon assembly method on different . ..... 112
XII
Figure 5.12: Performance of retrotransposon assembly method on different . ...... 113
Figure 5.13: Performance of retrotransposon assembly method on different . ....... 114
Figure 5.14: Performance of retrotransposon assembly method on different . ....... 114
Figure 5.15: Performance of retrotransposon assembly method on different . ..... 115
Figure 5.16: Performance of retrotransposon assembly method on different . ....... 116
Figure 5.17: Precision comparison of RA, ABySS2, Velvet, and SOAPdenovo
method on Human chromosome 14. ................................................... 117
Figure 5.18: Recall comparison of RA, ABySS2, Velvet, and SOAPdenovo method
on Human chromosome 14. ................................................................ 118
Figure 5.19: F-value comparison of RA, ABySS2, Velvet, and SOAPdenovo method
on Human chromosome 14. ................................................................ 118
Figure 5.20: Assembly result of PABL_A (LTR type retrotransposon) mapping to
reference. ............................................................................................. 119
Figure 5.21: Assembly result of AluSx (SINE, non-LTR type retrotransposon)
mapping to reference. .......................................................................... 120
Figure 5.22: Assembly result of L1PB (LINE, non-LTR type retrotransposon)
mapping to reference. .......................................................................... 121
Figure 5.23: Simple path of nucleolus-associated domains. .................................... 122
Figure 5.24: NAD gene distribution. ........................................................................ 126
XIII
List of Tables
Table 2.1: Description of main sequencing technologies. .......................................... 12
Table 2.2: Comparison of sequencing technologies ................................................... 15
Table 2.3: Description of repetitive DNA sequences. ................................................ 19
Table 2.4: Databases of repetitive DNA sequence. ................................................... 20
Table 2.5: Overview of assembly methods. ............................................................... 23
Table 2.6: Measurement of assembly methods [95]. .................................................. 30
Table 3.1: Assembly of tandem repeats in human Chromosome 14. ......................... 43
Table 3.2: Summary of contigs assembled by different assemblers for tandem repeats.
................................................................................................................. 44
Table 3.3: Summary of correct assembly of assemblers for tandem repeats. ............ 45
Table 4.1: Dot matrix based alignment method. ........................................................ 60
Table 4.2: DNA transposon feature finding method. ................................................. 61
Table 4.3: The meaning of Phred quality score .......................................................... 64
Table 4.4: Parameter setting on . ............................................................................ 70
Table 4.5: Parameter setting on . ............................................................................. 70
Table 4.6: Parameter setting on . .......................................................................... 71
Table 4.7: Parameter setting on . ........................................................................... 72
Table 4.8: Parameter setting on . ............................................................................ 72
Table 4.9: Parameter setting on . ............................................................................. 73
Table 4.10: Summary of DNA transposons assembled by different assemblers ....... 78
Table 4.11: Deletion and insertion of different length in chromosome 14. ............... 86
Table 4.12: Highly diverged regions of different length in chromosome 14. ............ 86
Table 5.1: Simple path generation method. ................................................................ 96
Table 5.2: Long terminal repeat feature finding method .......................................... 100
Table 5.3: Simple repeat finding method ................................................................. 103
Table 5.4: Non – LTR feature finding method ......................................................... 103
Table 5.5: Contig generation method ....................................................................... 106
XIV
Table 5.6: Parameter setting on . ........................................................................... 108
Table 5.7: Parameter setting on . ........................................................................... 109
Table 5.8: Parameter setting on . ......................................................................... 110
Table 5.9: Parameter setting on . ......................................................................... 111
Table 5.10: Parameter setting on . ......................................................................... 112
Table 5.11: Parameter setting on . ......................................................................... 112
Table 5.12: Parameter setting on . .......................................................................... 113
Table 5.13: Parameter setting on . ........................................................................... 114
Table 5.14: Parameter setting on . ......................................................................... 115
Table 5.15: Parameter setting on . ........................................................................... 115
Table 5.16: Statistics of assembled NAD in Human genome .................................. 124
Table 5.17: No of gene in NAD ............................................................................... 125
Table 5.18: Genes that have functions only in NAD. .............................................. 127
Table 5.19: 10 gene functions only associated with NAD ....................................... 127
Table 5.20: 10 gene functions not associated with NAD ......................................... 128
XV
Acronyms
BP Base Pair
DR Directed Repeat
DTA DNA Transposon Assembler
Env Envelop
Gag Group-specific Antigen
LINE Long Interspersed Nuclear Element
LTR Long Terminal Repeat
NAD Nucleolus Associated Domain
NGS Next Generation Sequencing
ORF Open Reading Frame
Pol Polymerase
Prt Protease
RA Retrotransposon Assembler
SINE Short Interspersed Nuclear Element
SNP Single Nucleotide Polymorphism
TIR Terminal Inverted Repeat
TE Transposable Element
TRA Tandem Repeat Assembler
TSD Target Site Duplication
UTR Untranslated Regi
1
Chapter 1
1 Introduction
1.2 Background
DNA (Deoxyribonucleic acid) encodes the genetic information of living organ-
isms. It consists of 4 basic nucleotides, adenine (A), guanine (G), cytosine (C) and
thymine (T). Determining the sequence of DNA in an organism is the essential step
to do downstream biological analysis, such as finding pathogenic genes. Sequencing
technology plays the role of reading the sequence of a given DNA. The first genera-
tion of sequencing technology started in 1977, with the invention of Sanger’s method
[1]. This method combined with whole genome shotgun sequencing was the main
method by which first human genome had been sequenced in 2003. However, it took
3-4 years and 300 million dollars sequencing 3 billion base pairs of human genome
with read length of ~900 base pairs [2]. The obvious problem here is how to assem-
ble these fragment reads back into complete DNA sequence.
Sequence assembly methods were introduced to address this issue. With massive
parallel sequencing emerging from 2004, represented by Roche 454, Illumina GA
2
and ABI SOLiD, high-throughput reads have been yielded greater than Sanger’s
method by several orders of magnitude. We call these sequencing technologies as
next generation sequencing, because they have faster sequencing speed and lower
cost on producing reads, comparing to Sanger’s method. With next generation se-
quencing (NGS), a human genome can be sequenced in a week. Meantime, NGS
generates ~300 million reads with even shorter length, ~100 base pairs. This makes a
big data issue and most importantly raises problems of assembling repetitive DNA
sequences.
Repetitive DNA sequence is a natural feature of genome in most organisms [3]. It
takes a large proportion of DNA sequence. For example, it comprises 50% of human
genome, 37% of mouse genome, up to 80% of maize genome [4], even the whole ge-
nome of Arabidopsis thaliana [5]. These repeat sequences range size from 1 base to
millions bases, and differ function in simple tandem repeats and transposable ele-
ments [6]. In this thesis, we only concern repetitive DNA sequences of which length
are longer than read length.
In terms of the pattern of repetitive DNA sequence, there are repeat patterns re-
peating many times adjacent to each other in a DNA sequence, or same repeat pattern
spreading over multiple places across whole genome. However, NGS generates short
read length. This feature of NGS data challenges sequence assembler because short
read contains limited information which is not able to assemble repetitive regions.
Sequence assembly methods are based on overlap information of reads. Thus, limited
overlap information leads a poor assembly result with gaps and errors, especially for
the regions of repetitive DNA sequences.
1.3 Motivation and Research issues
To date, Human reference genome comes to its 37th version. However, there are
still gaps in this reference [7]. It means we still cannot assemble a complete genome
sequence. From sequence assembly point of view, repetitive DNA sequence corners
assembly algorithms to dilemma. Especially for NGS reads, short length cannot cov-
er a long complete repeat sequence. What makes it worse is that transposable ele-
ments widely distributed in genome, making location ambiguous. We list the issues
of repetitive DNA sequence to assembly methods as follows:
3
Repetitive DNA sequence makes gap in assembly result. Repeats are usu-
ally ignored in existing assembly methods. The reason is that most repeti-
tive DNA sequences are longer than sequence read. To date, the capability
of sequencing technology can only read 50-150 base pairs per time. And
this length cannot cover most repetitive DNA sequences of which the
length are ranging from 200-500 base pairs [5]. It means that we can hard-
ly assemble repetitive DNA sequence. Especially in De novo assembly
methods, we have to assemble sequence read without reference sequence.
In this case, we have no knowledge about what repeats are in genome se-
quence. To avoid mistakes of assembling repetitive regions, sequence
reads that potentially related to repetitive regions are masked. As a result,
most repetitive regions are not assembled. As illustrated in Figure 1.1, a
region with high frequency reads is masked. Methods will filter out these
sequence reads, because this region is regarded as a potential repetitive re-
gion. Consequently, an assembly result with gaps will be generated.
Gap
Original sequence Assembly result
Masked region
: Sequence read
Assemble
Figure 1.1: Repetitive DNA sequence makes gap.
Repetitive DNA sequence makes the assembly of whole genome sequence
shorter than the reference genome. This fact was reported by [8], that re-
cently assemblies using NGS data were much more fragmented than refer-
ence assembled years ago. The reason is that most repetitive DNA se-
quence are longer than sequence read and current assembly has limitations
to tackle this issue. For example, de Bruijn graph based methods break
down sequence read into , which are substrings of sequence read
in length k. And use to generate de Bruijn graph in which
4
are connected by overlapping bases. Based on the path of
de Bruijn graph, we can deduce the DNA sequence. However a repetitive
DNA sequence will present branches or loops in de Bruijn graph. This
forces assemblers to have strategies to choose a path in de Bruijn graph.
As illustrated in Figure 1.2, a repetitive sequence is broke down into
to , and then constructed into a de Bruijn graph. As we can see,
it is hard to find a path in this graph to assemble back the original se-
quence, as there are loops and direction choice problem. In Figure 1.2, the
sequence, TTCTGAGGCTGACCCTGAAA, is constructed into a de
Bruijn graph with 3-mers. This sequence contains a repeat of which the
pattern is CTGA. For de Bruijn graph based assembly methods, it is con-
fusing to select the correct path. The erroneous path selection will assem-
ble a short sequence, for example, TTCTGAAA. For a large genome, re-
petitive DNA sequence will make de Bruijn graph even complicated.
TTC TCT CTG TGA GAA AAA
GCT GAG
CCT GAC
CCC
GGC
ACC
AGG
Original Repetitive sequence: TTCTGAGGCTGACCCTGAAA
Figure 1.2: Repetitive DNA sequence in de Bruijn graph based methods.
Repetitive DNA sequence makes assembly result erroneous in resequenc-
ing project. Resequencing project is to assemble a DNA sequence of
which the reference sequence is available. This means, for example, when
we assemble the repetitive DNA sequence part, we can refer to the refer-
5
ence sequence and “copy and paste” the repetitive DNA sequence part into
assembly result. By this way, final assembly result will not differ much in
length comparing to reference sequence. However, variation information,
such as SNP (Single Nucleotide Polymorphism) is lost in this procedure.
As illustrated in Figure 1.3, when there are loops in the de Bruijn graph,
we can search available repeat database then “copy and paste” suitable re-
petitive DNA sequence into assembly result. This makes up the issue of
short assembly, but it will generate a fake result which has no real varia-
tion information.
Repeat Reference database:
…CTGAGGCTGACCCTCA
... TTC TCT CTG TGA GAA AAA
GCT GAG
CCT GAC
CCC
GGC
ACC
AGG
Original Repetitive sequence: TTCTGAGGCTGACCCTGAAA
TTC TTCLeft flank
Rightflank
Search
Assembly result of Repetitive sequence: TTCTGAGGCTGACCCTCAAA
SNP
Figure 1.3: Repetitive DNA sequence makes issue of losing information
Assembly of repetitive DNA sequence cannot be fully solved by paired
end reads. Paired end reads are a pair of reads with a known distance be-
tween them. The structure of paired end reads is
. Some assembly methods will use this structural feature to de-
duce repetitive DNA sequence. Based on the known distance between
head read and tail read, if one read located in unambiguous region and the
other read located in repeat region, then some nucleotides in repetitive re-
gion can be deducted. If there are enough paired end reads covered in a re-
6
petitive region, theoretically this region can be correctly assembled. How-
ever, there are many regions in a genome share significant similarity. In
this case, assembly methods are confused of where paired end reads actu-
ally come from. For example, two repetitive regions are significantly simi-
lar but in different locations. For assembly method, it is hard to tell how
many regions should be assembled, because the paired reads seem to come
from one region. In fact, transposable elements are widely distributed in
genome and they are important part of genome sequence, for example
there are about 85% transposable elements in maize genome.
According to the issues listed above, it is a trade-off between correct assembly
and complete assembly. If a method wants to assemble a complete sequence of a ge-
nome, it must assemble repetitive regions. However in existing models, to assemble
repetitive regions always comes with errors. Most existing assembly methods are ge-
nome wide assembly oriented. In other words, they are trying to correctly assemble
as much sequence in genome as possible. So they bypass the issues of assembling
repetitive regions. In this thesis, we are motivated to tackle the issues of assembly
repetitive DNA sequences.
There are different types of repeat sequence in genome. Each type of repeat se-
quence has different problems to assembly methods. Based on the issues discussed
above, there are number of important research questions that need to be address in
terms of different types of repeat sequence. We list our research questions as follows:
How to correctly assemble tandem repetitive DNA sequence? Tandem re-
peat is a sequence which has a pattern duplicates several times. The length
of repeat pattern can be longer than the length of sequence read. So one
sub-question is how to determine the number of times a pattern duplicates.
We studied the reference of assembled tandem repeats. It shows that a tan-
dem repetitive sequence does not duplicate its pattern in integer times. In
other words, it may stops repeating at any position where is possible in the
middle of the pattern. Moreover, there could be variation, mismatch and
indel between patterns. To correctly assemble a tandem repeat, we should
7
also assemble its flank sequences at both ends. So another sub-question is
how to recover the flank sequences of a tandem repeat.
How to correctly assemble DNA transposon? DNA transposon is one of
the transposable elements. It follows a “cut-and-paste” mechanism to
move in the genome. The sequence of DNA transposon has a unique struc-
tural feature. There is a pair of inverted repeats at both ends of a DNA
transposon. One sub-question is how to identify sequence reads which are
related to DNA transposon. Assume the sequence reads which are related
to a DNA transposon are identified, there is still problem of bad quality
reads which lead to bad assembly result. So another sub-question is how to
filter out bad quality reads and assemble DNA transposon properly.
How to correctly assemble retrotransposon? Retrotransposon is one of the
transposable elements. In detail, retrotransposon has two types, i.e. LTR
type retrotransposon and non-LTR retrotransposon. Retrotransposon fol-
lows a “copy and paste” mechanism to move in the genome. For a re-
trotransposon, it could have hundreds of copies widely distributed in a ge-
nome. So, one sub-question is how to assemble these significantly similar
copies in different locations. Two types of retrotransposon are different in
structural feature. Specifically, there is no long terminal repeat in non-LTR
type retrotransposon. So, another sub-question is how to use structural fea-
ture to assemble both types of retrotransposon.
1.4 Overview the work and its contribution
To address the issues and research questions listed above, we proposed novel
methods of assembling repetitive DNA sequence in this thesis.
Firstly, to address the problem of assembling tandem repetitive sequence, we pro-
pose tandem repeat assembler (TRA), a novel method of assembling tandem repeats.
We pre-process the paired end reads which are possible to merge into a long read
from and . From all processed reads, we select out the candidate
reads which are related to tandem repeat. A greedy algorithm then will be applied to
candidate reads to get a draft assembly. All candidate will align to the draft assembly
8
and a majority vote method will be applied to update the result. This dynamic result
will end until a static result generated. Using an experimentally acquired data set for
human chromosome 14, tandem repeats >200 bp were assembled. Alignment of the
contigs to the human genome reference (GRCh38) revealed that 84.3% of tandem
repetitive regions was correctly covered. For tandem repeats, this method outper-
formed state-of-the-art assemblers by generating correct N50 of contigs up to 512 bp.
Secondly, to address the problem of assembling DNA transposon, we propose
DTA, a novel method of assembling DNA transposon. We studied the structure of
DNA transposon and propose a novel method of selecting pairs of head seed and tail
seed using dot matrix. Then we propose a quality based extension method to connect
head seed to tail seed. We evaluate our method on two species, i.e. human chromo-
some 14 and maize chromosome 8. In the measurement of precision, recall and f-
value, our method is superior to previous methods. We also investigate the distribu-
tion of our assembled DNA transposons.
Thirdly, to address the problem of assembling retrotransposon, we propose RA, a
novel method of assembling retrotransposon. Based on the structural feature, we pro-
pose two novel methods to select pairs of seeds for LTR type retrotransposon and
non-LTR retrotransposon respectively. We also propose a recursive method to fill up
the gap between the seeds. We evaluate our method on a benchmark dataset and it
shows that our method outperforms state-of-the-art methods. In addition, we use our
RA method assemble a genome wide dataset which contains reads of nucleolus asso-
ciated domains of human genome. We conducted several analysis on assembled re-
sult and it shows that our RA method is able to assemble retrotransposon in genome
wide scale.
1.5 Thesis organization
This section presents the overall organization of this thesis. As the objective of
this thesis is to address the problems of assemble repetitive DNA sequence, the con-
tent of each chapter is organized as follows:
Chapter 2: This chapter presents a comprehensive survey on sequencing
technology, sequencing reads and assembly methods. We reviewed as-
9
sembly methods in two types, i.e. overlap layout consensus based method
and de Bruijn graph based method.
Chapter 3: We propose a method of assembling tandem repetitive se-
quence. We present our method on how to use pair end reads to correctly
assemble tandem repeats. The method follows a dynamic process which
will stop until a static result assembled.
Chapter 4: We propose a method of assembling DNA transposon. We pre-
sent the methods in two parts, i.e. dot matrix based DNA transposon fea-
ture finding algorithm and quality based extension algorithm.
Chapter 5: We propose a method of assembling retrotransposon. We pre-
sent the methods in three parts, i.e. LTR type retrotransposon feature find-
ing algorithm, non-LTR retrotransposon feature finding and recursive gap
filling algorithm.
Chapter 6: This chapter summarizes the research in this thesis and suggest
the potential works in the future
10
Chapter 2
2 Literature Review
In this Chapter, we present the review of sequencing technology, sequence data
and assembly methods. The review of sequencing technology revels the reason why
repetitive DNA sequence causes issues in DNA sequence assembly. Each sequencing
technology generates unique type of sequence data which is referred as read. Reads
from various technologies differ in length, read quality, read direction and read depth.
Evolution of sequence read drives the development of assembly methods. In this
chapter, we first review sequencing technologies; second we review the data in my
research and finally present existing assembly methods.
2.1 Review of sequencing technology
DNA (Deoxyribonucleic acid) was presented as genetic material in 1944 and its
double helical strand structure discovered in 1953. Features of living beings, such as
eye color (phenotype), are determined by specific information on DNA (genotype).
To understand the nature of organisms, genetics are dedicated to reveal these coding
information. DNA is consisted of 4 basic nucleotides, adenine (A), guanine (G), cy-
tosine (C) and thymine (T). Determining the sequence of DNA in an organism is the
first step to do downstream biological analysis, such as finding pathogenic genes.
Since the first gene, i.e. Bacteriophage MS2 coat protein, was sequenced in 1972 [9],
biologists have been eagerly developing sequencing technology. To date, next gener-
11
ation sequencing (NGS) technology can sequence a whole human genome in one
week.
Sequencing technology provides the order of nucleotides for a DNA sequence.
Thanks to Sanger’s method proposed in 1977 [1], Homo sappiness’ genome se-
quence issued in 2003. Technically, it took 3-4 years and 300 million dollars se-
quencing 3 billion base pairs of human genome with read length of ~900 bases [2].
Despite of the successful application of Sanger’s method, great cost of time and
money limited development of genetics. With advanced biochemistry technology
incorporated, new methods were invented from 1977, including whole genome shot-
gun sequencing [10], hierarchical shotgun sequencing, mixed strategy sequencing,
reduced representation sequencing [11], EST sequencing [12], Pyrosequencing [13],
sequencing-by-synthesis [14], sequencing-by-ligation [15], semiconductor-non-
optical sequencing [16], single molecule real time sequencing [17] and other tech-
nology being developed now.
2.1.1 Description of main sequencing technologies
However, relatively long reads and high accuracy make Sanger’s method combined
with shotgun sequencing dominate in sequencing area nearly 20 years until 2004.
This sequencing technology is regarded as first generation sequencing which is still
served as a “gold standard” in genome projects [18]. While massive parallel sequenc-
ing technologies emerged from 2004, represented by Roche 454, Illumina GA and
ABI SOLiD, they yield high-throughput reads greater than Sanger’s method by sev-
eral orders of magnitude [19]. Fast sequencing speed, low cost on per base, as well as
high-throughput of reads, advance sequencing technology into Next Generation. It
gives the opportunity of sequencing thousands of large genomes for human, animals,
and plants [20, 21, 22]. Table 2.1 lists the details of main sequencing technologies.
From the assembly point of view, these sequencing technologies differ in:
The size of inserts
Whether the sequencing was uni or bi-directional
Library construction
Cloning vector
12
Selection of clones to be sequence
Number of DNA molecules sequenced in parallel
Length of read
Quality information for each base
Table 2.1: Description of main sequencing technologies.
Year Sequencing Technology Description
1977
Sanger’s method Sanger’s method is based on chain termination
method. In a sequencing reaction, a primer is
combined with a fragment of single-strand DNA.
After polymerase adding to the mix, dideoxynu-
cleotide triphosphates (ddNTPs) and deoxynu-
cleotides (dNTPS) will synthesize single-strand
DNA. When a ddNTPS is added to the strand,
synthesis will stop. And label on it will tell the
synthesis position. After gel electrophoresis and
autoradigraphy, fragments will be put in four
lanes to get final order.
1979 Shotgun sequencing Shotgun sequencing randomly shears the ampli-
fied DNA sequenced into fragments. Average
number of reads cover the given nucleotide is
predefined, to make sure all nucleotides in DNA
sequence will be sequenced. After the fragments
are sheared into suitable length, sanger’s method
will read these fragments [23].
2004 Roche 454’s technology Roche 454 uses pyrosequencing technology [24].
First, DNA library with 454-specific adaptors
denatured and combined with beads. After emul-
sion PCR amplification [25], synthesis will
begin. Specific dNTP will complement to the
13
single-strand DNA with ATP sulfurylase, lucif-
erase, luciferase, luciferin, polymerase and
adenosine 5’ phosphosulfate. When synthesis is
processing, pyrophosphate will be released then
transformed to ATP. ATP drives luciferin into
oxylucierin, which yields visible light [26]. By
telling different light, nucleotides can be deter-
mined.
2006 Illumina’s technology Illumina uses sequencing by synthesis technolo-
gy [27]. DNA library with adaptors attached to
the flowcell. Given Bridge amplification, cloned
DNA fragments form clusters. By adding linear-
ization enzyme, four types of nucleotide with
fluorescent dye will complement to the single-
strand DNA. The color of fluorescent dye will
then be captured by a charge-coupled device, to
tell the specific nucleotide.
2007 ABI SOLiD technology ABI SOLiD uses sequencing by ligation tech-
nology [28].First, DNA library with adaptors is
amplified by PCR, then captured by beads. Af-
terwards, an 8 base-probe will move alone DNA
fragment. When base-probe complementary to
the fragment, the last 3 nucleotides with fluores-
cent dye are cleaved and system will capture the
color of 4th
base. Then base-probe moves on
from the cleavage site, same routine will occur.
After 5 rounds moving, colors of all bases are
captured, and sequence can be deduced.
2009 PacBio technology PacBio’s technology makes polymerase bound to
DNA library. And these DNA templates are im-
mobilized on a 50 nm-wide wells, which block
14
light from fluorescent dye but allow energy pen-
etrated. When synthesis is processing, energy
will excite the fluorescent on nucleotide. And a
specific pulse will be captured based on type of
fluorescent, so that the corresponding based can
be deduced.
2011 Ion Torrent technology Ion Torrent utilizes the protons released when
nucleotide synthesis [29]. First, DNA library
combined with primers immobilized on 3-micron
diameter beads, are amplified under emulsion
PCR. Then these templates are put into proton-
sensing wells. After sequencing beings, specific
nucleotide is incorporated and protons will be
released. Signal is subsequently detected to de-
termine the base.
2.1.2 Technical comparison of sequencing technologies
Next generation sequencing technologies were beginning with Roche 454. It
abandons bacterial cloning step replaced by PCR amplification [30]. No matter what
specific base recognition method is adopted, analysis processes are massively paral-
lel. These features fasten the speed of DNA sequencing and beat Sanger’s method.
However, different base detecting methods have their own inherent limitations. For
example, Illumina’s sequencing technology would fail the detection of bases near 3’
and 5’ end of DNA fragments. And all next generation sequencing technologies gen-
erate shorter reads than Sanger’s method. Table 2.2 compares these sequencing tech-
nologies from the point of output.
15
Table 2.2: Comparison of sequencing technologies
Read
Length
Quality Yield
/Run
Time
/Run
Cost
/Mb
Paired
end
Sanger’s
method
400-
900bp
Mostly>Q35 84 Kb 3 hours $2400 Yes
Roche 454
GS FLX
700bp Mostly>Q30 0.7 Gb 24 hours $10 Yes
Illumina
HiSeq 2000
101bp Mostly>Q30 600 Gb 3-10 days $0.07 Yes
ABI
SOLiDv4
35-50bp Mostly>Q30 120 Gb 7-14 days $0.1 Yes
Ion Torrent
PGM
200-
400bp
Mostly Q20 1 Gb 2 hours $1 Yes
PacBio
RS
Up to
2200bp
< Q10 100 Mb 2 hours $2 No
2.2 Review of data in sequence assembly
From the perspective of repetitive sequence assembly, we are interested in the
type of data which are generated from sequencing technology. Also, we are interest-
ed in repetitive sequence data which are available to be utilized in sequence assembly.
Sequence data is also referred as sequence reads which are consisting of 4 nucleo-
tides (Adenine, Guanine, Cytosine and Thymine).
2.2.1 Types of reads
There are three types of reads, i.e. single end reads, paired end reads and mate pair
reads. Generation processes are demonstrated in Figure 2.1-2.3. Due to different se-
quencing technologies used, these reads are different in terms of single or pair, read
length and read orientation.
16
1. Single End reads: In the demonstration of Figure 2.1, adaptors (A1, A2) are
ligated to the fragmented sample DNA. After amplification, clusters of DNA
fragments are generated. Nucleotides will be read from primer sites SP1. Due
to the limitation of sequencing technology, it is only able to sequence the first
few nucleotides of a DNA fragment. For example, you have a DNA fragment
sample construct with average 500 bp. For single read output, you may only
get first 100 bp of each DNA fragment.
Figure 2.1: Generation of single end reads [31].
2. Paired End reads: Paired end reads are two Single End reads coming from
both end of a DNA fragment, on different strand. In the demonstration of
Figure 2.2, adaptors (A1, A2) are ligated to the fragmented sample DNA. Af-
ter amplification, clusters of DNA fragments are generated. Nucleotides will
17
be read from primer sites SP1 and SP2. The orientation of these two reads is
FR (Forward Reverse) where forward read is from 5’ to 3’ and reverse read is
from 3’ to 5’.
Figure 2.2: Generation of paired end reads.
3. Mate Pair reads: Mate pair reads are two single end reads coming from both
end of DNA fragment on different strand, but with longer insert length than
paired end reads. In mate pair sequencing, fragmented sample DNA is end-
biotinylated to tag the mate pair section. After self-circularization and re-
fragmentation, DNA fragments with biotin tag are amplified. Followed by
18
similar procedure of paired end sequencing, mate pair reads will be generated.
The orientation of these two reads is RF (Reverse Forward) where reverse
read is from 3’ to 5’ and forward read is from 3’ to 5’.
19
Figure 2.3: Generation of mate pair reads.
2.2.2 Repeat reference
In thesis, we address the issues of how to correctly assemble repetitive DNA se-
quence. So we are interested in the sequence data of available repeat sequence. Most
available repeat references were identified in an assembled DNA sequence. And the-
se information help the research for assembly of repetitive DNA region, but available
resource is limited. In 1961, first satellite DNA was discovered [32] and later trans-
posable elements were discovered in Maize genome [33]. At that time, DNA was
merely regarded as macromolecule of heredity. Finally in 1968, researchers recog-
nized that DNA not only contains genes, but also repetitive elements [34]. All these
discoveries revolutionized the knowledge against non-coding part of DNA sequence.
To date, it is widely accepted that repetitive sequence is associated with human dis-
eases [35], genome evolution [36], gene evolution [37], and gene expression [38].
In terms of length and mobility, repetitive DNA sequences are classified into two
categories, i.e. tandem repeats and interspersed repeat. Table 2.3 gives the descrip-
tion of repetitive DNA sequences. DNA transposon is not included in interspersed
repeat category. Comparing to retrotransposon, DNA transposon differs in the mo-
lecular mechanism of formation, but not in mobility. Similar reason to segmental du-
plications and other classes, they are not included in this table either.
Table 2.3: Description of repetitive DNA sequences.
Tandem
Repeat
Microsatellite Microsatellite is also known as simple sequence
repeat. It is repeating sequence of 1-10 base pairs.
The pattern is repeated and repetitions are directly
adjacent to each other.
Minisatellite Minisatellite is also known as variable number tan-
dem repeat. It is repeating sequence of 10-60 base
pairs. The pattern is repeated and repetitions are
directly adjacent to each other.
Satellite Satellite is repeating sequence of long length. The
20
pattern is repeated and repetitions are directly adja-
cent to each other.
VNTR VNTR is variable number tandem repeat. The
length of repeat always shows variation between
individuals.
Interspersed
Repeat
SINE
SINE is short interspersed nuclear elements. They
are normally repetitive DNA sequences of 100-300
bp and spreading throughout genome.
LINE
LINE is long interspersed nuclear elements. They
are normally repetitive DNA sequences of more
than 300 bp and spreading throughout genome.
LTR
LTR is long terminal repeats. They are duplicated
hundreds of thousands times and located at end of
retrotransposons.
Table 2.4: Databases of repetitive DNA sequence.
Database Organism Classification Scale Availability
TIGR [39] Plant Tan. & Inter. 12 http://plantrepeats.plantbiology.msu.edu
/index.html
PSSRDb [40] Prokaryotes Tandem 85 http://pssrdb.cdfd.org.in
SSRD [41] Human Tandem 1 http://www.ccmb.res.in
TRDB [42] Mix Tandem 22 https://tandem.bu.edu/cgi-
bin/trdb/trdb.exe
MIPS [43] Plant Tan. & Inter. 11 http://mips.helmholtz-
muenchen.de/plant/recat/
TREP [44] Plant Tan. & Inter. 2 http://wheat.pw.usda.gov/ITMI/Repeats/
PlantSat [45] Plant Tandem 23 http://w3lamc.umbr.cas.cz/PlantSat/
Repbase [46] Mix Tan. & Inter. 31 http://www.girinst.org/repbase/update/in
dex.html
The significance of repetitive DNA sequence drives the development of repeat
finder algorithm. Served as the most widely used program, RepeatMasker identifies
repeats by comparing query sequence to reference repeat database [46]. However,
21
the necessary of available repeat sequence limits the capability of reference based
methods. Instead, ab initio algorithms try to find repeat sequence without reference,
such as ReAS [47] and RepeatScout [48]. For more information about repetitive se-
quence discovery methods, please refer to these review papers [49, 50, 51].
Public available repeat databases are listed in Table 2.4. In Table 2.4, Classifica-
tion indicates if a database contains tandem repeat or interspersed repeat. Whereas
Scale indicates the number of species a database includes. GenBank [52] and ENA
such databases are not included, as they are not dedicated to repeat sequence.
2.3 Review of assembly methods
Assembly is the process of merging all reads back into complete sequence. It is
the fundamental step to undertake downstream biological analysis, such as polymor-
phism discovery, genome wide association. The necessary of assembly, makes it
bind with sequencing technologies and dependent on specific reads they produce.
Moreover, repeat sequence in DNA makes assembly complex, even a dilemma when
the length of read is too short. Several papers dedicated to assembly methods were
published [8, 53, 54, 55, 56, 57]. Except performance comparison, one fact they pre-
sent is, the result sequence assembled by these methods is shorter than it supposed to
be.
In this section, assembly methods based on next generation sequencing data will
be presented, as well as the issues in this area. Depending on the availability of refer-
ence genome, assembly algorithms can be classified as either reference-guided as-
sembly or de novo assembly. In this thesis, assembly algorithms are grouped into 3
categories:
Overlap-Layout-Consensus (OLC) model based method
De Bruijn graphs (DBG) model based method
Greedy extension based method
In this section, assembly algorithms will be reviewed in 2.3.1. Representative of
each category will be discussed in section 2.3.2 (OLC model based method) and sec-
tion 2.3.3 (DBG model based method). The measurement of assembly algorithm will
22
be presented in 2.3.4. And issues of current assembly algorithms will be discussed in
2.3.5.
2.3.1 Assembly Methods overview
The history of assembly algorithm is profound influenced by sequencing technol-
ogy. Following the birth of Sanger’s sequencing, in 1980, the model utilizing over-
lap-layout-consensus (OLC) to assembly sequencing reads was first proposed [58].
As shotgun sequencing widely applied later, sophisticated assembly algorithms were
developed based on OLC model, such as ARACHNE [59], Atlas [60], Celera As-
sembler [61], CAP3 [62], Euler [63], Newbler [64], PCAP [65], Phusion [66], Phrap
[67] and RePS [68].
During this time, an alternative method was proposed in 2001, known as Eulerian
path [69]. However, it was not valued during the age of Sanger’s sequencing. With
the advent of next generation sequencing (NGS) in 2004, short reads make OLC
model perform not as well as it handles long reads. Instead, Eulerian path approach
takes advantage of short read, which makes it well suited in NGS assembly. Thus,
novel methods were proposed based on de Bruijn graph model (DBG), which is first
adopted by Eulerian path approach. These include EULER [70], ALLPATHS [71,
72], ALLPATHS-LG [73], Velvet [74], ABySS [75], FragmentGluer [76], SO-
PAdenovo [77, 78] and Sparse bitmaps [79]. With the generation of graft genome of
panda [80] and cucumber [81], genome assembly with NGS reads culminated in a
new stage.
Unlike previous models, greedy extension based methods do not model sequence
reads into graph. Instead of regarding reads as vertices, greedy extension methods
regard reads as string and try to extend them greedily. These include SSAKE [82],
SHARCGS [83], VCAKE [84], QSRA [85] and SGA [86]. Other methods try to tan-
gle assembly from different point of view. For example, Ray [87] combines sequenc-
ing reads from Illumina and Roche 454 to reduce contig number and sequence error,
while Taipan [88] combines greedy extension and graph based method to optimize
assembly.
Table 2.5 gives an overview of assembly methods. It abstracts these methods in 7
properties, i.e. type, QC, PE, PE-C, reference, repeat and parallel. QC represents
23
whether a method adopt quality control. PE represents whether a method depends on
paired end read. PE-C represents if a method uses paired end read to construct con-
tics. Reference represents if a method is based on reference. Repeat represents if a
method has a dedicated module tackle repeats. Parallel represents if a method sup-
ports parallel computing.
Table 2.5: Overview of assembly methods.
Type QC PE PE-C Reference Repeat Parallel
ARACHNE OLC Yes Yes
Atlas OLC Yes Yes
Celera OLC Yes
CAP3 OLC Yes
Newbler OLC Yes
PCAP OLC
Phusion OLC Yes
Phrap OLC Yes
RePS OLC Yes
EULER DBG
ALLPATH DBG Yes Yes Yes
Velvet DBG Yes Yes
ABySS DBG Yes Yes
SOAPdenovo DBG Yes Yes
SSAKE DBG
SHARCGS DBG
VCAKE DBG
QSRA DBG Yes
SGA DBG Yes
Ray Other Yes Yes
Taipan Other Yes
24
Regardless of data preprocessing, repeat assembly and other concerns, whole ge-
nome sequence assembly follows common 3 steps:
1. Contig construction: Contig is the continuous DNA sequence built by the
overlap reads. This always refers to the long unambiguous sequence.
2. Scaffolding: Scaffolding refers to the determinant of orientation and orders
of contigs [89, 90, 91]. This process can be seen as the linkage of contigs,
which is called scaffold as well.
3. Gap closure: Gap closure tries to fill the gaps in the scaffolding between
contigs. This process attempts to complete whole DNA sequence [92, 93].
2.3.2 Overlap-Layout-Consensus Based methods
Normally, Overlap-Layout-Consensus based methods follow 3 steps.
1. Overlap step: all-against-all, pair-wise reads comparison will be adopted.
This process depends on the minimum number of overlap bases and simi-
larity between the overlap bases. For similarity, the number of mismatch
and indel are taken into consideration. When assembly of long reads, reads
related to repeat sequence are always masked during this procedure.
2. Layout step: based on the overlap information of step 1, reads regarded as
vertices are connected. Sequence errors are usually corrected in this stage.
After this, an overlap graph will be drew.
3. Consensus step: based on the graph generated in step 2, multiple sequence
alignment is adopted and determines the Hamiltonian path of the layout
graph. Hamiltonian path tries to visit each node in graph exactly once. To
merge all nodes in this path, one can get a continuous sequence.
In Figure 2.4, we demonstrate the process of OLC based method. In this case, a 32
bp DNA is sequenced. The read length is 10 bp, and the minimum number of overlap
is set as 5 bp. All 6 reads are aligned under the original DNA sequence, and OLC
graph is drafted based on the overlap information. The next step is to find a path, vis-
25
iting every node exactly once. In this scenario, result would be easily deduced:
R1>R2>R3>R4>R5>R6.
Figure 2.4: Illustration of OLC based method.
We reviewed the related works of OLC based method as follows:
ARACHNE
ARACHNE begins with the trimming procedure. Based on Phred quality, ex-
tremely low score bases are trimmed, while reads mainly consisting of low
quality bases are eliminated. And also, reads sequencing from bacterial host
and cloning vector are abandoned. After this, ARACHNE starts overlap detec-
tion between pair-wise reads. Instead of all-against-all reads comparing (
times comparisons), ARACHNE compares k-mer, which scales linearly. At
this stage, k is set as 24, and 24-mers with extremely high frequency are ex-
cluded, as they are potential repeats. Based on the overlap information of k-
mers, program identifies their corresponding reads. In this way, overlap of
reads will be deduced. Then, layout stage follows 3-step alignment module,
26
which is known as merge-extend-refine. It is followed by an error correction
step in which majority rule applied. Between overlap reads, a penalty score is
calculated based on the discrepant base, its quality and its flanking bases. This
comes to an overall penalty score to the alignment. With penalty score, final
layout is generated and contigs can be easily merged. Scaffolding procedure is
straightforwardly linking two contigs, by plasmid reads. For gap in the scaf-
fold, ARACHNE tries to find a repeat contig, which does pairwise overlapping
the contigs in scaffold.
Experiments: Simulated reads were generated, providing 10-fold coverage of
genome H. influenza, S. cerevisiae, D. Melanogaster and human chromosomes
21 and 22. The assembly of D. melanogaster made a ~98% coverage with N50
contig length of 324 Kb and N50 Scaffold length of 5143 Kb. Small errors
were detected almost 1 per 1 Mb.
Atlus
Atlus combines reads generated from BACs and Whole-genome shotgun
(WGS). Reads from BACs offer localization assembly, which reduce computa-
tional power; while reads from Whole-genome shotgun provide different insert
sizes, which benefits scaffolding. Assembly process begins with trimming. It is
notable that trimmed reads are only used to compare but not assembly. Poten-
tial repeat reads are discarded during pairwise overlap, by setting maximum
frequency as 12. K-mer value is set as 32-mer, and maximum mismatch be-
tween any 32-mer is set as 3%. All BAC reads and WGS reads are compared,
and only 6 WGS reads were kept by computing the weight. After this, overlap
graph is generated. In the gap closure step, WGS read with different insert
length was utilized to link scaffolds.
2.3.3 De Bruijn Graph based methods
The core of de Bruijn graph based method follows 4 steps.
27
1. All substring of length K is derived from sequencing reads. This is known
as . Based on the frequency of , reads with high frequen-
cy can be masked during this procedure.
2. De Bruijn graph is generated based on all . Each is re-
garded as vertex. Vertex i and j are connected by a directed edge (i, j), if
there is in DNA sequence that contains i as prefix and j as
suffix.
3. Based on the de Bruijn graph, unique path is merged into unambiguous se-
quence. Unambiguous sequence combined with remaining part of de
Bruijn graph, forms a repeat graph.
4. Based on the repeat graph, traverse algorithm will be applied to find an Eu-
ler path. Euler path tries to visit each edge in the graph exactly once. Then,
merging nodes in this path will form a long continuous sequence.
Figure 2.5: Demonstration of de Bruijn graph generation.
28
In Figure 2.5, we present the process of de Bruijn graph generation. First, 3-mer is
generated from a DNA sequence. Then two 3-mers are connected if the suffix of one
3-mer is same as the prefix of the other 3-mer. A 3-mer graph will be generated after
connecting. Finally, nodes in the single path are merged into unambiguous sequence,
forming a de Bruijn graph [94].
We reviewed the related works of de Bruijn graph based method as follows:
ABySS
ABySS short for assembly by short sequence, a de novo genome assembly
method, which is de Bruijn graphed based. The innovation is from the per-
spective of computing efficiency. It adopts parallel computing techniques to
construct de Bruijn graph and use hash function to store sequence information.
The method contains 2 stages, in which at first they input sequences and con-
struct data structure to generate contigs; second they exploit paired-end in-
formation to do contig merging. Before stage 2, read errors, “dead-end”
branches in graph, are eliminated. The length of these branches is shorter than
a threshold, counted from the ambiguity point by tracing backwards from
their end. In stage 2, two contigs are deemed as linked only because more
than 5 paired-end read can be joined by both of them. Following, linked con-
tigs are merged by searching the de Bruijn graph in purpose to find a unique
path. 36-mer is the parameter they choose to generate the contigs.
Experiments: The genome of an African male publicly released by Illumina
were sequenced in 3.5 billion paired-end reads. By parallel assembling using
ABySS, 2.76 million contigs with more than 100 base pairs were generated,
which represents 68% of the reference human genome. N50 size is 1499 bp.
Velvet
Velvet begins with hash storing read and k-mer relations. The value range of
k is allowed from 21 to 25. As lower value of k, it would lead to increase the
complexity of de Bruijn graph. After generation of de Bruijn graph, two steps,
“simplification” and error removal are adopted. In simplification, unambigu-
29
ous path in the de Bruijn graph is merging into one node, which saves
memory space and calculation time. In error removal, 3 modules are function-
ing:
1. Removing tips: tip is a chain of nodes that is disconnected on one end.
This happens when k-mer inherits the sequencing errors from sequenc-
ing read. Simply removal these tips does not affect the connectivity of
graph.
2. Removing bubbles: a Tour Bus algorithm is adopted. Two paths are con-
sidered as redundant if they start and end in the same node. This is called
a bubble, which is created by errors or biological variants.
3. Removing erroneous connection: erroneous is removed based on a user
defined threshold. These connections do not create any loop and covered
by little reads.
After these processes, “Breadcrumb” is adopted to make contig merged
through repeat area. And this is achieved by mapping paired end reads.
Experiments: Velvet is tested on short read data sets whose read length rang-
es from 25-50 bp. With paired read only, up to 50-Kb N50 length was gener-
ated for simulation of prokaryotic data. For real data, velvet generated almost
8 Kb contig with a prokaryote and 2 Kb with a mammalian BAC.
2.3.4 Measurement
Gold standard is not available to measure the quality of a sequence assembler.
Based on the different projects, researchers defined their metrics from their own per-
spectives. From biological point of view, it expects sequence assembly method to
generate complete sequence. Whereas, from the computational point of view, it is
crucial to consider the memory cost, parallel independent and computational time.
The absent of gold dataset for test is another factor causing the difficulty of meas-
urement. When consider an assembly method, one cannot trust the quality of assem-
bling one species from the result of assembling another species [55]. Moreover, the
fast development of sequencing technology will yield novel type of sequence reads,
which makes corresponding evolution of sequencing assembly. It makes the proposal
30
of gold dataset impossible. However, there are still some common measurements
widely adopted. In Table 2.6, we present existing measurement of assembly methods.
Table 2.6: Measurement of assembly methods [95].
Metric name Units Description
N50 - A weighted median of the lengths of items. N50 was
calculated by ordering all contigs from the longest to
the shortest, then adding the length from the longest
until reaching 50% of total length of all contigs. The
last added length is the size of N50. With regard to
assemblies the items are typically contigs or scaf-
folds.
NG50 - Whereas N50 sets the median in relation to the total
length of all items in the set, we define NG50 to be
normalized by the average length of the a1 and a2
haplotypes instead of the total length of all sequences
as in N50; it is thus more reliable than N50 for com-
parison between assemblies.
CPNG50 bp Contig path NG50. The weighted median of the
lengths of contig paths.
SPNG50 bp Scaffold path NG50. The weighted median of the
lengths of scaffold paths. Scaffold paths represent
maximal concatenations of contig paths and scaffold
breaks that maintain correct order and orientation.
Structural
errors
Counts An error within a contig or scaffold. Errors include
intra- and inter-chromosomal joins, insertions, dele-
tions, simultaneous insertion, and deletions and in-
sertions at the ends of assembled sequences.
31
CC50 bp Correct contiguity 50. The empirically sampled dis-
tance between two points in an assembly, where the
two points are as likely to be correctly aligned as not.
Substitution
errors
Counts per
correct bits
Number of substitution errors per correct bit. Substi-
tution errors are columns in the alignment where the
a1 and a2 haplotypes contain either the same base
(homozygous) or different bases (heterozygous) and
the alignment contains a base (or IUPAC symbol)
different from either a1 or a2. The metric uses a bit
score to allow for IUPAC symbols.
Copy number
errors
Proportions For a given haplotype column in the MSA, the copy
number of the simulated genome can be described as
an interval (min, max). Assemblies with a copy
number outside of this interval are classified either as
an excess, for being above the interval, or a deficien-
cy, for being below the interval.
Coverage Percent The coverage is the percent of columns in the MSA
of the target (the whole genome, regions of a specific
annotation type, etc.) that contain positions of the
assembly.
Genetic
correctness
Percent The genic correctness is the percentage of base pairs
in spliced transcripts from the haplotype sequences
that align to the assembly with 95% coverage using
WU-BLAST.
E-sizes - The expected size of contig or scaffold of a location.
∑
Where the length of contig C, and G is the genome
length estimated by the sum of all contig lengths.
Others - Other assessment involves the investigation of ge-
nome feature, such as [96] examined the fraction of
32
certain genes, which presented in all genomes.
2.3.5 Issues of Assembly methods
1. Limited application to different sequencing read. DNA sequencing technolo-
gy affects assembly methods directly in read length, sequencing coverage and
base quality. Sanger’s long read is more suitable for Overlap-Layout-
Consensus based method than de Bruijn graph based method. Conversely,
next generation sequencing’s short read is more suitable for de Bruijn graph
based method. It is hard to see Sanger’s sequencing project to a 33× sequenc-
ing depth, as it generates redundant information for assembly methods. How-
ever, it is necessary for NGS technology having 33× sequence depth to assure
completely sequencing [97]. There is no model adapting the change of se-
quencing technology, though some attempts combined advantages of existing
models [88].
2. Noise data always exist. Especially for next generation sequencing, base call-
ing procedure is dependent on the color emit from fluorescent dye. In fact, the
location and shape of color determine the base call. However these infor-
mation are not always consistent, since the close deploy of flowcell. Wrong
base calls make sequence read erroneous.
3. Base quality information is not fully utilized. To measure correctness of base
calling, quality score is statically calculated. However, these information is
not fully utilized in sequence assembly method. Though several methods
adopted trimming process before assembly, others directly assemble using
raw reads without preprocessing. Except data preprocessing, quality infor-
mation is abandoned. However, it can be used as weight during overlap com-
parison.
4. Paired end reads and mate pair reads are only used for scaffolding. Some as-
sembly method only accepts paired reads [75]. But like others, they are only
used for contig linkage and gap closure. [98] used paired end reads to con-
struct contig, but only on small genome. The features of paired end infor-
mation are not fully reflected.
33
5. Inherent limitation of existing model. In Overlap-Layout-Consensus methods,
Hamiltonian path needs to be found after generation of layout graph. Howev-
er, finding a Hamiltonian path is NP hard, making it not suitable for compli-
cated graph [99]. For de Bruijn graph based method, the necessary of k-mer
generation leads to information loss of original sequence read. It is possibly
making a problem that path in de Bruijn graph is not supported by the under-
lying reads [100].
6. Repetitive DNA sequence problem. The existence of repeat makes sequence
assembly always bypass this issue. In Overlap-Layout-Consensus method, k-
mer frequency is counted to mask repeat related sequence. And those reads
are not included in layout graph so that subsequent contig loses repeat infor-
mation. In de Bruijn graph based method, traverse algorithm always stops at
loop path, facing the choice of direction, which loses repeat information.
Though NGS tries to compensate short length with high sequencing depth,
the fraction of problems caused by repetitive repeats is only depending on
read length [97].
7. No standard measurement and benchmark data set. Published assembly
methods are using either new simulated data or their own real sequencing da-
ta. It is hard to compare the performance of these methods. Different proto-
cols lead these methods data set sensitive. Longest length of contig cannot
distinguish superior performance.
34
Chapter 3
3 Innovative method of tandem repeti-
tive sequence assembly
3.1 Introduction
The next generation sequencing (NGS) technology, which provides high-
throughput sequence data at low costs, is a major development towards the under-
standing of genomic information at population level, enabling genomic investigation
of the diverse living species [101]. These genomic data can impact prominently on
medicine, agriculture and environmental conservation. NGS-derived data are dis-
played as numerous short sequence reads and the daunting task in interpreting NGS
data involves de novo assembly of these short sequences into contigs, scaffolds and
then the genome sequences, or comparative alignment of these short reads to known
reference genomes [102, 103]. In this process, one challenge to the sequence assem-
35
bler tools is the erroneous reads which require filtering and corrections. A second
and more daunting challenge is the large number of repetitive sequences in the reads
that complicate graph assembly by converging or collapsing in the path and are diffi-
cult to separate [5, 54, 95].
The eukaryote genomes are inflated by non-coding sequences in the forms of in-
tervening introns and intergenic sequences and, in early genetic studies, these ge-
nomic regions were largely grouped as `junk or selfish’ DNA [104, 104, 106]. The
human genome consists of approx. 25,000 genes which collectively occupy merely
approx. 1% of the entire genome [107], albeit recent studies showing low level tran-
scription of many non-coding genomic regions as regulatory, long non-coding RNAs
(lncRNA) in the nuclei [108, 109]. An important feature of these non-coding regions
is the abundance of repetitive elements of various lengths, forming more than 55% of
the human genome [110]. Among these repetitive sequences, the transposable inter-
spersed repetitive elements that are either DNA transposon or retroelements, that in-
cludes long terminal repeats (LTRs), long interspersed element (LINEs), and short
interspersed elements (SINEs), contribute hugely to genome size with tandem repeats
(i.e. satellite DNA, microsatellite and low complexity sequences) contribute to lesser
extent [110, 111]. However, tandem repeats account for 3% of the human genome
[112].
Unlike the genome-wide interspersed distribution of most transposable elements,
tandem repeats are found in 10-20% eukaryote genes and promoters [113]. In the
human genome, tandem repeats are found in 6% of the coding regions [114, 115].
Tandem repeats are highly unstable during replication giving rise to high frequencies
of repeat length polymorphism and this intimately affects gene expression and the
structure of the gene products with some causing variation in cellular phenotype,
evolution progression, or pathological development [113, 114, 116]. For example,
Huntington’s disease is caused by variation in tandem repeats [117]. Expansion of a
CAG repeat in the androgen receptor gene caused mutation leading to X-linked spi-
nal and muscular atrophy [118]. Besides monogenic diseases, tandem repeat poly-
morphism are also associated with polygenic diseases [119]. Somatic variations of
tandem repeats have been used as biomarkers for cancers [120]. However, the chal-
lenge in accurately detecting these repetitive sequences in the genome makes it diffi-
36
cult to investigate their disease-associated variations. Firstly, while NGS genomic
data can be obtained in high throughput, these sequence reads are short and often fall
inside repetitive regions during assembly causing short, error-prone scaffolds [121].
Omitting these reads in analysis leads to poor coverage of these genomic regions.
Secondly, PCR detection of tandem repeat regions is interfered by the generation of
stutter products [122].
The de Bruijn method is widely used in graph construction. It works by breaking
NGS sequence reads into k-mers as vertices or nodes and assembly of these k-mers is
directed by using k-1 bases as edges [102, 103]. Traverse methods are used to gener-
ate contigs from these complex graphs. K-mers derived from repeat-containing reads,
especially when the corresponding repetitive regions are longer than the k-mers, loop
structures or bubbles will form in the graph, confounding subsequent contig genera-
tion. For example, tour bus algorithm was applied in the Velvet method to merge
these bubbles by consensus but it embeds uncertainty and error information into the
consensus path [74]. In ABySS, bubbles are removed based on read coverage select-
ing path in the graph with maximum support [75]. The core consideration in these
strategies is to align paired-end reads back into the draft de Bruijn graphs seeking
alignment supports. Similar strategies are adopted in other methods [73, 94, 123]. A
common challenge to these methods is the over-simplification of repetitive regions
leading to uncertainty and errors. Assembler tools emphasizing on tandem repeats
have been developed but improvements in the assembly outcome were partial or
conditional [124, 125, 126].
In this chapter we report TRA (Tandem Repeat Assembler), a new method focus-
ing on the assembly of tandem repeats and evaluate this method using human chro-
mosome 14 as a benchmark dataset [54]. With regard to tandem repeat assembly, this
new method outperformed other state-of-the-art assemblers.
The rest of this chapter is organized as follows. First, we give definition to tandem
repeats and overview the method of TRA algorithm. Second, we present the details
of our method. Third, we evaluate our method by comparing to the-state-of-art meth-
od using benchmark dataset. Finally, we conclude this chapter.
37
3.2 Methods
3.2.1 Tandem repeat definition
If a repeat can be aligned to multiple locations in genome, then alignment scores
(Bit score) of these locations will be similarly high. We consider this repeat as a
transposon repeat. While a repeat can only be aligned to one location with high score,
we consider it as tandem repeat. So we aligned all tandem repetitive regions >202 bp
(downloading from TRDB), against chromosome 14 of human reference (GRCh38).
For each repetitive region, we calculate Δbs which reflects how much difference be-
tween its best alignment at one location and its second best alignment at another lo-
cation. Δbs = (Bit scorebest_align-Bit scoresecond_best_align)/Bit scorebest_align. The less Δbs
value is, the more locations a repetitive region can align to. Extremely, if the admis-
sion condition of tandem repeats is Δbs=0, then all repetitive regions are considered
as tandem repeats. However, since they have identical copies at different locations,
these repetitive regions actually should be considered as the transposon repeats. In
this thesis, repetitive region with Δbs >0.65 is defined as a tandem repeat.
The objective of this work is to assemble tandem repeat regions in which a repeat
motif, also known as a repeat pattern, is duplicated adjacently. We downloaded tan-
dem repeats found in human chromosome 14 from the Tandem Repeat Database [41].
1302 repetitive regions with motifs longer than 101 bp, were selected from a total of
27552 repeat regions and these were then aligned against chromosome 14 (GRCh38).
For each repetitive region, bit scores of its alignments were examined and, if the bit
score decreases > 65% between the first two best alignments at different locations,
we define this region as a tandem repeat. In Figure 3.1, how tandem repeat is distin-
guished from all other repetitive regions is illustrated. Based on these criteria, 169
repetitive regions are defined as tandem repeats, accounting for ~13% of the 1302
shortlisted repetitive regions (Figure 3.1).
38
Figure 3.1: Tandem repeats in human chromosome 14.
3.2.2 Algorithm overview
The TRA algorithm exploits paired-end reads with short insert size and repetitive
motifs to assemble tandem repeat genomic regions. The algorithm includes 4 steps.
In the first step, each pair of reads is merged into one read (Figure 3.2a). In the se-
cond step, all processed reads are aligned to a list of known repeat motifs extracted
from the genome and a read is selected as a candidate read for further processing if
its substring has k-bp overlap in length with any repeat motif (Figure 3.2b). Thirdly,
these candidate reads are used to generate draft contigs through overlap layout pro-
cessing. Reads that share the minimum threshold overlap of length n are merged and
the longest merged sequence is used as a draft contig (Figure 3.2c). Lastly, based on
this draft contig, all candidate reads are aligned simultaneously. By voting each base
from the aligned read, a new draft contig is achieved. Contig update continues till
voting result is stable (Figure 3.2d).
Figure 3.2 demonstrates an example of tandem repeat assembly. (A) Illumina
paired-end reads merging process. In this case, the read length is 100 bp covering
span of 170 bp. There is 30 bp overlap between Readforward and Readreverse. So they
are merged into one read with 170 bp in length. (B) Candidate selection. Given a
39
repeat motif, every read is compared with it. In the figuer, reads a, b and c are
selected out as candidates, since they have n bp overlap with the repeat motif. (C)
Candidate reads overlap layout and draf contig generation. Candidate reads are
merged into one if they have an overlap over a threshold. In this case, read a and b
are merged into one. (D) Contig generation. All candidate reads are aligning under
the longest draft contig, contig_1. Afterwards, new contig, contig_2 is generated by
voting each base from aligned reads. Then, contig_2 will server as the draft contig
aligned by all candidate reads again to get a new contig_3. This process will stop
until the contig_n generated is the same as contig_n-1.
40
Figure 3.2: Workflow of tandem repeat assembler.
41
3.2.3 Merge of paired-end reads
Illumina’s Paired-end reads provide 5’-3’ DNA sequences from both strands of
each DNA fragment. Therefore, each pair of read can be expressed as “Readforward-
gap-Readreverse”. Readforward is sequence read from the forward strand and Readreverse
stands for sequence read from the reverse strand. When a long DNA fragment is se-
quenced, sequences will only be generated from both ends leaving a gap which is not
sequenced. If the paired-end reads span sufficiently long from both ends to cover re-
petitive regions, these repeats can be correctly assembled after gap is filled. If the
insert size of paired-end reads is not twice as long as the read length, there will not
be a gap between the two reads and sufficient overlap may be found between the two
opposite reads. Insert size here is counted from the 5’ end of Readforward to the 5’ end
of Readreverse. TRA exploits overlap in each paired-end reads with short insert size by
merging “Readforward-gap-Readreverse” into Readmerged. Specifically, TRA first converts
Readreverse into its reverse complement and then aligns it with its paired Readforward. If
the tail region of Readforward overlaps with the head region of the reverse-
complemented Readreverse, TRA merges the two reads as Readmerged. Otherwise, the
“Readforward-gap-Readreverse” structure remains unchanged. After the paired-end reads
are thus processed, TRA selects candidate reads for tandem repeat assembly.
3.2.4 Candidate reads selection
In TRA, repeat motifs were used to select related candidate reads for subsequent
assembly of specific tandem repeats. In this study, tandem repeats on human chro-
mosome 14 were downloaded from the Tandem Repeat Database (TRDB). Using the
YASS method at its default setting [127], all tandem repeats were aligned to the hu-
man reference. Tandem repeats that aligned with similar bit scores to multiple loca-
tions of the human reference, which were likely to be transposons, were excluded.
For the rest of the tandem repeats, their motifs were used to select candidate reads.
We only used motifs that were longer than 101 bp so that these motifs cannot be
covered by any single sequence read and are problematic for most available assem-
blers.
42
Candidate reads are those which cover repeat regions. Given the sequence of a re-
peat motif, it was encircled by linking its head to its tail and then cut into 20-mers,
which are any substring of length 20 bp. All sequence reads were examined and
those containing any 20-mer of the motif were selected. Reasons for setting 20-mer
as the threshold rather than longer ones were firstly to retain sequence reads that con-
tain genuine variations as, with higher threshold, these reads risk to be omitted. Sec-
ondly, this also helps retain reads that only contain one or two erroneous bases.
Thirdly, some sequence reads only cover a small number of bases at either end of a
repeat region but they are also valid candidates conveying valuable information. Fi-
nally, repeat motifs were obtained from assembled genomes and their qualities can
vary. Setting a lower, 20-mer as a threshold is expected to allow candidate reads to
be selected even when the motif sequence has errors.
3.2.5 Contig generation method
After the selection of candidate reads, TRA assembles these reads back into com-
plete tandem repeats. For any set of selected candidate reads, these were firstly ar-
ranged in descending order following the number of bases that overlapped with the
repeat motif. From the most to the least overlapped reads, two reads are merged if
they exhibit > 20 bp overlap. This step is performed 5 times allowing from 0 to 4 in-
del in the overlap. The longest merged sequence serves as a draft contig under which
all candidate reads are simultaneously aligned. As the threshold base overlap for
alignment is set at 20 bp, candidate reads displaying < 20 bp overlap with the draft
contig are ignored. For the two types of reads, Readmerged and Readforward-gap-
Readreverse, different alignment strategies were applied. Each Readmerged is aligned un-
der a draft contig where it exhibits the most matching bases with the draft contig. For
a Readforward-gap-Readreverse, Readforwad and Readreverse are separately aligned under
the draft contig following the same strategy as for Readmerged. After alignment is
completed, all locations on the contig that are aligned with candidate reads are exam-
ined and, in three types of conditions, the candidates are omitted. First, the coverage
by Readforward and Readreverse spans more than 230 bp. Second, the aligning location of
a Readreverse is in front of the aligning location of Readforward. Third, only one read in a
43
pair can align to the draft contig. This can be only Readforward aligns to the head of the
draft contig or only Readreverse aligns to the tail of the draft contig.
After aligning candidate reads under the draft contig, a new contig was reached by
voting each of its bases from aligned reads. This new contig will then serve as the
draft contig for aligning all candidate reads. With the updated contig, candidate reads
that were not qualified in previous alignment, are re-examined based on voting activ-
ity. This dynamic updating of contig stops when base information ceases to change
in the voting result.
3.3 Result
3.3.1 Contig generation
An experimentally generated data set of human chromosome 14, as previously
used in evaluating the performance of assemblers [54], was used to evaluate the per-
formance of this new TRA method. Briefly, paired-end reads of 101-bp in length
with average 155-bp insert size were used in the assembly. Firstly, the two reads
from each paired-end reads are merged if their overlap exhibits less than 20% mis-
match (See methods) and, as a result, 90.4% of all reads were merged. The rest read
pairs lacked overlap due mainly to the error-prone 3’ sequences associated with the
Illumina platform and were left unprocessed. The 169 tandem repeats indentified on
chromosome 14 (> 101 bp in length) were used to select reads for constructing con-
tigs. This led to 169 contigs among which 167 contigs were > 200 bp with an N50
size of 562 bp (Table 3.1). N50 was calculated by ordering all contigs from the
longest to the shortest, then adding the length from the longest until reaching 50% of
total length of all contigs. The last added length is the size of N50. N90 is defined
similarly.
Table 3.1: Assembly of tandem repeats in human Chromosome 14.
Contigs To-tal
Contigs > 200 bp
N50 (bp)
N90 (bp)
Total length (bp)
Coverage
169 167 562 378 92399 bp 99.5%
44
3.3.2 Correctness of assembly
To determine whether the assembled contigs recovered the original tandem repeti-
tive regions, each contig was aligned to its corresponding original region, “Flankleft-
tandem repeat-Flankright”. The flank was 5 bp in length at each end along with tan-
dem repeats downloaded from TRDB [41]. A contig is considered to have recovered
its corresponding tandem repeat if it covers the “Flankleft-tandem repeat-Flankright”.
Among the 169 tandem repeat regions used to assemble the contigs, 141 were recov-
ered after aligning with the contigs. This recovering step did not ensure that these
recovered regions were correctly assembled. The accuracy of each base in these con-
tigs was validated by aligning these contigs to the human reference assembly
(GRCh38) and mismatch rate for each contig was generated. As allelic variations ex-
ist in the genome, some mismatches do not necessarily represent errors. Examining
the SNP database (dbSNP) in NCBI identified 1925546 active SNPs on the
107043718-bp human chromosome 14 (~1.8%). Therefore, a contig is considered
correctly assembled if it contains < 2% mismatches from the reference assembly. The
contigs are otherwise considered erroneous.
Table 3.2: Summary of contigs assembled by different assemblers for tandem repeats.
Assembler Correct Num
Error Num Total
Weak Num
Bad Num
Missing Num
TRA 123 46 20 24 2 Velvet 37 132 56 74 2 SOAPdenovo 38 131 35 94 2 SGA 28 141 55 84 2 ABySS2 64 105 25 78 2 Allpaths-LG 10 159 17 140 2 Bambus2 58 101 22 87 2 CABOG 58 101 15 94 2 MSR-CA 58 101 19 90 2
In Table 3.2, we summarized the contigs assembled by different assemblers for
tandem repeats. The threshold of match rate is set as 98%. Correct contig has >98%
45
match with reference. Erroneous contig consists of 3 categories, Weak contig, Bad
contig and Missing contig. Weak contigs aligned 98% against reference after trim-
ming either head or tail. Bad contigs have mismatch rate >2%. Missing contigs are
those shorter than 200 bp in length.
By these criteria, 123 of the 169 TRA-assembled contigs (72.3%) were correct
(Table 3.2).Twenty of the rest 46 contigs reached more than 98% match with the ref-
erence after end-trimming (weak contig) and the rest 24 contigs still exhibited less
than 98% match with reference sequence (bad contig). Two contigs were less than
200 bp and were considered failed assembly (missing contig). After excluding the
bad and missing contigs, this TRA method covered 84.3% of all tandem repeats on
chromosome 14 with a correct N50 of 512 bp (Table 3.3).
In Table 3.3, we summarized correct assemblies of assemblers for tandem repeats.
Total length of all tandem repeats is 92841 bp. For TRA, Only correct contigs and
trimmed weak contigs were counted. Correct N50 was calculated by ordering these
contigs from longest to shortest, then adding the length from the longest until reach-
ing 50% of 92841 bp. Correct N40 and correct N30 are defined similarly. For availa-
ble assembles, we obtained substring of their scaffolds, which aligned to reference of
corresponding tandem repeats. Still, only correct contigs and trimmed weak contigs
were used. “-” means failing at this assessment. It caused by the low coverage of
contigs correctly assembled for tandem repeats.
Table 3.3: Summary of correct assembly of assemblers for tandem repeats.
Assembler Corr. N50 (bp) Corr. N40 (bp) Corr. N30 (bp) Coverage
TRA 512 570 622 84.3% ABySS2 - 306 491 45.3% Allpaths-LG - - - 9.5% Bambus2 - 148 446 40.7% CABOG - - 426 37.5% MSR-CA - 64 487 40.3% SGA - - - 28.9% SOAPdenovo - - 57 30% Velvet - - 228 35.5%
46
To compare the performance of TRA with other assemblers in the analysis of tan-
dem repeats, the above specified tandem repeat regions were extracted from the
GRCh38 reference sequences and compared with scaffolds generated using the state-
of-the-art assemblers, including Velvet [74], SOAPdenovo [77], SGA [86], ABySS2
[75], Allpaths-LG [73], Bambus2 [128], CABOG [129] and MSR-CA[129]. In these
studies, scaffolds were assembled for whole chromosome 14 as detailed by Salzberg
et al. [54]. In this study, substrings in these scaffolds, that aligned to the correspond-
ing tandem repeats in the reference, were extracted. Examining the assembly accura-
cy of the above methods in these tandem repeat regions, revealed the best accurate
rate being 36.1% derived from the ABySS2 method with 64 contigs being correctly
assembled (Table 3.2).
In this new TRA tool, when an assembled contig display > 98% match with the
reference sequence, it is defined as a correct assembly taking into consideration of
SNP rate on chromosome 14. We then asked whether more contigs could be quali-
fied as correct contigs when the stringency is reduced by decreasing the threshold
match rates. The percentages of correct contigs were then estimated at a range of
match rates from 80% to 100% and the results are summarised in Figure 3.3. Match
rate refers to the percentage of matched bases in contig, comparing to reference.
When match rate threshold is 100%, it allows no mismatch in assembled contig.
Since a real dataset was assembled in our experiments, so it is unreasonable to accept
no variation. While if too many mismatch allowed, then the “correct contig” could be
false positive. When counting correct contigs, threshold of match rate is set as 98%,
to minimize the number of false negative and false positive. This is based on current
percentage of SNP in chromosome 14 (~1.8%). Though, TRA outperforms other as-
semblers whatever the match threshold is.
The method that was most sensitive to this change of match rate threshold was
SOAPdenovo which exhibited improvement in the percentage of correct contigs
from 23% to 28%. However, improvements in other methods were more trivial. TRA
remained the one single outstanding performer in this specific function. This again
shows the limitations of these prevailing assemblers in the assembly of tandem re-
peats from short sequence reads with the best performance being < 40% correct con-
47
tig assembly. TRA focuses on tandem repeat assembly and its accurate rate persisted
above >70% at high match rate thresholds.
Figure 3.3: Correctness comparison of contigs assembled by different methods for
tandem repeats.
3.3.3 Contig distribution and SNP verification
Mapping all correct contigs to reference human chromosome 14 sequence re-
vealed that the TRA-assembled tandem repeats aligned with regions throughout the
entire chromosome, except the chromosome head region (Table 3.4). The lack of
alignment with the head region was due to the lack of annotation in this 16-Mb end
region, accounting for 14.9% of the chromosome. However, the TRA-assembled
tandem repeats are densely mapped to the tail region of chromosome 14. While the
relationship between the TRA-assembled tandem repeats and the unannotated chro-
mosome head region cannot be determined without the reference sequence, it stresses
the importance in accurate assembly of repetitive sequences for constructing contin-
48
uous chromosome assemblies. The power of TRA in assembling contigs from tan-
dem repeats makes it a valuable tools to incorporate in the state-of-the-art assemblers
for scaffold construction.
Figure 3.4: SNP verification and potential new SNP loci.
In Table 3.4, the blue sequence is human reference while the black sequence is a
contig assembled by TRA. All red mismatches are verified SNPs in database of SNP
(dbSNP). Two green mismatch locations are potential new SNP locis. Alignment was
conducted by using YASS [127].
49
Figure 3.5: The distribution of correct contigs.
By comparing TRA-assembled correct contigs and the corresponding reference
sequence, a significant number of mismatches were recorded with positional infor-
mation for further analysis. As all bases in the correct TRA-assembled contigs were
supported by multiple aligning reads (see methods), these mismatches can potentially
represent genuine single nucleotide polymorphism (SNP). When the total of 261
50
mismatched locations were vetted with reference to the NCBI dbSNP, 80 locations
were found to be verified SNPs. This cannot be concluded for the rest 181 mismatch-
es. An example of SNPs in a TRA-assembled tandem repeat is shown in Table 3.5.
All correct contigs assembled by TRA are mapping to their locations at reference of
Chromosome 14. The left arc represents whole chromosome 14. While labels repre-
sent all correct contigs, forming the right arc. The no linking region at bottom corre-
sponds to 16 Mb unknown bases at the beginning of the chromosome. This graph
was drew by visualising software, Circoletto [133]. Based on the alignment, this tan-
dem repeat spans nucleotides 18967401 to 18967978 on chromosome 14. It contains
7 mismatches and 5 of these mismatches are verified SNPs in the dbSNP. The other
two mismatches could be potential SNPs identified using TRA. We also noted that
this region was not covered by assemblies derived from other assemblers.
3.3.4 Discussion
Tandem repeats are frequently found in gene promoters and coding regions and
polymorphism in these repeat regions can affect the structure, expression and func-
tions of the gene products and are associated with various diseases. The polymor-
phism of these repeats has been explored as powerful biomarkers for monogenic and
polygenic diseases but the extent of applications is hindered by the unique challenges
in accurately identify these repeats. The swift generation of massive sequence data
has revolutionized de novo and comparative genomic studies but the sequence reads
are short which make reconstruction of repetitive regions from these short reads a
daunting challenging. In this study, we have developed a new sequence assembly
tool specifically to circumvent these challenges.
TRA exploits the paired-end nature of NGS sequence reads and validated tandem
repeat motifs in reference sequences to assemble repeat-containing reads. The TRA
method was evaluated using the benchmark dataset of GAGE [54], through assem-
bling tandem repetitive regions on human chromosome 14. With this dataset, the
TRA method achieved a recovery of 84.3% of tandem repeat regions covered by cor-
rect contigs with N50 of 512 bp. This is significantly higher than the coverage
reached by currently available assembly methods among which the best coverage,
51
which was achieved with ABySS2, was 45.3%. A significant new feature that we
introduced to TRA was the pre-selection of sequence reads related to specific ge-
nomic regions to assemble these regions (See methods). Therefore, in assembling
each contig, the identities of the paired-end reads included were defined. This ena-
bles evaluation how reads quality affects contig accuracy. In this experiment, paired-
end reads of 101 bp were used with average 155-bp insert size. All read pairs are ex-
pected to exhibit overlap between the forward and reverse reads. In reality, only 90.4%
of the sequence pairs could be merged based on threshold overlap and those which
were not merged exhibited errors in 3’-end sequences.
In theory, increased inclusion of TRA-unmerged reads in any specific tandem re-
peat region is expected to reduce accuracy in the output contigs. Examining all the
contigs assembled with TRA for the percentage of merged reads (good reads) incor-
porated in output contigs, revealed that merged reads accounted for < 60% of all
reads in the majority of bad, weak and missing contigs (Table 3.6). In Table 3.6,
good reads rate refers to the percentage of merged reads in total reads used to assem-
ble each contig. Contig accuracy refers to the percentage of contig matching to refer-
ence. All contigs assembled by TRA are plotted in graph. The blue points are correct
contigs, >98% matching with reference; the light blue points are weak contigs, >98%
matching with reference after trimming; the red points are bad contigs, <98% match-
ing with reference; and pink points are missing contigs, <200 bp in length. All cor-
rect contigs are located at top of the figure. And most of them are in the region where
good reads takes more than 60%.
For the two tandem repetitive regions that TRA failed to assemble contigs, i.e.
missing contigs, only 10% of all covering reads were merged reads. This means that
a major parameter that affects the correctness of TRA-assembled contigs is the avail-
ability of good quality sequence reads that cover the contig regions. On the other
hand, TRA is robust in processing erroneous reads and is able to assembly quality
contigs from sequencing data that allows the generation of > 60% merged reads in all
covering reads for the contig regions. Improvement in sequencing technology would
be expected to raise the efficiency of all assembler tools including TRA.
52
Figure 3.6: Influence of reads quality on assembly.
Contigs assembly with TRA is based on alignment support from paired-end reads,
i.e. every base assembled in a contig represents the majority votes from several reads.
After aligning TRA-assembled contigs to the human reference, the position of each
mismatched base was recorded and vetted against valid SNPs in the dbSNP database.
This revealed that 30.7% of these mismatched bases were verified SNPs. As there is
no evidence that the rest of the mismatch bases are erroneous, the likelihood that the-
se represent SNPs cannot be excluded. In genetic terms, tandem repeat assembly is
particularly prone to losing valuable information of variations. TRA works at the
level of comparative assembly and information on repeat motifs in a reference se-
quence are essential parameters. Repeat motifs for various genomes can be extracted
from repeat databases, such as TRDB and RepBase [45].
Repeat motif also refers to sequence pattern or sequence consensus in these repeat
database. These motifs are identified based on draft genomes or Sanger’s long se-
quencing reads. For example, tandem repeats are identified by Tandem Repeats
53
Finder on assembled genome [130]. Similarly, RECON identifies repeat motifs from
a reference genome [131]. RepeatScout uses Sanger’s long sequence reads to obtain
repeat consensus [48]. Question remains how tandem repeats may be assembled de
novo from short NGS reads generated from new species genomes. In this regard, de
novo creation of repeat motif has been recently proposed using NGS data [132].
When repeat motif creation from NGS reads becomes possible, this TRA platform
can be applied in the discovery tandem repeats from new species without relying as-
sembled genomes.
3.4 Summary
This chapter proposed an innovative method of assembling tandem repeats. We
used repeat motifs from tandem repeat database to select all candidate reads. We then
assemble these reads back into complete tandem repeat. The contig generation is a
dynamic process which updates until a static result is generated.
We compared our TRA method with the-state-of-art methods. The comparison
showed that our TRA method outperformed other methods on a benchmark dataset
of Human chromosome 14.
54
Chapter 4
4 Innovative assembly method of DNA
transposon
4.1 Introduction
Repeat sequences are major components of Eukaryotic genomes. Some repetitive
elements are able to move within the genome by biological mechanism. We call the-
se repetitive elements as transposable elements. They make up a large fraction of
mammalian genome. For example, transposable elements make up 45% of human
genome, and up to 80% of maize. Normally, transposable elements are categorized in
two classes (Figure 4.1). Class I is referred as DNA transposons, they move from one
genomic location to another by a cut-and-paste mechanism. Class II elements are re-
ferred as retrotransposons, as they function via reverse transcription of an RNA in-
termediate. They move from one genomic location to another by a copy-and-paste
mechanism. In this chapter, we propose assembly method on DNA transposons and
in chapter 5, we will present assembly method on retrotransposons.
In existing assembly methods, de Bruijn graph based is most widely used. Velvet,
SOAPdenovo and ABySS2 are typical representatives of de Bruijn graph based as-
55
sembly methods. However, the process of generation de Bruijn graph needs to break
down all NGS reads into k-mers. Thus, the structural features of DNA transposons
are totally ignored. However we think structural feature is unique to DNA transposon,
therefore it cannot be abandoned during assembly process. For a DNA transposon, it
starts with a direct repeat from 5’ to 3’ and followed by an inverted terminal repeat,
while it ends up with an inverted terminal repeat followed by a direct repeat. This
feature can help assembler to find a pair of head part and tail part of the DNA trans-
poson. What we require is a method to fill up the gap between head part and tail part.
Transposase
ITR ITRDR DR
DNA transposons (Class 1)
TSD TSD
Retrotransposons (Class 2)LTR
LTR LTRGag Prt Pol Env
TSD TSD5' UTR
Retrotransposons (Class 2)Non-LTR
ORF 1 ORF 2 3' UTR
Poly-A
Figure 4.1: Types of transposable elements and their structures.
The primary contribution of this work is summarized as follows:
We proposed a dot matrix based DNA transposon feature finding algo-
rithm. This algorithm helps us to select out the pairs of seeds which are
head part and tail of DNA transposons. Furthermore, the use of structural
information provides a new direction in sequence assembly.
56
We proposed a read quality based matching algorithm. This method ex-
tends the head seed and fill up the gap between head seed and tail seed.
Furthermore, it decreases the bad effect of poor quality read to assembly
result.
We conducted a number of experiments to evaluate our DNA transposon
assembly (DTA) method. The results show that our method is superior to
the state-of-the-art methods for assembling DNA transposons.
The rest of this chapter is organized as follows. First, we give definitions to DNA
transposon from the perspective of structural feature, second we propose two meth-
ods to assemble DNA transposon, i.e. DNA transposon feature finding algorithm and
extension algorithm. Third, we evaluate our method by comparing to the-state-of-art
method using benchmark dataset. Finally, we conclude this chapter.
4.2 Method
In this section, we present our innovative algorithm for DNA transposon assembly.
This algorithm will utilize the structure feature of DNA transposon to select out the
pairs of seeds from NGS reads. Each pair of seeds contains a head part and tail part
of a DNA transposon. Then assembly process will combine a sequence quality based
method to gradually extend head seed to meet tail seed. Since this assembly process
is based on two methods, one is the method of seed selection, the other is the method
of quality based extension, we firstly present the method of seed selection, and then
quality based method of seed extension. Finally, we combine these two methods to
form our DNA transposon assembly algorithm.
4.2.1 Overview and basic terms
We formally define basic terms and overview the algorithm.
Definition 4.1 ( ). A short direct repeat sequence is a sequence of
nucleotides from 5’ to 3’, where nucleotide . If and are same,
then the length of is equal to and is same as , where
is the th nucleotide of . DNA transposon has two same , one is located at the
5’ of the sequence, and the other is located at the 3’ of the sequence.
57
Definition 4.2 ( ). An inverted terminal repeat is a
pair of sequences of nucleotides, where nucleotide . If two sequences
and are inverted terminal repeat, then the length of is equal to , and is
same as , where , and is the length of . For an , we denote
as the part located at 5’ of the sequence, and as the other part located at 3’
of the sequence. DNA transposon has a pair of and .
Definition 4.3 ( ). A sequence of nucleotides has a
DNA transposon feature if it has two parts where the part at 5’ has the structure of
and the other part at 3’ has the structure of , where is the op-
erator if , , then , and
.
DNA transposon assembly takes two phases (figure 4.2). Phase I will use NGS
data to select out pairs of seeds using feature finding algorithm, while phase II will
use seed cluster to assemble DNA transposon using extension algorithm. Specifically,
the model will take 4 steps to finish assembly.
1. Feature finding algorithm will scan NGS reads to select out pairs of seeds.
If one read has the feature of and another read has the feature
of , we select these two reads as a pair of seed, where the read
has will be the head seed and the read has will be the
tail seed.
2. For each head seed of a pair of seed, we will get a cluster of NGS reads
which are overlapped no less than 30 base pairs with head seed.
3. A quality based method will choose the most matched read from the clus-
ter to extend the head seed.
4. If the extended sequence is overlapped with tail seed, we will output this
sequence as a DNA transposon. Otherwise, we will get a cluster of NGS
reads which are overlapped no less than 30 base pairs with the extended
sequence. Then we go back to step 3, to use quality based extension meth-
od getting a further extended sequence.
58
It can be seen from the algorithm steps that the assembly process is relied on two
methods, i.e. feature finding method to select pair of seeds and quality based method
to extend the seed.
NGS data
Seeds
Extended
sequence
Feature
finding
algorithm
Extension
algorithm
ConditionDNA transposon Condition met
Condition is not met
Seed cluster
Phase I
Phase II
Figure 4.2: Overview of DNA transposon assembly method.
4.2.2 Seed selection
The seed selection method is utilizing the structural feature of DNA transposon,
i.e. and structure. To determine if two sequence reads have
DNA transposon feature, we will align them using a dot matrix. A dot matrix is con-
59
structed by writing two sequence reads to the top row and leftmost column of a two-
dimensional matrix. A dot is placed at any position where the nucleotide of that row
matches that column. If two same sequences are used to construct dot matrix, the
plotted dots at main diagonal will appear as a line. If two similar sequences are plot-
ted in a dot matrix, then the region that shares significant similarity will appear as a
line parallel to the main diagonal (Figure 4.3). Visually in a dot matrix, a mismatch
in a significant similar region will appear a blank spot off the line. Whereas, an indel
in a significant similar region will dislocate the line into two parallel lines. If the
amount of mismatch and indel is under a threshold which is subject to the length of
matched bases, we say two sequences are similar. How to use a dot matrix to align
two sequences and then determine if they are similar or not is presented in Table 4.1.
A T G C C G A T G A A T G T C
A • • • •
A • • • •
T • • • •
G • • • •
C • • •
C • • •
G • • • •
A • • • •
T • • • •
G • • • •
T • • • •
A • • • •
T • • • •
G • • • •
T • • • •
Figure 4.3: Dot matrix of two sequences.
60
Table 4.1: Dot matrix based alignment method.
1:
2:
3:
4:
5:
6:
7:
8:
9: {
10:
11: {
12: •
13:
14:
15: {
16: •
17:
18:
19:
20:
21:
22: }
23: }
24: }
25:
DNA feature finding algorithm is based on the dot matrix alignment. Given a
pairwise alignment plotted in dot matrix, if one sequence has the structure of
61
and the other one has the structure of , two regions that share
significant similarity will appear. Since is the direct repeat from 5’ to 3’, so one
region will appear as a line parallel to the main diagonal, whereas is the inverted
repeat, so the other region will appear as a line parallel to anti-diagonal (Figure 4.4).
Assume the plotted line parallel to the main diagonal is and the plotted line parallel
to the anti-diagonal is , then is the vertical bisector of . In all pairwise align-
ments, we will select out the pairs of sequences which have DNA transposon features.
We present the algorithm of finding DNA transposon feature in dot matrix in
Table 4.2. Direct repeat is the head and tail of a DNA transposon and it appears as a
plotted line parallel to main diagonal in dot matrix. If the mismatch and indel in di-
rect repeat is under the threshold, then we will go further to check if there is a plotted
line which is parallel to anti-diagonal. We use the row of the last nucleotide in
and the column of the first nucleotide in to locate the spot of . In the direction
of anti-diagonal, if there is a plotted line where mismatch and indel are under the
threshold, then we will select out this pair of sequences as seeds.
Table 4.2: DNA transposon feature finding method.
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12: {
62
13:
14: {
15: •
16
17:
18:
19: {
20: •
21
22:
23:
24:
25:
26:
27: {
28: •
29:
30:
31:
32:
33:
34: }
35:
36:
37:
38: {
39: •
40:
41:
63
42:
43:
44:
45: }
46:
47:
48:
49:
50: }
51: }
52: }
53:
T A T G T A T G C C G A C C A
A • • • •
A • • • •
T • • • •
G • • •
C • • • •
C • • • •
G • • •
A • • • •
T • • • •
G • • •
T • • • •
A • • • •
T • • • •
C • • • •
A • • • •
64
Figure 4.4: Dot matrix of two sequences which have DNA transposon features.
4.2.3 Computing sequence matching score
Every base in a sequence read is told by the color of fluorescent dye labeled to the
sequence. During sequencing process, charge-coupled device (CCD) records a time
serial data, which are colors of each base. Sometimes fragment DNA is not forming a
perfect cluster after amplification, which causes density issue. Subsequently, it leads
the shape and location of the color to be inconsistent and ambiguous. It is the reason
why some bases are called incorrectly and the necessary of quality measurement.
Base-specific quality score was proposed to enhance the accuracy of consensus
sequences during assembly process [134]. Later on, it was quickly replaced by Phred
base calling method, which adopted by the Human Genome Project [135, 136]. Phred
base calling method was widely applied in majority of genome sequencing project,
because it generates 40%-50% less errors than other methods [137]. Now, Phred
quality score is the standard score scheme of measuring the base quality. We present
the scheme of Phred quality score in Table 4.3. The Phred quality score is defined
to be related to base-calling error probability , in equation 4.1. Table 4.3 explains
the meaning of Phred quality score and Figure 4.5 shows the encoding scheme made
by different versions of sequencing technologies.
Table 4.3: The meaning of Phred quality score
Phred quality score Probability of incorrect base call Base call accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10000 99.99%
50 1 in 100000 99.999%
65
!”#$%&’()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
| | | | |
ASCII 33 59 64 73 104
Sanger 0………………………………………..26……31………………40
Solexa -5…….0………………..9……………………………………………………..40
Illumina
1.3+
0………………..9……………………………………………………..40
Illumina
1.5+
3..………..9……………………………………………………..40
Illumina
1.8+
0………………………………………..26……31………………40
Figure 4.5: Encoding scheme of different sequencing technologies.
Suppose sequence is annotated by nucleotides {
} where
, and another sequence is annotated by
tides{
}, we denote the matching score between two nucleotides
and
as
. Then the equation of nucleotide matching score should be defined
as:
{
( ) (
)
( ( ) (
))
Where is Phred quality score of that nucleotide, is a constant equal
to the maximum value of Phred quality score, and P is a constant served as penalty
for an indel. The Phred quality score is varied in different sequencing technology.
66
Normally in sequence alignment, indel is worse than mismatching. So in experiments,
we set the value of P as -1 or less than -1, to make it a higher penalty than mismatch-
ing.
Based on the nucleotide matching score, sequence matching score will be calcu-
lated as the average matching score of all nucleotides. The equation of sequence
matching score is defined as:
∑
4.2.4 Extension algorithm
By using feature finding algorithm, we obtain pairs of seeds which are heads and
tails of DNA transposons. To connect two seeds into a contig sequence, we propose
an extension algorithm. The main idea is to extend the head seed by using the most
matched sequence, and it stops until the extension part can be aligned by the tail seed.
The extension algorithm starts with finding all related sequence reads which have
at least 30bp overlap to the head seed. Then the matching score will be calculated
for every related sequence read. The sequence read with the highest matching score
will be selected to extend the head seed. We will cut the selected read at the position
which is the last overlapped base pair. Then we connect the tail to the head seed.
Now we have a longer sequence to consider if we need further extension or output as
a result. If the tail seed is not overlapped with this new sequence, we will continue
extension process. Otherwise, we merge the tail seed to the end and output it as the
final result.
We present the extension process in Figure 4.6. For demonstration purpose, we
align all related sequence reads which have at least 3bp overlap to the head seed. Af-
ter calculation of sequence matching score, we have AGAAGTTTCTGA-
GAATGCTTCT as the most matched sequence with score of 0.946. Then we cut off
the overlapping part, and only attach the tail, CTTCT, to the head seed.
67
Figure 4.6: Sequence extension demonstration.
Head Seed
The most matched sequence read with score of 0.946
Cutting position
68
4.3 Results
We evaluate our DNA Transposon assembly method on benchmark dataset, Hu-
man chromosome 14. Since DNA transposon assembly method is focusing on DNA
transposon only, we constructed 8 datasets with different sizes from the original
benchmark dataset. These 8 different datasets contain the NGS reads which are over-
lapped with DNA transposons no less than 20 base pairs. These reads contain the in-
formation of DNA transposon and are actually useful for assembly. On different da-
taset scale, these datasets contain related NGS reads to, 10, 30, 60, 100, 150, 200,
300, and 350 DNA transposons respectively.
To further validate the performance on different species, we also evaluate DNA
transposon assembly method on a real dataset of Maize, downloaded from NCBI. We
also downloaded reference sequence, B73 RefGen_v4 chromosome 8, from NCBI.
This latest reference contains 181.12 Mb sequence in total with 47% GC content. We
downloaded repetitive sequence from Maize Repeat Database which contains DNA
transposons. We constructed 8 datasets with different sizes from the download da-
taset. These 8 datasets contain related NGS reads to, 10, 20, 50, 75, 100, 150, 200,
and 300 DNA transposon respectively.
The performance of an assembly method was measured in terms of the ,
and which are defined as follows. Suppose the sequence of DNA
transposon to assemble is . Let be the number of nucleotides originally on
and be the number of assembled nucleotides for . Let be the number of over-
lapped nucleotides between original sequence and assembled sequence. Then preci-
sion and recall can be defined as:
And based on the precision and recall, we can define f-value as:
For n sequences to be assembled, then precision, recall and f-value will be arithmetic
mean.
69
∑
∑
∑
4.3.1 Parameter setting
The DNA transposon assembly algorithm refers to six parameters: which is
threshold of matched nucleotides in , is the threshold of matched nucleotides in
, is the number of mismatch allowed in , is the number of mismatch
allowed in , is the number of indel allowed in , and is the number of in-
del allowed in . Which value should be set for the parameter and generally
depends on the assembly expectation. For example, if the assembly is expected to
have more candidates as result, and are set to small value accordingly. Which
value should be set for the parameter and generally depends on the mutation
rate. For example, if the assembly is expected to have 1 variation in the , is
set to 1 accordingly.
In our experiments, we tested , , , , , and using a dataset which con-
tains related NGS reads to 50 DNA transposons. We test one parameter while set the
rest as control variable (Table 4.4 - Table 4.9). We use f-value to measure the per-
formance when parameter changes. When f-value reaches peak, the value of that pa-
rameter will be kept (Figure 4.7 - 4.12). In this case, is set to 7, is set to 5, is
set to 1, is set to1, is set to 0, and is set to 0.
70
Table 4.4: Parameter setting on .
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 4.7: Assembly performance on different .
We changed the value of from 4 to 20 when set other parameters fixed. When
equals 7, the reaches peak.
Table 4.5: Parameter setting on .
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
f-va
lue
𝑡 𝑟𝑒𝑠 𝑜𝑙𝑑 𝑜𝑓 𝑚𝑎𝑡𝑐 𝑒𝑑 𝑛𝑢𝑐𝑙𝑒𝑜𝑡𝑖𝑑𝑒𝑠 𝑖𝑛 𝐷𝑅
71
Figure 4.8: Assembly performance on different .
We changed the value of from 4 to 20 when set other parameters fixed. When
equals 5, the reaches peak.
Table 4.6: Parameter setting on .
7 7 7 7 7 7 7 7 7 7 7
5 5 5 5 5 5 5 5 5 5 5
0 1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
Figure 4.9: Assembly performance on different .
0.3
0.4
0.5
0.6
0.7
0.8
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
f-va
lue
𝑡 𝑟𝑒𝑠 𝑜𝑙𝑑 𝑜𝑓 𝑚𝑎𝑡𝑐 𝑒𝑑 𝑛𝑢𝑐𝑙𝑒𝑜𝑡𝑖𝑑𝑒𝑠 𝑖𝑛 𝐼𝑇𝑅
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑚𝑖𝑠𝑚𝑎𝑡𝑐 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝐷𝑅
72
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 1, the reaches peak.
Table 4.7: Parameter setting on .
7 7 7 7 7 7 7 7 7 7 7
5 5 5 5 5 5 5 5 5 5 5
1 1 1 1 1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
Figure 4.10: Assembly performance on different .
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 1, the reaches peak.
Table 4.8: Parameter setting on .
7 7 7 7 7 7 7 7 7 7 7
5 5 5 5 5 5 5 5 5 5 5
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑚𝑖𝑠𝑚𝑎𝑡𝑐 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝐼𝑇𝑅
73
Figure 4.11: Assembly performance on different .
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 0, the reaches peak.
Table 4.9: Parameter setting on .
7 7 7 7 7 7 7 7 7 7 7
5 5 5 5 5 5 5 5 5 5 5
1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10
Figure 4.12: Assembly performance on different .
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑙 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝐷𝑅
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑙 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝐼𝑇𝑅
74
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 0, the reaches peak.
4.3.2 Performance comparison between ABySS2, Velvet, and
SOAPdenovo
We compared the assembly performance of our DTA method with three existing
assembly methods that also use NGS reads, but do not target on DNA transposon as-
sembly. The first existing method for comparison is SOAPdenovo method, the se-
cond one is Velvet method, and the last one is ABySS2 method. SOAPdenovo, Vel-
vet and ABySS2 are using de Bruijn graph to assemble DNA sequences. Different
strategies are applied when these three methods generate de Bruijn graph.
None of them utilize structural feature of DNA transposon in their methods.
SOAPdenovo removes erroneous connections on the graph, which including the
steps of clipping tips, removing low-coverage links, resolving tiny repeats and merg-
ing bubbles. Velvet adopts tour bus method to correct errors, including the step of
removing tips, removing bubbles, and removing erroneous connections. ABySS2 us-
es parallel computing in their method, it follows similar steps which are used in de
Bruijn graph based method. For these three methods, the way of assembling DNA
transposon is same as the way of assembling other regions, i.e. they ignore the struc-
tural feature of DNA transposon. To contrast, our DTA method studies the structural
feature of DNA transposon and has an assembly method accordingly.
None of these three existing methods utilize the quality information of NGS reads
during assembly. Even though, some bad reads are removed during the process of de
Bruijn graph generation, it still keeps a lot of noise information during contig genera-
tion. In our DTA method, we use quality to weight the NGS read and select the best
one to assemble DNA transposon.
The comparison of these four methods in terms of precision, recall and f-value
over eight datasets are shown in Figure 4.13 – 4.15 respectively. We can see from
these comparison results that the method using structural feature (i.e. DTA method)
outperformed the methods that ignore structural feature (i.e. ABySS2, Velvet, and
SOAPdenovo). Furthermore, it is obviously that our DTA method outperformed oth-
75
er methods across all evaluation datasets, even at small scale dataset. This demon-
strated that poor quality read will pull low the performance of de Bruijn graph based
methods. However, if we can filter out the bad NGS read and use only the good ones,
the performance of assembly will be increased (i.e. DTA method).
Figure 4.13: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo
method on Human chromosome 14.
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 2 3 4 5 6 7 8
Pre
cisi
on
Data set
DTA
ABySS2
Velvet
SOAPdenovo
76
Figure 4.14: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method
on Human chromosome 14.
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 2 3 4 5 6 7 8
Rec
all
Data set
DTA
ABySS2
Velvet
SOAPdenovo
77
Figure 4.15: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo meth-
od on Human chromosome 14.
To evaluate the overall effectiveness and performance of the four methods we
drew precision-recall curves for the four methods from their assembly results. These
curves are shown in Figure 4.15, which demonstrates that overall prediction perfor-
mance of DTA method is better than ABySS2, Velvet and SOAPdenovo.
To observe the details of performance comparison, we choose eighth dataset that
contains 350 DNA transposon. We compared DTA method with Velvet, SOAPdeno-
vo and ABySS2 to see how many DNA transposon can be correctly assembled. The
comparison result is shown in Table 4.10. Correct means >98% nucleotides are as-
sembled; weak means >98% nucleotides are assembled with extra erroneous head or
tail; bad means < 98% nucleotides are assembled; missing means failed to assemble.
Weak assembly, bad assembly and missing assembly are also called errors. We can
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 2 3 4 5 6 7 8
F-va
lue
Data set
DTA
ABySS2
Velvet
SOAPdenovo
78
see from the comparison result that our DTA method made better assembly than the
other methods.
Figure 4.16: The precision-recall curves of DTA, ABySS2, Velvet, and SOAPdeno-
vo method on Maize chromosome 8.
Table 4.10: Summary of DNA transposons assembled by different assemblers
Assembler Correct Error Weak Bad Missing
DTA 162 138 85 48 5
Velvet 98 202 134 58 10
SOAPdenovo 91 209 140 59 10
ABySS2 134 166 102 56 8
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Pre
cisi
on
Recall
DTA
ABySS2
79
Figure 4.17: Precision comparison of DTA, ABySS2, Velvet, and SOAPdenovo
method on Maize chromosome 8.
To validate the consistent performance on different species. We also compare our
DTA method with ABySS2, Velvet and SOAPdenovo on Maize datasets which con-
tain many DNA transposons. The comparison in terms of precision, recall, and f-
value over eight datasets are shown in Figure 4.17 – 4.19 respectively. We can see
from the comparison results that de Bruijn graph based method performed worse
when DNA transposons account for more components in a genome. It is obvious to
tell our DTA method outperformed the other three existing methods across all da-
tasets. Moreover, it shows in transposon rich region, our DTA still consistently per-
formed well. This because when DTA method assemble each DNA transposon is in-
dependent with other regions. So the assembly result will not be affected by the envi-
ronment, i.e. transposon rich region.
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 5 6 7 8
Pre
cisi
on
Data set
DTA
ABySS2
80
Figure 4.18: Recall comparison of DTA, ABySS2, Velvet, and SOAPdenovo method
on Maize chromosome 8.
Figure 4.19: F-value comparison of DTA, ABySS2, Velvet, and SOAPdenovo meth-
od on Maize chromosome 8.
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
1 2 3 4 5 6 7 8
Rec
all
Data set
DTA
ABySS2
Velvet
SOAPdenovo
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 2 3 4 5 6 7 8
F-va
lue
Data set
DTA
ABySS2
Velvet
SOAPdenovo
81
4.3.3 Distribution and variation
Figure 4.20: Assembly result of MamRep434 mapping to reference.
82
Figure 4.21: Assembly result of 28 DNA transposons mapping to reference.
83
Figure 4.22: Assembly result of 103 DNA transposons mapping to reference.
84
Figure 4.23: Assembly result of 163 DNA transposons mapping to reference.
85
To observe the distribution and correctness of assembled DNA transposon, we as-
sembled DNA transposon, MamRep434, using DTA and mapped assembled result to
reference genome of Human chromosome 14. Figure 4.20 demonstrates that DTA
correct assembled MamRep434 of which the assembled result can be uniqly mapped
to chromoson14. We also randomly choosed 28 DNA transposons acroos Human
chromosome 14 and mapped assemby results in Figure 4.21. For some DNA
transposons (i.e. tigger3b), the assembled results map to many locations where across
reference genome. If we investigate futher, it is shown that it has a best hit at a spot
with partially can be mapped to many other places. The blue arcs in the figure
represent partially mapping. It implies that repeat motifs are common in DNA
transposons and these repeat motif will confuse most sequence assembler and
decrease the performance. To have a better understanding, we also assembled the
DNA transposons at the beginning part of the human chromosone 14 in Figure 4.22
and Figure 4.23. One dataset contains 103 DNA transposons and the other one
contains 163 DNA transposons. It demonstrates in the figures that our DTA method
has the ability to assemble DNA transposons in a density area.
To determine the variation in DNA transposons, we aligned all our assembled re-
sults to reference genome, GRCh 38, using Blastn. The Blastn command was run
with the parameters, -F F -e 1e-10 –E -1 –V 500 –b 500. By doing this, we have as-
sembled results very likely to have at least one hit with reference, even it is a bad as-
sembly. Then we are programmatically to filter out the alignments which are obvi-
ously beyond the scope. After filtering, for all valid alignments, we annotated inser-
tions and deletions. In our analysis, we define a deletion if more than 10 bp of the
reference genome are not matched by assembled sequences while the upstream and
downstream of the break are fully matched. We define an insertion if more than 10
bp of the assembled sequences is not matched by reference genome while the up-
stream and downstream of the break are fully matched.
In all alignments, we also found some segments of an assembled sequence are ful-
ly matched with reference gnome whereas the rest parts are not. Thus we define a
highly diverged region if more than 20 bp from both assembled sequences and refer-
ence genome are not aligned to each other (Table 4.11). In the experiments of length
distribution of insertion and deletion, we found that over 7 kb of reference sequence
86
was lost, whereas over 5 kb new sequences are in assembled sequence (Table 4.12).
Such result is expected as transposable elements which tend to have variants. Similar
findings were presented in previous experiments of assembling Arabidopsis thaliana
genomes [138].
Table 4.11: Deletion and insertion of different length in chromosome 14.
Deletion Insertion
Variant length (bp) No. Length (bp)* No. Length (bp)*
1 709 709 687 687
2 199 398 203 406
3-4 168 566 161 544
5-8 118 738 115 717
9-16 77 875 72 810
17-32 38 833 26 558
33-64 15 608 13 530
65-128 8 525 8 598
129-256 6 818 4 435
>256 6 1,482 3 454
* Cumulative length of all variants of the class in that row
Table 4.12: Highly diverged regions of different length in chromosome 14.
Highly diverged regions (HDR)
Variant Length (bp) No. Length (bp)*
20-38 4 96
39-76 9 361
77-152 17 952
153-304 18 3216
304-606 11 4562
607-1212 5 5623
>1213 1 2564
87
* Cumulative length of all variants of the class in that row
Particularly in our experiments, we investigated the frequency of indels occurring
in multiple of 3 base pairs. This is to verify if indels in multiple of 3 base pairs are
more frequently occurring than other lengths of indels [139]. We aligned all genes
sequences of chromosome 14 to our assembled results. Then, we retrieved the frac-
tion of indels in different lengths, for coding regions, intergenic regions respectively
and complete coding sequence respectively. It is obviously to find that indels in 1
base pair are most frequently occurring (Figure 4.10). Also, it is worth noting that
indels in 3 base pairs are more likely occurring than other types of indels. Thus, we
verified that indels take place preferentially in multiple of 3 base pairs.
Figure 4.24: Frequency of indel lengths and variation of coding sequence lengths in
chromosome 14.
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Frac
tio
n
Length (bp)
complete coding sequences
indels (coding regions)
indels (intergenic regions)
88
4.4 Summary
This chapter proposed an innovative method of assembling DNA transposon
(DTA). We studied the structure of DNA transposon and recognized that the struc-
tural feature can be utilized to assemble DNA transposon. Two novel algorithms are
proposed: DNA transposon feature finding algorithm and extension algorithm. DNA
transposon feature finding algorithm is based on dot matrix. We use dot matrix to
check pairwise alignment where we select out seeds have and
structure. Extension algorithm is based on computing sequence matching score
which is according to NGS read quality. We extend the head seed using the best
matched NGS read until it reaches tail seed.
We compared our DTA method with three existing methods, i.e. Velvet, ABySS2
and SOAPdenovo. The comparison showed that our DTA method outperformed oth-
er three methods not only on Human datasets but also Maize datasets. In addition, we
investigated the distribution of assembled results. Moreover, we discussed the varia-
tion in our assembled results and it showed that transposable elements are tending to
have variations.
89
Chapter 5
5 Innovative assembly method of retro
transposon
5.1 Introduction
In chapter 4, we classified transposable elements into two classes. One class is
DNA transposon, the other one is retrotransposon. In this chapter, we propose a
method on retrotransposon assembly (RA). There are two types of retrotransposon,
i.e. LTR type retrotransposon and non-LTR type retrotransposon. The structures of
two types of retrotransposon are different. For both types of retrotransposon, they
have two target site duplications at each end of the sequence. But for LTR type re-
trotransposon, there are two long terminal repeats in the sequence, while no long
terminal repeat in non-LTR type retrotransposon.
Based on the structural feature, we proposed two methods to find pairs of seeds
for each type of retrotransposon. A pair of seeds has the sequences which represent
the head part and end part of a retrotransposon. To assemble a retrotransposon, we
need a method to fill up the gap between a pair of seeds. The gap is actually a se-
90
quence that we can find in de Bruijn graph which is generated from NGS reads. Re-
cently, simple path generation was proposed to optimize de Bruijn graph. And our
method of filling gap between seeds is based on the simple paths which are generated
from de Bruijn graph.
The rest of the chapter is organized as follows. First, we give definitions to two
types of retrotransposon from the perspective of structural feature. Second, we pro-
pose two methods of finding pairs of seeds of retrotransposon and one method of fill-
ing gap between seeds. Third, we evaluate our method by comparing to the-state-of-
art method using benchmark dataset, and also we test our method by assembling nu-
cleolus-associated domains (NAD). Finally, we conclude this chapter.
5.2 Method
In this section, we present our innovative algorithm for retrotransposon assembly.
This algorithm uses NGS reads to generate simple paths which are obtained from de
Bruijn graph. Not like other de Bruijn graph based method, we also studied the struc-
tural features of retrotransposon, i.e. LTR type and non-LTR type (Figure 4.1). We
utilized the structural features of retrotransposon to select out the pairs of NGS reads
as seeds which represent both ends of the retrotransposon. A pair of seed has one
head seed which is the start of a retrotransposon, and one tail seed which is the end
of a retrotransposon .Then we map seeds on simple paths to fill up the gap between
head seed and tail seed by using a recursive method. Our retrotransposon assembly
algorithm consists of three methods. We firstly present the method of simple path
generation, secondly present the method of seed selection, and lastly present the re-
cursive method of gap filling.
5.2.1 Overview and basic terms
We formally define basic terms and overview the algorithm.
Definition 5.1 ( ). A target site duplication is a se-
quence of nucleotides from 5’ to 3’, where nucleotide . A target site du-
plication always comes with a pair of identical sequences. The length of a target site
duplication ranges from 5 base pairs to 20 base pairs.
91
Definition 5.2 ( ). A long terminal repeat is a sequence
of nucleotides from 5’ to 3’, where nucleotide . A long terminal repeat
is only shown in LTR type of retrotransposon. Comparing to TSD, the length of LTR
can be hundreds of base pairs.
Definition 5.3 ( ). A PolyA is a sequence of nucleotides from 5’ to 3’, where
nucleotide . If the length of a PolyA is 3, then the sequence of PolyA is .
Definition 5.4 ( ). A sequence of nucleotides
has a LTR type retrotransposon feature if it has two parts where the part at 5’ has the
structure of and the other part at 3’ has the structure of ,
where + is the operator that if , , then
.
Definition 5.5 ( , ). A se-
quence of nucleotides has a non-LTR type retrotransposon feature if it has two parts
where the part at 5’ has the structure of and the other part at 3’ has the structure
of , where + is the operator that if , , then
.
Retrotransposon assembly method takes two phases (Figure 5.1). Phase I will use
NGS reads to generate de Bruijn graph and also select out pairs of seeds using fea-
ture finding algorithm, while phase II will use de Bruijn graph to generate simple
paths where all pairs of seeds can find the right paths to fill up the gap. Specifically,
the model will take 3 steps to finish assembly.
1. De Bruijn graph will be generated using NGS reads. Simple paths then ob-
tained from de Bruijn graph.
2. Feature finding algorithm will scan NGS reads to select out pairs of seeds
which either have or . There are two
cases we will select two NGS reads as a pair of seed. One case is that if
one read has the structure of and the other read has the struc-
ture of . The second case is that if one read has the structure of
and the other read has the structure of .
92
3. Pairs of seeds will be applied on simple paths where a recursive method
will be used to find a path between head seed and tail seed. If a valid path
is fond then we will output the path as the assembled sequence.
It can be seen from the algorithm steps that our retrotransposon assembly method
is relied on three methods, i.e. simple path generation method, retrotransposon fea-
ture finding algorithm, and a gap filling algorithm to connect head seed to tail seed.
NGS data
Simple path
Seeds
RetrotransposonPhase I Phase II
de Bruijn
graph
generation
Feature
finding
algorithm
Gap filling
algorithm
LTR feature Non LTR feature
Figure 5.1: Overview of retrotransposon assembly method
5.2.2 De Bruijn graph generation
NGS reads will have overlap with other reads because of the deep sequencing. For
example, the tail of read A (ATGCCGGAA) is overlapped with the head of read B
(CCGGAAGAC). Sequence assemblers are utilizing such information to have short
reads merge to long sequence. However from the data perspective, same sequence in
different reads is redundant information which wastes computational memory. De
Bruijn graph is introduced to solve memory usage bound. For a de Bruijn
graph , a vertex represents a , a sequence of letters, and an edge is
created when two have overlap of letters. In DNA sequence assem-
bly, we denote set including all possible nucleotides from se-
quence reads. For a sequence , denotes its length, is the nucleotide at posi-
tion . Then we denote as a reverse-complement sequence of , and the complement
is defined as
93
So we have
For sequence and sequence , denotes their concatenation. So a de Bruijn
graph of DNA sequence is denoted as , where is an integer controlling
the length of . So, a graph contains vertices which is the set of all
possible , and edges is the set of
where is the substring of from position to .
94
Figure 5.2: De Bruijn graph with k = 24 of sequence ATCGAACGTTTTAC-
CAGCGATTCA.
Therefore, for a particular Illumina paired-end read , two sequences which
are and will be used to construct de Bruijn graph. For a simple demonstration, a
sequence read ATCGAACGTTTTACCAGCGATTCA will be generated into a de
Bruijn graph with , in Figure 5.2.
So far, we have introduced the definition of de Bruijn graph of DNA sequences. A
vertex from the graph represents a which is obtained from sequence reads.
Because of the high coverage during sequencing process, a can be con-
tained by many sequence reads. Especially for the from repetitive regions,
its frequency will be higher than the average. However, some ’s frequency
95
is lower than the average. This could be caused by the low coverage during sequenc-
ing process or erroneous base calling during sequence read process. To generate a
quality de Bruijn graph, we want to exclude of which frequency are lower
than a cutoff, and only keep those which are qualified. Given a family of reads
, we denote as the frequency of a , namely the number of
times it existing in . For a cutoff , a , is qualified to generate de
Bruijn graph when .
5.2.3 Simple path algorithm
Now, we have a quality set of ready to generate de Bruijn graph. We
need to consider how we can simplify the structure of de Bruijn graph. Specifically,
we want to decrease the amount of vertexes in graph . Given a set
of , we denote as , are compactable if , , ,
if then . The compaction process is conducted on a pair of
ble . If , then it replaces , as
where is the string concatenation operator and is the -th letter of . To
execute compaction process, we introduce the concept of minimizer which is first
introduced by [140]. The -minimizer of a is the substring of in length
of . We define to be the -minimizer of the -prefix of
and to be the -minimizer of the -suffix of . For two
and , if , then . Given a set of and
parameter , we will first obtain all -minimizers and their frequencies. Based on the
frequency, all -minimizers are sorted in ascending order. Afterwards, from the low-
est frequent -minimizer to the highest, all will be assigned to an associat-
ed -minimizer. We denote to be a container to hold its associated and
compacted sequences. The compaction process will start with the which
are associated to the lowest -minimizer. Following a greedy process,
within a will be -compacted and the results will be temporarily put in . If a
sequence in is -computable with a higher frequent -minimizer , then it will
96
be allocated to the associated container , to have a further compaction. Finally,
compaction process stops after all sequences in highest finishing compaction.
Table 5.1: Simple path generation method.
1:
2:
3:
4:
5:
6:
7: {
8:
9: {
10:
11:
12: }
13: }
14:
15:
16:
17: {
18:
19: {
20:
21:
22:
23: }
24:
25: {
97
26:
27:
28:
29:
30:
31:
32:
33:
34: }
35:
36: }
In Figure 5.3, we break down the process of simple path generation. In phase A,
we have four 3-mers, {GGA, GAG, AGT, AGG}. In phase B, we obtain all 2-
minimizers from 3-mers and order them by ascending way according to frequency. In
phase C, we start compaction process from associated to GT which is the lowest
frequent 2-minimizer. Since AGT is the only 3-mer in , it will be directly put
in and then be allocated to which holds the next higher 2-compactable 3-
mers. Following the order, 3-mers in , {GGA, AGG}, will be compacted into
AGGA and then be allocated to . In the end, all sequences in will be com-
pacted and output result in , as {AGT, CAACA}.
A:
AGTGGA
AGG
GAG
B:
GT GG GA AG
1 2 2 3
98
C:
Step 1 2 3 4
GT
GG GA AG
AGT
GGA
AGG
GAG
AGGA
AGT
AGGAG
AGT
GGA
AGG
GAG
AGGA
AGT
AGGAG
AGT
AGGA
AGGAG
AGT
AGGAG
Figure 5.3: Demonstration of simple path generation process.
5.2.4 Select LTR feature seeds
We use structural feature of LTR type retrotransposon to select LTR seeds, i.e.
structure and structure. Every two NGS reads will be plot-
ted in a dot matrix which is introduced in chapter 4. If one read has the structure of
and the other read has the structure of , then they will be se-
lected out as a pair of seed, where the read has structure will be head
seed and the read has structure will be tail seed.
For a pairwise alignment in dot matrix, if one read has the structure of
, and the other read has the structure of , two regions that share sig-
nificant similarity will appear (Figure 5.4). Visually, these two regions will appear as
two parallel lines which are parallel to main diagonal. Ideally, in the assumption of
no mismatch and no indel, the plotted two parallel liens will share one vertical bisec-
tor. Target site duplication is the head and tail of a retrotransposon, it appears as a
plotted line parallel to main diagonal in dot matrix. The longer a line is the more
matches two reads will have. A mismatch will show a blank spot off the line while an
99
indel will dislocate the line in the way of moving up one position or moving down
one position. If the plotted line is long enough to be a , and mismatch and indel
are under threshold, we will go further to check if there is another line parallel to
.
T T C G A G C A A T C C T A C
A • • • •
A • • • •
T • • • •
C • • • • •
T • • • •
C • • • • •
G • •
A • • • •
G • •
C • • • • •
A • • • •
G • •
C • • • • •
G • •
T • • • •
Figure 5.4: Dot matrix of two sequences which have LTR features.
We present the method of finding LTR feature in dot matrix in Table 5.2. In the
demonstration, we assume the read has structure is written to the left-
most column of the dot matrix while the read has structure is written to
the top row of the dot matrix. The algorithm starts scanning the dot matrix to find a
plotted line which represents . If the thresholds of match, mismatch and indel for
are satisfied, we will go further. Then we continue scanning to find another
100
plotted line from the next row where ends. If the thresholds of match, mismatch
and indel for are satisfied, we will select these two reads as a pair of seed.
Table 5.2: Long terminal repeat feature finding method
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12: {
13:
14: {
15: •
16
17:
18:
19: {
20: •
21
22:
23:
24:
101
25:
26: {
27: •
28:
29:
30:
31:
32:
33: }
34:
35:
36: }
37: }
38: }
39:
5.2.5 Select Non-LTR feature seeds
There is no long terminal repeat in non-LTR type retrotransposon. So this type of
retrotransposon is referred as Non-LTR. For Non-LTR at 3’ end, there is a
track in front of . For a pairwise alignment in dot matrix, if one read has the
structure of and the other read has the structure of , one region
that shares significant similarity will appear. Since TSD is from 5’ to 3’, so this re-
gion will appear as a line parallel to main diagonal. Visually, from the perspective of
starting location, this plotted line will start nearby a track (Figure 5.5).
To find a non-LTR feature in dot matrix, we need a method to find track
first. We present a simple repeat finding method in Table 5.3. Simple repeat finding
is a method which takes a DNA sequence and outputs the length and start location of
each track. This method implements a greedy algorithm. Specifically, given a
DNA sequence, for example, Chromosome 1 of Human Genome, simple repeat find-
ing method will traverse from the first nucleotide to the last nucleotide one by one.
102
Whenever there is a track of which length is longer than a threshold, it will
output the start location and length.
C G T C A A A A T G C G C A C
G • • •
T • •
G • • •
C • • • • •
G • • •
C • • • • •
A • • • • •
C • • • • •
T • •
A • • • • •
T • •
G • • •
C • • • • •
G • • •
T • •
Figure 5.5: Dot matrix of two sequences which have Non-LTR features.
For non-LTR type retrotransposon, we are only interested in the pairs of seeds
which contain track. So for a pairwise alignment in dot matrix, we will use
simple repeat finding method to locate the position of track. If a PolyA track
is found, then we will scan nearby region to find a plotted line which is parallel to
main diagonal. We present the non-LTR feature finding method in Table 5.4. In the
demonstration, we assume the read has structure is written to the top
row and the read has structure is written to the leftmost column. We use last
column of the track to locate the beginning of . If a long plotted line is
103
fond and the thresholds of match, mismatch and indel for are satisfied, we will
select these two reads as a pair of seed.
Table 5.3: Simple repeat finding method
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
Table 5.4: Non – LTR feature finding method
1:
2:
104
3:
4:
5:
6:
7:
8:
9:
10:
11: {
12:
13: {
14: •
15:
16:
17: {
18: •
19:
20:
21:
22:
23:
24: }
25: }
26: }
27:
5.2.6 Gap filling method
By using LTR feature finding method and non-LTR feature finding method, we
obtain pairs of seeds which are heads and tails of retrotransposon. To fill up the gap
105
between head seed and tail seed, we apply a recursive algorithm to find a valid path
on simple paths which are generated from de Bruijn graph. The idea of assembling
retrotransposon is presented in Figure 5.6. The head part and tail part of retrotrans-
poson are identified. The middle regions of retrotransposon are represented in simple
paths. For example, for non-LTR type retrotransposon, we have a pair of seeds and
identified from feature finding method where is the head seed and is the tail
seed. The middle regions of the non-LTR type retrotransposon are 5’ UTR, ORF1,
ORF2 and 3’UTR which are simple paths generated from de Bruijn graph. At same
time we traverse the first level of right nodes of and the first level of left nodes of
c’, if two nodes founded are cascade, the method will stop traversing and merge the
path as a retrotransposon, otherwise the method will continue on the second level of
nodes until a valid path found. The method is presented in Table 5.5.
TSD TSD
RetrotransposonsLTR
LTR LTRGag Prt Pol Env
TSD TSD5' UTR ORF 1 ORF 2 3' UTR
Poly-A
A
BC
Simple Path sequences
B'
C'
A'
RetrotransposonsNon-LTR
Figure 5.6: Demonstration of contig generation of retrotransposon
106
Table 5.5: Contig generation method
1:
2:
3:
4:
5: {
6: ( )
7:
8:
9:
10: {
11:
12: {
13:
14: {
15:
16: }
17: }
18: }
5.3 Result
We evaluate our retrotransposon assembly method on benchmark dataset, Human
chromosome 14.We constructed 8 datasets with different sizes from the original
benchmark dataset. These 8 different datasets contain the NGS reads which are over-
lapped with retrotransposons no less than 20 base pairs. These reads contain the in-
107
formation of retrotransposons and are actually useful for assembly. On different da-
taset scale, these datasets contain related NGS reads to, 10, 20, 30, 40, 50, 60, 70,
and 100 retrotransposons.
Suppose the retro-transposon pretended as unknown is . Let be the number of
retro-transposon originally on reference genome and be the number of assembled
sequences for . Let be the overlap between original reference set and assembled
set. Then precision and recall can be defined as:
And based on the precision and recall, we can define f-value as:
For n sequences to be assembled, then precision, recall and f-value will be arith-
metic mean.
∑
∑
∑
5.3.1 Parameter setting
For LTR type retrotransposon, the assembly method refers to six parameters:
which is the threshold of matched nucleotides in , which is the threshold of
matched nucleotides in , which is the number of mismatch allowed in ,
which is the number of mismatch allowed in , which is the number of indel
allowed in , and which is the number of indel allowed in . In our experi-
108
ments, we tested , , , , , and using a dataset which contains related NGS
reads to 50 LTR type retrotransposons. We test one parameter while set the rest as
control variable (Table 5.6 - Table 5.11). We use f-value to measure the performance
when parameter changes. When f-value reaches peak, the value of that parameter
will be kept (Figure 5.7 - 5.12). In this case, is set to 8, is set to 24, is set to 0,
is set to2, is set to 0, and is set to 1.
Table 5.6: Parameter setting on .
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Figure 5.7: Performance of retrotransposon assembly method on different .
We changed the value of from 4 to 20 when set other parameters fixed. When
equals 8, the reaches peak.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
f-va
lue
𝑡 𝑟𝑒𝑠 𝑜𝑙𝑑 𝑜𝑓 𝑚𝑎𝑡𝑐 𝑒𝑑 𝑛𝑢𝑐𝑙𝑒𝑜𝑡𝑖𝑑𝑒𝑠 𝑖𝑛 𝑇𝑆𝐷
109
Table 5.7: Parameter setting on .
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Figure 5.8: Performance of retrotransposon assembly method on different .
We changed the value of from 15 to 30 when set other parameters fixed. When
equals 24, the reaches peak.
0.575
0.58
0.585
0.59
0.595
0.6
0.605
0.61
0.615
0.62
0.625
0.63
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
f-va
lue
𝑡 𝑟𝑒𝑠 𝑜𝑙𝑑 𝑜𝑓 𝑚𝑎𝑡𝑐 𝑒𝑑 𝑛𝑢𝑐𝑙𝑒𝑜𝑡𝑖𝑑𝑒𝑠 𝑖𝑛 𝐿𝑇𝑅
110
Table 5.8: Parameter setting on .
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Figure 5.9: Performance of retrotransposon assembly method on different .
We changed the value of from 0 to 15 when set other parameters fixed. When
equals 0, the reaches peak.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑚𝑖𝑠𝑚𝑎𝑡𝑐 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝑇𝑆𝐷
111
Table 5.9: Parameter setting on .
8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
24 24 24 24 24 24 24 24 24 24 24 24 24 24 24 24
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Figure 5.10: Performance of retrotransposon assembly method on different .
We changed the value of from 0 to 15 when set other parameters fixed. When
equals 2, the reaches peak.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑚𝑖𝑠𝑚𝑎𝑡𝑐 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝐿𝑇𝑅
112
Table 5.10: Parameter setting on .
8 8 8 8 8 8 8 8 8 8 8
24 24 24 24 24 24 24 24 24 24 24
0 0 0 0 0 0 0 0 0 0 0
2 2 2 2 2 2 2 2 2 2 2
0 1 2 3 4 5 6 7 8 9 10
1 1 1 1 1 1 1 1 1 1 1
Figure 5.11: Performance of retrotransposon assembly method on different .
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 0, the reaches peak.
Table 5.11: Parameter setting on .
8 8 8 8 8 8 8 8 8 8 8
24 24 24 24 24 24 24 24 24 24 24
0 0 0 0 0 0 0 0 0 0 0
2 2 2 2 2 2 2 2 2 2 2
0 0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑙 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝑇𝑆𝐷
113
Figure 5.12: Performance of retrotransposon assembly method on different .
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 0, the reaches peak.
For non-LTR type retrotransposon, the assembly method refers to four parame-
ters: which is the threshold of matched nucleotides in , which is the length
threshold of track, which is the number of mismatch allowed in , and
which is the number of indel allowed in . We tested , , and using a dataset
which contains related NGS reads to 50 non-LTR retrotransposons. We test one pa-
rameter while set the rest as control variable (Table 5.12 - Table 5.15). We use f-
value to measure the performance when parameter changes. When f-value reaches
peak, the value of that parameter will be kept (Figure 5.13 - 5.16). In this case, is
set to 9, is set to 5, is set to 0 and is set to 0.
Table 5.12: Parameter setting on .
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑙 𝑎𝑙𝑙𝑜𝑤𝑒𝑑 𝑖𝑛 𝐿𝑇𝑅
114
Figure 5.13: Performance of retrotransposon assembly method on different .
We changed the value of from 3 to 20 when set other parameters fixed. When
equals 9, the reaches peak.
Table 5.13: Parameter setting on .
9 9 9 9 9 9 9 9 9 9 9 9 9
2 3 4 5 6 7 8 9 10 11 12 14 15
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
Figure 5.14: Performance of retrotransposon assembly method on different .
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
f-va
lue
𝑡 𝑟𝑒𝑠 𝑜𝑙𝑑 𝑜𝑓 𝑚𝑎𝑡𝑐 𝑒𝑑 𝑛𝑢𝑐𝑙𝑒𝑜𝑡𝑖𝑑𝑒𝑠, 𝑡
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
2 3 4 5 6 7 8 9 10 11 12 13 14 15
f-va
lue
𝑙𝑒𝑛𝑔𝑡 𝑜𝑓 𝑝𝑜𝑙𝑦𝐴 𝑡𝑟𝑎𝑐𝑘, 𝑙
115
We changed the value of from 2 to 15 when set other parameters fixed. When
equals 5, the reaches peak.
Table 5.14: Parameter setting on .
9 9 9 9 9 9 9 9 9 9 9
5 5 5 5 5 5 5 5 5 5 5
0 1 2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 0 0
Figure 5.15: Performance of retrotransposon assembly method on different .
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 0, the reaches peak.
Table 5.15: Parameter setting on .
6 6 6 6 6 6 6 6 6 6 6
5 5 5 5 5 5 5 5 5 5 5
0 0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9 10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑚𝑖𝑠𝑚𝑎𝑡𝑐 𝑎𝑙𝑙𝑜𝑤𝑒𝑑, 𝑚
116
Figure 5.16: Performance of retrotransposon assembly method on different .
We changed the value of from 0 to 10 when set other parameters fixed. When
equals 0, the reaches peak.
5.3.2 Performance comparison between ABySS2, Velvet, and
SOAPdenovo
We compared the assembly performance of our RA method with three existing as-
sembly methods that also use NGS reads, but do not target on DNA transposon as-
sembly. The first existing method for comparison is SOAPdenovo method, the se-
cond one is Velvet method, and the last one is ABySS2 method. SOAPdenovo, Vel-
vet and ABySS2 are also using de Bruijn graph to assemble DNA sequences. In
comparison, our RA method is based on simple paths which are generated from de
Bruijn graph.
None of the three existing methods utilize the structure feature of retrotransposon,
while, our RA method studied structural features of two types of retrotransposon, i.e.
LTR type retrotransposon and non-LTR type retrotransposon. SOAPdenovo and
Velvet methods use different strategies to obtain consensus contig sequence from de
Bruijn graph. ABySS2 method adopted parallel computing strategy to assemble se-
quences based on de Bruijn graph. These three existing methods are genome wide
assembler which has the same assembly strategy for any type sequence.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
0 1 2 3 4 5 6 7 8 9 10
f-va
lue
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑖𝑛𝑑𝑒𝑙 𝑎𝑙𝑙𝑜𝑤𝑒𝑑, 𝑖
117
The comparisons of these four methods in terms of precision, recall and f-value
over eight datasets are shown in Figure 5.17 – 5.19 respectively. We can see from
these comparison results that method using structural feature (i.e. RA method) out-
performed the methods that ignore structural feature (i.e. ABySS2, Velvet, and
SOAPdenovo). We also noticed that our RA method is superior to other methods
across all datasets. The big performance gap in terms of precision, recall and f-value
indicates that the method dedicated to retrotransposon (i.e. RA method) outper-
formed the methods which treat retrotransposon as common sequence (i.e. ABySS2,
Velvet, and SOAPdenovo).
Retrotransposon follows a “cut-and-paste” mechanism, which means one re-
trotransposon will have many copies across the genome. To have a detailed look our
RA method, we assembled one LTR type retrotransposon, i.e. PABL_A, two non-
LTR type retrotransposon, i.e. AluSx and L1PB. We mapped the assembly results to
the reference sequence of Human chromosome 14, in Figure 5.20-5.22. Every arc in
the figure represents a mapping between one location at reference and one assembled
result from RA method. As we can see from the mapping in Figure 5.20-5.22, for
both LTR type retrotransposon and non-LTR type retrotransposon, our RA mehtod
has the ability to assemble different copies which are distributed across the
chromosome.
Figure 5.17: Precision comparison of RA, ABySS2, Velvet, and SOAPdenovo meth-
od on Human chromosome 14.
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 5 6 7 8
Pre
cisi
on
Data set
RAABySS2VelvetSOAPdenovo
118
Figure 5.18: Recall comparison of RA, ABySS2, Velvet, and SOAPdenovo method
on Human chromosome 14.
Figure 5.19: F-value comparison of RA, ABySS2, Velvet, and SOAPdenovo method
on Human chromosome 14.
0.05
0.15
0.25
0.35
0.45
0.55
0.65
0.75
0.85
0.95
1 2 3 4 5 6 7 8
Rec
all
Data set
RA
ABySS2
Velvet
SOAPdenovo
0.05
0.15
0.25
0.35
0.45
0.55
0.65
0.75
0.85
0.95
1 2 3 4 5 6 7 8
F-va
lue
Data set
RAABySS2VelvetSOAPdenovo
119
Figure 5.20: Assembly result of PABL_A (LTR type retrotransposon) mapping to
reference.
120
Figure 5.21: Assembly result of AluSx (SINE, non-LTR type retrotransposon)
mapping to reference.
121
Figure 5.22: Assembly result of L1PB (LINE, non-LTR type retrotransposon)
mapping to reference.
122
5.3.3 Assemble nucleolus associated domain using RA method
To test the performance of our RA method in large dataset, i.e. genome wide as-
sembly, we experimented our method on a nucleolus-associated domains (NAD) da-
taset which contains NGS reads across Human genome. NAD was previously re-
vealed to have rich contents of repetitive sequences. The sequence of these domains
is not revealed and it is not suitable to assemble those using existing methods which
performed bad on repetitive sequences. Here, we report the sequence of NADs in
Human Genome by using our RA method. NAD domains was assembled and then
mapped against human genome. Here we report the assembled sequences and our
analysis based on assembled result.
Figure 5.23: Simple path of nucleolus-associated domains.
In our experiments of assembling NAD, we use 27mer to generate de Bruijn graph
from which we obtain simple paths. For all simple path of which the length is longer
than 101 base pairs, we map them to the reference of Human genome. With default
setting of blastn, we found that in total of 4216652 simple paths, we have 4607768
3685520
446919
51120
8906
3341
1339
674
471 272
230
182
102 63
53 41
38 16 16 16 12 11 9 7 9 2
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
75% 80% 85% 90% 95% 100%
Am
ou
nt
Match rate againt GRCh38
123
hits with reference genome, i.e. GRCh38. 4199369 simple paths have at least one hit
with GRCh38. Among 4607768 hits, 4403006 hits has more than 97% match rate
with reference genome. However there are 17283 simple paths have no hit with the
reference (Figure 5.23). In Figure 5.23, we present the amount of simple path on dif-
ferent match rate against the reference gnome. It is noticed that majority of the sim-
ple paths have very high match rate, which suggest that assembled simple paths are
in high quality.
Then we applied our retrotransposon assembly method on simple paths, we as-
sembled 203637 sequences which have more than one hit with the reference genome.
These assembled sequences have more than 97% match rate against human genome
in multiple locations. Specifically, 156986 sequences have more than 50 hits and
9216 sequences have more than 200 hits. In Table 5.16, for each chromosome of
Human genome, we list the longest NAD assembled, total length of NAD and the
percentage of NAD in the chromosome. We found that about one thirds of Human
genome are associated to NAD.
124
Table 5.16: Statistics of assembled NAD in Human genome
Chromosome Longest NAD (bp) NAD bases (bp) NAD / Whole (%)
1 3173 90588466 36.3873 2 2285 76462033 31.5706 3 1874 64788486 32.6727 4 1373 49320570 25.9289 5 4251 69184119 38.1099 6 1998 54897016 32.14 7 1641 59016522 37.0367 8 3515 54017426 37.2178 9 2782 45878913 33.1508
10 1841 50006735 37.375 11 1474 44011019 32.5799 12 2485 52561713 39.4384 13 2320 33925276 29.6642 14 6991 30119144 28.1372 15 3677 39021421 38.2596 16 2163 33018295 36.5496 17 3561 29536927 35.4766 18 1611 28772173 35.7982 19 1472 23012482 39.2586 20 1529 24784924 38.4595 21 3257 20144623 43.127 22 1627 15168924 29.8492 X 4035 52184043 33.4425 Y 4035 5514692 9.6365
We investigated what genes get involved in NAD and how these genes distributed
on 24 chromosomes. We aligned all assembled sequences to human genome. For
every hit of more than 97% match rate, we record the start location and end location.
If the region covers a gene sequence, we highlight it with a blue horizontal line in
chromosome ideogram (Figure 5.24). In Figure 5.24, we demonstrate all genes that
are related to NAD. We found that these genes are widely distributed across 23
chromosomes, but only 7 related gene found in chromosome Y. The detailed infor-
mation of how many genes in NAD are listed in Table 5.17.
To learn what functions NAD will play in Human genome, we annotated these re-
lated genes using GO, an annotation of gene function. In Table 5.18, we listed 15
most significant genes which have functions only in NAD. PARK7 is found that 34
functions are only acted in NAD. Our analysis showed that NAD genes take part in
125
specific biological processes and molecular functions (Figure 5.19). For example,
function of translation repressor activity is only related to NAD. However, some
gene functions are not performed in NAD (Figure 5.20). For example, vitamin D
metabolic process is only functioned outside of NAD.
Table 5.17: No of gene in NAD
Chromosome No. of gene in NAD Chr1 1049 Chr2 712 Chr3 615 Chr4 372 Chr5 556 Chr6 505 Chr7 531 Chr8 370 Chr9 402
Chr10 447 Chr11 589 Chr12 558 Chr13 211 Chr14 334 Chr15 398 Chr16 460 Chr17 535 Chr18 172 Chr19 694 ChrX 403 ChrY 7
126
Figure 5.24: NAD gene distribution.
127
Table 5.18: Genes that have functions only in NAD.
Gene Name No. of Function In NAD No. of Total Function PARK7 34 120
CTNNB1 16 191 GATA3 16 118 AQP1 15 92
AKR1C3 14 62 LRRK2 12 120 XCL1 12 42
ADIPOQ 11 87 SIRT1 11 143 ITGAV 11 68
SERPINA5 11 31 SOX9 11 118 PAOX 11 20 PTK2B 10 124 MSH2 10 58
Table 5.19: 10 gene functions only associated with NAD
Gene Ontology Frequency Description Category
GO:0052697 9 xenobiotic glucuronidation biological_process
GO:0030371 9 translation repressor activity molecular_function
GO:1904224 8 negative regulation of glucu-
ronosyltransferase activity
biological_process
GO:2001030 8 negative regulation of cellular
glucuronidation
biological_process
GO:0004908 7 interleukin-1 receptor activity molecular_function
GO:0060100 7 positive regulation of phagocy-
tosis, engulfment
biological_process
GO:0004703 7 G-protein coupled receptor ki-
nase activity
molecular_function
GO:0010529 7 negative regulation of transpo-
sition
biological_process
GO:1901018 6 positive regulation of potassium
ion transmembrane transporter
activity
biological_process
GO:0034351 6 negative regulation of glial cell
apoptotic process
biological_process
128
Table 5.20: 10 gene functions not associated with NAD
Gene Ontology Frequency Description Category GO:0042359 10 vitamin D metabolic process biological_process
GO:0050795 9 regulation of behavior biological_process
GO:0072321 8 chaperone-mediated protein
transport
biological_process
GO:0003073 7 regulation of systemic arterial
blood pressure
biological_process
GO:0009629 7 response to gravity biological_process
GO:0046785 7 microtubule polymerization biological_process
GO:0031987 7 locomotion involved in locomo-
tory behavior
biological_process
GO:0008064 7 regulation of actin polymerization
or depolymerization
biological_process
GO:0006561 6 proline biosynthetic process biological_process
GO:0030816 6 positive regulation of cAMP met-
abolic process
biological_process
5.4 Summary
This chapter proposed an innovative method of assembling retrotransposon (RA).
Via dot matrix, we studied structural features of both types of retrotransposon, i.e.
LTR type retrotransposon and non-LTR type retrotransposon. Three novel algo-
rithms are proposed: LTR feature finding algorithm, non-LTR feature finding algo-
rithm and recursive gap filling algorithm. We used two feature finding algorithms to
find the pairs of seeds of retrotransposon, then we apply recursive gap filling algo-
rithm to assemble the retrotransposon.
We compared our DTA method with three existing methods, i.e. Velvet, ABySS2
and SOAPdenovo. The comparison showed that our DTA method outperformed oth-
er three method on a benchmark dataset. We also use our RA method to assemble
nucleolus-associated domains which is a genome wide dataset. Downstream analysis
on assembled sequence indicates we have successfully assembled 203637 retrotrans-
posons.
129
Chapter 6
6 Conclusion and Future work
This chapter reviews the whole thesis and summarize its main contribution. Further-
more, we discuss the potential research topics for the future work.
6.1 Contributions
This thesis proposes the methods of assembling repetitive DNA sequence using
next generation sequencing reads. Theoretical and experimental results have led to
the conclusion and main contribution of this thesis. These are:
Repeats are playing crucial roles in genome evolution and they are be-
lieved to be associated with diseases. To successfully assemble these re-
gion will help geneticists conduct downstream analysis. This thesis ad-
dressed such requirement. We reviewed existing method of sequence as-
sembly in chapter 2. Previous methods focus on genome wide assembly
while our proposed methods are focusing on assembling repetitive DNA
sequences. In this thesis, we proposed three methods to assemble three dif-
ferent types of repetitive DNA sequences, i.e. tandem repetitive DNA se-
quence, DNA transposon and retrotransposon.
We proposed a novel method to assemble tandem repetitive sequence. Eu-
karyotic genomes contain high volumes of intronic and intergenic regions
130
in which repetitive sequences are abundant. These repetitive sequences
represent challenges in genomic assignment of short read sequences gen-
erated through next generation sequencing and are often excluded in anal-
ysis losing invaluable genomic information. We presented a method in
chapter 3, known as TRA, for the assembly of tandem repetitive sequences
using paired-end read. In previous methods, tandem repetitive sequences
are not correctly assembled due to unsuitable strategy. By selecting out all
related reads to a tandem repat, our proposed method can dynamically as-
semble the sequence until we get a static result.
We studied the classification of transposable elements. There are two
types of transposable elements, i.e. DNA transposon and retrotransposon.
We proposed a novel method to assemble DNA transposon in chapter 4.
We investigated the structure of DNA transposon and use structural fea-
ture to select out pairs of seeds which represent the head and tail of DNA
transposons. We also proposed a read quality based method which extends
the head seed to tail seed. Performance comparison on benchmark dataset
shows our DTA method is superior to other three existing methods on as-
sembling DNA transposon
We further proposed a novel method to assemble retrotransposon in chap-
ter 5. There are two types of retrotransposon, i.e. LTR type retrotranspos-
on and non-LTR type retrotransposon. We studied the structures of both
types of retrotransposon. For each type of retrotransposon, we have meth-
ods to select out pairs of seeds which represent head seed and tail seed of
retrotransposons. We then use a recursive algorithm to find a path between
head seed and tail seed from simple paths which are generated from de
Bruijn graph. We tested our RA method on a genome wide dataset which
contains NGS reads of nucleolus associated domains. The assembled re-
sults real that our RA method has the ability to assemble repeats rich re-
gions.
131
6.2 Future works
Although the proposed methods have some success on assembling repetitive DNA
sequences, there remains several problems that need to be addressed.
De novo assembly of tandem repeats: Our proposed method of assembling
tandem repetitive DNA sequence is based on a repeat pattern which was
previously revealed. In other words, we need a reference sequence to start
assembly process. This is suitable to the species which are previously as-
sembled and perfect for resequencing project. However, for the sequenc-
ing project on new species which have no reference, we still need a meth-
od to assemble tandem repetitive sequences.
Further investigation on non-LTR type retrotransposon: non-LTR type re-
trotransposon can be further classified into short interspersed nuclear ele-
ments (SINE) and long interspersed nuclear elements (LINE). In terms of
length, SINEs are from 50-500 base pairs long and LINEs are about 7000
base pairs long. In this thesis, we have not answered the question of what
assembly strategy should be applied to SINE and LINE. Our work at chap-
ter 5 is still at the very beginning stages.
To combine repeat assembly modules in whole genome assembly: Whole
genome assembly methods always failed at assembling repetitive DNA
sequences while our proposed methods are targeting on assembling differ-
ent types of repetitive DNA sequences. So the problem remains on how to
combine two aspects together into a model which is able to assemble
whole genome sequences with good performance on repeat regions.
Downstream analysis on assembled repetitive DNA sequences: In this the-
sis, we attempted several studies on assembled results. For example, varia-
tion validation in chapter 3 and chapter 4. However, these analyses are still
at the very beginning stages.
132
Bibliography
[1] Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-
terminating inhibitors, Proceedings of the National Academy of Sciences of the
USA, 74, 12, 5463-5467
[2] Collins, F.S., Morgan, M. and Patrinos, A. (2003) The Human Genome Project:
lessons from large-scale biology, Science, 300, 5617, 286-290
[3] Campagna, D., Romualdi, C., Vitulo, N., Favero, M.D., Lexa, M., Cannata, N.
and Valle, G. (2005) RAP: a new computer program for de novo identification
of repeated sequences in whole genomes, Bioinformatics, 21, 5, 582-588
[4] Feschotte, C., Jiang, N. and Wessler, S.R. (2002) Plant transposable elements:
where genetics meets genomics, Nature Review Genetics, 3, 329-341
[5] Treangen, T.J. and Salzberg, S.L. (2012) Repetitive DNA and next-generation
sequencing: computational challenges and solutions, Nature Review Genetics,
13, 36-46
[6] Lefebvre, A., Lecroq, T., Dauchel, H. and Alexandre, J. (2003) FORRepeats:
detects repeats on entire chromosomes and between genomes, Bioinformatics,
19, 3, 319-326
[7] Fujita, P.A., Rhead, B., Zweig, A.S., Hinrichs, A.S., Karolchik, D., Cline, M.S.,
Goldman, M., Barber, G.P., Clawson, H., Coelho, A., Hillman-Jackson, J., Hsu,
F., Kirkup, V., Kuhn, R.M., Learned, K., Li, C.H., Meyer, L.R., Pohl, A.,
Raney, B.J., Rosenbloom, K.R., Smith, K.E., Haussler, D. and Kent, W.J.
133
(2011) The UCSC Genome Browser database: update 2011, Nucleic Acids Re-
search, 39: D876-D882
[8] Alkan, C., Sajjadian, S. and Eichler, E.E. (2011) Limitation of next-generation
genome sequence assembly, Nature Methods, 8, 1, 61-65
[9] Sanger, F., Nicklen, S. and Coulson, A.R. (1977). DNA sequencing with chain-
terminating inhibitors, Proceedings of the National Academy of Sciences of the
USA, 74, 12, 5463-5467
[10] Anderson, S. (1981) Shotgun DNA sequencing using cloned DNase l-generated
fragments, Nucleic Acids Research, 9 (13): 3015-27
[11] Altshuler, D., Pollara, V., Cowles, C., Van, E.W., Baldwin, J., Linton, L. and
Lander, E. (2000), An SNP map of the human genome generated by reduced
representation shotgun sequencing, Nature, 407, 513-516
[12] Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos,
M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., Kerlavage, A.R,
McCombie, W.R. and Venter, J.C. (1991) Complementary DNA sequencing:
expressed sequence tags and human genome project. Science, 252, 1651-1656
[13] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben,
L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L.,
Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Ir-
zyk, G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.B.,
Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J.,
Lohman, K.L., Lu, H., Makhijani, V.B., McDade, K.E., Mckenna, M.P., Myers,
E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth, G.T.,
Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R., To-
masz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P.,
134
Yu, P., Begley, R.F. and Rothberg J.M. (2005) Genome sequencing in mir-
crofabricated high-density politer reactors, Nature, 437, 376-380
[14] Bentley, D.R., Balasubramanian, S., Swerdlow, H.P., Smith, G.P., Milton, J.,
Brown, C.G., Hall, K.P., Evers, D.J., Barnes, C.L., Bignell, H.R. Boutell, J.M.,
Bryant, J., Carter, R.J., Cheetham, R.K., Cox, A.J., Ellis, D.J., Flatbush, M.R.,
Gormley, N.A., Humphray, S.J., Irving, L.J., Karbelashvili, M.S., Kirk, S.M.,
Li, H., Liu, X., Maisinger, K.S., Murray, L.J., Obradovic, B., Ost, T., Parkinson,
M.L., Pratt, M.R., Rasolonjatovo, I.M.J., Reed, M.T., Rigatti, R., Rodighiero,
C., Ross, M.T., Sabot, A., Sankar, S.V., Scally, A., Schroth, G.P., Smith, M.E.,
Smith, V.P., Spiridu, A., Torrance, P.E., Tzonev, S.S., Vermaas, E.H., Walter,
K., Wu, X., Zhang, L., Alam, M.D., Anastasi, C., Aniebo, I.C. Bailey, D.M.D.,
Bancarz, I.R., Banerjee, S., Barbour, S.G., Baybayan, P.A., Benoit, V.A., Ben-
son, K.F., Bevis, C., Black, P.J., Boodhun, A., Brennan, J.S., Bridgham, J.A.,
Brown, R.C., Brown, A.A., Buermann, D.H., Bundu, A.A., Burrows, J.C.,
Carter, N.P., Castillo, N., Catenazzi, M.C.E., Chang, S., Cooley, R.N., Crake,
N.R., Dada, O.O., Diakoumakos, K.D., Dominguez-Fernandez, B., Earnshaw,
D.J., Egbujor, U.C., Elmore, D.W., Etchin, S.S., Ewan, M.R., Fedurco, M., Fra-
ser, L.J., Fajardo, K.V.F., Furey, W.S., George, D., Gietzen, K.J., Goddard,
C.P., Golda, G.S., Granieri, P.A., Green, D.E., Gustafson, D.L., Hansen, N.F.,
Harnish, K., Haudenschild, C.D., Heyer, N.I., Hims, M.M., Ho, J.T., Horgan,
A.M., Hoschler, K., Hurwitz, S., Ivanov, D.V., Johnson, M.Q., James, T., Jones,
T.A.H., Kang, G.D., Kerelska, T.H., Kersey, A.D., Khrebtukova, I., Kindwall,
A.P., Kingsbury, Z., Kokko-Gonzales, P.I., Kumar, A., Laurent, M.A., Lawley,
C.T., Lee, S.E., Lee, X., Liao, A.K., Loch, J.A., Lok, M., Luo, S., Mammen,
R.M., Martin, J.W., McCauley, P.G., McNitt, P., Mehta, P., Moon, K.W., Mul-
lens, J.W., Newington, T., Ning, Z., Ng, B.L., Novo, S.M., O’Neill, M.J., Os-
borne, M.A., Osnowski, A., Ostadan, O., Paraschos, L.L., Pickering, L., Pike,
A.C., Pike, A.C., Pinkard, D.C., Pliskin, D.P., Podhasky, J., Quijano, V.J.,
Raczy, C., Rae, V.H., Rawlings, S.R., Rodriguez, A.C., Roe, P.M., Rogers, J.,
Bacigalupo, M.C.R., Romanov, N., Romieu, A., Roth, R.K., Rourke, N.J., Rue-
diger, S.T., Rusman, E., Sanches-Kuiper, R.M., Schenker, M.R., Seoane, J.M.,
135
Shaw, R.J., Shiver, M.K., Short, S.W., Sizto, N.L., Sluis, J.P., Smith, M.A.,
Sohna, J.E.S., Spence, E.J., Stevens, K., Sutton, N., Szajkowski, L., Tregidgo,
C.L. Turcatti, G., VandeVondele, S., Verhovsky, Y., Virk, S.M., Wakelin, S.,
Walcott, G.C., Wang, J., Worsley, G.J., Yan, J., Yau, L., Zuerlein, M., Rogers,
J., Mullikin, J.C., Hurles, M.E., McCooke, N.J., West, J.S., Oaks, F.L.,
Lundberg, P.L., Klenerman, D., Durbin, R. and Smith A.J. (2008) Accurate
whole human genome sequencing using reversible terminator chemistry, Na-
ture, 456(7218):53-59
[15] McKernan, K.J., Peckham, H.E., Costa, G.L., McLaughlin, S.F., Fu, Y., Tsung,
E.F., Clouser, C.R., Duncan, C., Ichikawa, J.K., Lee, C.C., Zhang, Z., Ranade,
S.S., Dimalanta, E.T., Hyland, F.C., Sokolsky, T.D., Zhang, L., Sheridan, A.,
Fu, H., Hendrickson, C.L., Li, B., Kotler, L., Stuart, J.R., Malek, J.A., Maning,
J.M., Antipova, A.A., Perez, D.S., Moore, M.P., Hayashibara, K.C., Lyons,
M.R., Beaudoin, R.E., Coleman, B.E., Laptewicz, M.W., Sannicandro, A.E.,
Rhodes, M.D., Gottimukkala, R.K., Yang, S., Bafna, V., Bashir, A., MacBride,
A., Alkan, C., Kidd, J.M., Eichler, E.E., Reese, M.G., Vega, F.M.D.L and
Blanchard, A.P. (2009) Sequence and structural variation in a human genome
uncovered by short-read, massively parallel ligation sequencing using two-
based encoding, Genome Research, 19, 1527-1541
[16] Rothberg, J.M., Hinz, W., Rearick, T.M., Schultz, j., Mileski, W., Davey, M.,
Leamon, J.H., Johnson, k., Milgrew, M.J., Edwards, M. Hoon, J., Simons, J.F.,
Marran, D., Myers, J.W., Davidson, J.F., Branting, A., Nobile, J.R., Puc, B.P.,
Light, D., Clark, T.A., Huber, M., Branciforte, J.T., Stoner, I.B., Cawley, S.E.,
Lysons, M., Fu, Y., Homer, N., Sedova, M., Miao, X., Reed, B., Sabina, J.,
Feierstein, E., Schorn, M., Alanjary, M., Dimalanta, E., Dressman, D., Kasin-
skas, R., Sokolsky, T., Fidanza, J.A., Namsaraev, E., McKernan, K.J., Wil-
liamns, A., Roth, G.T. and Bustillo, J. (2011) An integrated semiconductor de-
vice enabling non-optical genome sequencing, Nature, 475(7356): 348-352
136
[17] Eid, J., Fehr, A., Gray, J., Luong, K., Lyle, J., Otto, G., Peluso. P., Rank, D.,
Baybayan, P., Bettman, B., Bibillo, A., Bjornson, K., Chaudhuri, B., Christians,
F., Cicero, R., Clark, S., Dalal, R., DeWinter, A., Dixon, J., Foquet, M.,
Gaertner, A., hardenbol, P., Heiner, C., Hester, K., Holden, D., Kearns, X.K.,
Kuse, R., Lacroix, Y., Lin, S., Lundquist, P., Ma, C., Marks, P., Maxham, M.,
Murphy, D., Park, I., Pham, T., Phillips, M., Roy, J., Sebra, R., Shen, G.,
Sorenson, J., Tomaney, A., Travers, K., Trulson, M., Vieceli, J., Wegener, J.,
Wu, D., Yang, A., Zaccarin, D., Zhao, P., Zhong, F., Korlach, J. and Turner, S.
(2009) Real-time DNA sequencing from single polymerase molecules, Science,
323(5910): 133-138
[18] Harismendy, O., Ng, P.C., Strausberg, R.L., Wang, X., Stock, T.B., Beeson,
K.Y., Schork, N.J., Murray, S.S., Topol, E.J., Levy, S. and Frazer, K.A. (2009)
Evaluation of next generation sequencing platforms for population targeted se-
quencing studies, Genome Biology, 10: R 32
[19] Bonetta, L. (2006) Genome sequencing in the fast lane, Nature Method, 3, 2,
141-147
[20] Weigel, D. and Mott, R. (2009) The 1001 genomes project for Arabidopsis tha-
liana, Genome Biology, 10, 107
[21] The 1000 Genomes Project Consortium, A map of human genome variation
from population-scale sequencing, Nature, 467, 1061-1073
[22] Genome 10K Community of Scientists, Genome 10K: a proposal to obtain
whole-genome sequencer for 10,000 vertebrate species, Journal of Heredity,
100, 659-674
[23] Staden, R. (1979) A strategy of DNA sequencing employing computer pro-
grams, Nucleic Acids Research, 6(7): 2601-10
137
[24] Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L. and Law, M.
(2012) Comparison of Next-Generation Sequencing Systems, Journal of Bio-
medicine and Biotechnology, 251364
[25] Berka, J., Chen, Y.J., Leamon, J.H., Lefkowitz, S., Lohman, K.L., Makhijani,
V.B., Rothberg, J.M., Sarkis, G.J., Srinivasan, M. and Weiner, M.P. (2005)
Bead emulsion nucleic acid amplification, U.S. Patent Application
[26] Foehlich, T., Heindl, D. and Roesler, A. (2010) High-throughput nucleic acid
analysis, U.S. Patent
[27] Ansorge, W.J. (2009) Next-generation DNA sequencing techniques, New Bio-
technology, 25, 4, 195-203
[28] Mardis, E.R. (2008) the impact of next-generation sequencing technology on
genetics, Trends in Genetics, 24, 3, 133-141
[29] Quail, M.A., Smith, M., Coupland, P., Otto, T.D., Harris, S.R., Connor, T.R.,
Bertoni, A., Swerdlow, H.P. and Gu, Y. (2012) A tale of three next generation
sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illu-
mina MiSeq sequencers, BMC Genomics, 13, 341
[30] Bubnoff, A.V. (2008) Next-Generation Sequencing: The Race Is On, Cell, 132:
721-723
[31] Illumina (2010) Genomic Sequencing, Pub. No. 770-2008-016
[32] Kit, S. (1961) Equilibrium sedimentation in density gradients of DNA prepara-
tion from animal tissues, Journal of Molecular Biology, 3, 711-716
[33] McClintock, B. (1965) Components of action of the regulators Spm and Ac,
Camegie Inst. Wash. Yearb. 64, 527-534
138
[34] Britten, R.J. and Kohne, D.E. (1968) Repeated sequences in DNA, Science, 161,
529-540
[35] La Spada, A.R., Wilson, E.M., Lubahn, D.B., Harding, A.E. and Fischbeck,
K.H. (1991) Androgen receptor gene mutations in X-linked spinal and bulbar
muscular atrophy, Nature, 352, 77-79
[36] Bennetzen, J.L. (2000) Transposable element contributions to plant gene and
genome evolution, Plant Molecular Biology, 42, 251-269
[37] Dooner, H.K. and Weil, C.F. (2007) Give-and-take: interactions between DNA
transposons and their host plant genomes, Curr. Opin. Genet. Dev., 17, 486-492
[38] Lippman, Z., Gendrel, A.V., Black, M., Vaughn, M.W., Dedhia, N., McCombie,
W.R., Lavine, K., Mittal, V., May, B., Kasschau, K.D., Carrignton, J.C., Do-
erge, R.W., Colot, V. and Martienssen, R. (2004) Role of transposable elements
in heterochromatin and epigenetic control, Nature, 430, 471-476 Ouyang,S.
Buell,CR. (2004) The TIGR Plant Repeat Databases: a collective resource for
the identification of repetitive sequences in plants, Nucleic Acids Research, 32,
1: D360-D363
[39] Kumar, P., Chaitanya, P.S. and Nagarajaram, H.A. (2011) PSSRdb: a relational
database of polymorphic simple sequence repeats extracted from prokaryotic
genomes, Nucleic Acids Research, 39: D601-D605
[40] Subramanian, S., Madgula, V.M., George, R., Kumar, S., Pandit, M.W. and
Singh, L. (2003) SSRD: Simple Sequence Repeats Database of the Human Ge-
nome, Comparative and Functional Genomics, 4, 3, 342-345
[41] Gelfand, Y., Rodriguez, A. and Benson, G. (2007) TRDB: The Tandem Repeats
Database, Nucleic Acids Research, 35: D80-D87
139
[42] Nussbaumer, T., Martis, M.M., Roessner, S.K., Pfeifer, M., Bader, K.C., Shar-
ma, S., Gundlach, H. and Spannagl, M. (2013) MIPS PlantsDB: a database
framework for comparative plant genome research, Nucleic Acids Research, 41:
D1144-D1151
[43] Keller, B., Wicker, T. and Matthews, D.E. (2003) TREP: a database for Tritice-
ae repetitive elements, Trends in plant science, 7, 561-562
[44] Macas, J., Meszaros, T. and Nouzova, M. (2002) PlantSat: a specialized data-
base for plant satellite repeats, Bioinformatics, 18: 28-35
[45] Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O. and
Walichiewicz, J. (2005) Repbase Update, a database of eukaryotic repetitive
elements. Cytogenet Genome Res, 110, 462-467.
[46] Smit et al. www.repeatmasker.org
[47] Li, R., Ye, J., Li, S., Wang, J., Han, Y, Ye, C., Wang, J., Yang, H., Yu, J.,
Wong, G.K.S. and Wang, J. (2005) ReAS: recovery of ancestral sequences for
transposable elements from the unassembled reads of a whole genome shotgun.
PLos Computational Biology, doi: 10.1007/s12042-007-9007-5
[48] Price, A.L., Jones, N.C. and Pevzner, P.A. (2005) De novo identification of re-
peat families in large genomes, Bioinformatics, 21, i351-i358
[49] Surya, S., Bridges, S., Magbanua, Z.V. and Peterson, D.G. (2008) Empirical
comparison of ab initio repeat finding programs, Nucleic Acids Research, 36, 7,
2284-2294
140
[50] Sura, S., Bridges, S., Magbanua, Z.V. and Peterson, D.G. (2008) Computational
Approaches and Tools Used in Identification of Dispersed Repetitive DNA Se-
quence, Tropical Plant Biology, 1: 85-96
[51] Bergman, C.M. and Quesneville, H. (2007) Discovering and detecting transpos-
able elements in genome sequences, Briefings in Bioinformatics, 8(6): 382-392
[52] Benson, D.A. et al. (2013) GenBank, Nucleic Acids Research, 41: D36-D42
[53] Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X.,
Liu, B., Yang, B. and Fan, W. (2011) Comparison of the two major classes of
assembly algorithms: overlap-layout-consensus and de-bruijn-graph, Briefings
In Functional Genomics, 11,1, 25-37
[54] Salzberg, S.L., Phillippy, A.M., Zimin, A., Puiu, D., Magoc, T., Koren, S.,
Treangen, T.J., Schatz, M.C., Delcher, A.L., Roberts, M., Marcais, G., Pop, M.
and Yorke, J.A. (2012) GAGE: A critical evaluation of genome assemblies and
assembly algorithms, Genome Research, 22: 557-567
[55] Paszkiewicz, K. and Studholme, D.J. (2010) De novo assembly of short se-
quence reads, Briefings In Bioinformatics, 11, 5, 457-472
[56] Zhang, W., Chen, J., Yang, Y., Tang, Y., Shang, J. and Shen, B. (2011) A Prac-
tical Comparison of De Novo Genome Assembly Software Tolls for Next-
generation Sequencing Technologies, PLoS ONE, 6(3): e17915.
[57] Scheibye-Alsing, K., Hoffmann, S., Frankel, A., Jensen, P., Stadler, P.F., Mang,
Y., Tommerup, N., Gilchrist, M.J., Nygard, A.B., Jorgensen, C.B., Fredholdm,
M. and Gorodkin, J. (2009) Sequence assembly, Computational Biology and
Chemistry, 33, 121-136
141
[58] Staden, R. (1980) A new computer method for the storage and manipulation of
DNA gel reading data, Nucleic Acid Research, 8(16): 3673-3694
[59] Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Ber-
ger, B., Mesirov, J.P. and Lander, E.S. (2002) ARACHNE: a whole-genome
shotgun assembler, Genome Research, 12: 177-189
[60] Havlak, P., Chen, R., Durbin, K.J., Egan, A., Ren, Y., Song, X.Z., Weinstock,
G.M. and Gibbs, R.A. (2004) The Atlas genome assembly system, Genome Re-
search, 14: 721-732
[61] Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan,
M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H.J., Remington, K.A., Anson,
E.L., Bolanos, R.A., Chou, H.H., Jordan, C.M., Halpern, A.L., Lonardi, S.,
Beasley, E.M., Brandon, R.C., Chen, L., Dunn, P.J., Lai, Z., Liang, Y., Nussker,
D.R., Zhan, M., Zhang, Q., Zheng, X., Rubin, G.M., Adams, M.D. and Venter,
J.C. (2000) A whole-genome assembly of Drosophila, Science, 287: 2196-2204
[62] Huang, X. and Madan, A. (1999) CAP3: A DNA sequence assembly program,
Genome Research, 9: 868-877
[63] Pevzner, P.A., Tang, H. and Tesler, G. (2004) De novo repeat classification and
fragment assembly, Genome Research, 14: 1786-1796
[64] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben,
L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., Dewell, S.B., Du, L.,
Fierro, J.M., Gomes, X.V., Godwin, B.C., He, W., Helgesen, S., Ho, C.H., Ir-
zyk. G.P., Jando, S.C., Alenquer, M.L.I., Jarvie, T.P., Jirage, K.B., Kim, J.B.,
Knight, J.R., Lanza, J.R., Leamon, J.H., Lefkowitz, S.M., Lei, M., Li, J.,
Lohman, K.L., Lu, H., Makhijani, V.B., McDade,, K.E., McKenna, M.P., My-
ers, E.W., Nickerson, E., Nobile, J.R., Plant, R., Puc, B.P., Ronan, M.T., Roth,
G.T., Sarkis, G.J., Simons, J.F., Simpson, J.W., Srinivasan, M., Tartaro, K.R.,
142
Tomasz, A., Vogt, K.A., Volkmer, G.A., Wang, S.H., Wang, Y., Weiner, M.P.,
Yu, P., Begley, R.F. and Rothberg, J.M. (2005) Genome sequencing in open
microfabricated high density picoliter reactors, Nature, 437: 376-380
[65] Huang, X., Wang, J., Aluru, S., Yang S.P. and Hillier, L. (2003) PCAP: a
whole-genome assembly program, Genome Research, 13: 2164-2170
[66] Mullikin, J.C. and Ning, Z. (2003) The phusion assembler, Genome Research,
13: 81-90
[67] Bastide, M.D.L and McCombie, W.R. (2007) Assembling genomic DNA se-
quences with PHRAP, Curr Protoc Bioinformatics, Chapter 11, Unit 11.4
[68] Wang, J., Wong, G.K.S., Ni, P., Han, Y., Huang, X., Zhang, J., Ye, C., Zhang,
Y., Hu, J., Zhang, K., Xu, X., Cong, L., Lu, H., Ren, X., Ren, X., He, J., Tao,
L., Passey, D.A., Wang, J., Yang, Y., Yu, J. and Li, S. (2002) RePS: a sequence
assembler that masks exact repeats identified from the shotgun data, Genome
Research, 12: 824-831
[69] Pevzner, P.A., Tang, H. and Waterman, M.S. (2001) An Eulerian path approach
to DNA fragment assembly, PNAS, 98, 17, 9748-9753
[70] Pevzer, P.A. and Tang, H. (2001) Fragment assembly with double-barreled data,
Bioinformatics, 17:S225-233
[71] Butler, J. et al. (2008) ALLPATHS: de novo assembly of whole-genome shot-
gun microreads, Genome Research, 18: 810-820
[72] Maccallum, I., Przybylski D., Gnerre, S., Burton, J., Shlyakhter, I., Gnirke, A.,
Malek, J., McKernan, K., Ranade, S., Shea, T.P., Williams, L., Young, S., Nus-
baum C. and Jaffe, D.B. (2009) ALLPATHS 2: small genomes assembled accu-
143
rately and with high continuity from short paired reads, Genome Biology, 10:
R103
[73] Gnerre, S., Maccallum, I., Przybylski, D., Ribeiro, F.J., Burton, J.N., Walker,
B.J., Sharpe, T., Hall, G., Shea, T.P., Sykes, S., Berlin, A.M., Aird, D., Costello,
M., Daza, R., Williams, L., Nicol, R., Gnirke, A., Nusbaum, C., Lander, E.S.
and Jaffe, D.B. (2011) High-quality draft assemblies of mammalian genomes
from massively parallel sequence data. Proc Natl Acad Sci U S A, 108, 1513-
1518.
[74] Zerbino, D.R. and Birney, E. (2008) Velvet: algorithms for de novo short read
assembly using de Bruijn graphs, Genome Research, 18: 810-820
[75] Simpson, J.T. Wong, K., Jackman, S.D., Schein, J.E. Jones, S.J.M. and Birol, I.
(2009) ABySS: A parallel assembler for short read sequence data. Genome Re-
search, 19: 1117-1123
[76] Pevzner, P.A., Tang, H. and Tesler, G. (2004) De novo repeat classification and
fragment assembly, Genome Research, 14: 1786-1796
[77] Li, R., Zhu, H., Ruan, J., Qian, W., Fang, X., Shi, Z., Li, Y., Li, S., Shan, G.,
Kristiansen, K., Li, S., Yang, H., Wang, J. and Wang J (2010) De novo
assembly of human genomes with massively parallel short read sequencing.
Genome Res, 20, 265-272.
[78] Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q.,
Liu, Y., Tang, J., Wu, G., Zhang, H., Shi, Y., Liu, Y., Yu, C., Wang, B., Lu, Y.,
Han, C., Cheung, D.W., Yiu, S.M., Peng, S., Zhu, X., Liu, G., Liao, X., Li, Y.,
Yang, H., Wang, J., Lam, T.W. and Wang, J. (2012) SOAPdenovo 2: an empiri-
cally improved memory-efficient short-read de novo assembler, GigaScience, 1:
18
144
[79] Conway, T.C. and Bromage, A.J. (2011) Succinct data structure for assembling
large genomes, Bioinformatics, 27: 479:486
[80] Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B.,
Bai, Y., Zhang, Z., Zhang, Y., Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J.,
Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., Ryder, O.A., Leung,
F.C.C., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou,
R., Shen, F., Mu, B., Ni., R., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W.,
Wang, J., WU, Z., Liang, H., Min, J., Wu, Q., Cheng, S., Ruan, J., Wang, M.,
Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G.,
Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C.C., Lam,
T.T.T., Lin, S., Zhang, Q., Li, G., Tian, J., Gong, T., Liu, H., Zhang, D., Fang,
L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., Bruford, M.W., Li,
Q., Ma, L., Guo, Y., Na, A., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen,
Y., Zhao, J., Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X.,
Vinar, T., Wang, Y., Lam, T.W., Yiu, S.M., Liu, S., Zhang, H., Li, D., Huang,
Y., Wang, X., Yang, G., Jiang, Z., Wang. J., Qin, N., Li, L., Li, J., Bolund, L.,
Kristiansen, K., Wong, G.K.S. Olson, M., Zhang, X., Li, S., Yang, H. and
Wang, J. (2010) The sequence and de novo assembly of the giant panda ge-
nome, Nature, 2010, 463: 311-317
[81] Huang, S., Li, R., Zhang, Z., Li, L., Gu, X., Fan, W., Lucas, W.J., Wang, X.,
Xie, B., Ni, P., Ren, Y., Zhu, H., Li, J., Lin, K., Jin, W., Fei, Z., Li, G., Staub, J.,
Kilian, A., Vossen, E.A.G.V.D., Wu, Y., Guo, J., He, J., Jia, Z., Ren, Y., Tian,
G., Lu, Y., Ruan, J., Qian, W., Wang, M., Huang, Q., Li, B., Xuan, Z., Cao, J.,
Asan, Wu, Z., Zhang, J., Cai, Q., Bai, Y., Zhao, B., Han, Y., Li, Y., Li, X.,
Wang, S., Shi, Q., Liu, S., Cho, W.K., Kim, J., Xu, Y., Heller-Uszynska, K.,
Miao, H., Cheng, Z., Zhang, S., Wu, J., Yang, Y., Kang, H., Li, M., Liang, H.,
Ren, X., Shi, Z., Shi, Z., Wen, M., Jian, M., Yang, H., Zhang, G., Yang, Z.,
Chen, R., Liu, S., Li, J., Ma, L., Liu, H., Zhou, Y., Zhao, J., Fang, X., Li, G.,
Fang, L., Li, Y., Liu, D., Zheng, H., Zhang, Y., Qin, N., Li, Z., Zhang, X.,
Yang, H., Wang, J., Sun, R., Zhang, B., Jiang, S., Wang, J, Du, Y. and Li, S
145
(2009) The genome of the cucumber, Cucumis sativus L. Nature Genetics, 41:
1275-1281
[82] Warren, R.L., Sutton, G.G., Jones, S.J. and Holt, R.A. (2007) Assembling mil-
lions of short DNA sequences using SSAKE, BioinfoBrmatics, 23: 500-501
[83] Dohm, J.C., Lottaz, C., Borodina, T. and Himmelbauer, H. (2007) SHARCGS, a
fast and highly accurate short-read assembly algorithm for de novo genomic se-
quencing, Genome Research, 23: 2942-2944
[84] Jeck, W.R., Reinhardt, J.A., Baltrus, D.A., Hickenhotham, M.T., Magrini, V.,
Mardis, E.R., Dangl, J.L. and Jones, C.D. (2007) Extending assembly of short
DNA sequences to handle error, Bioinformatics, 23:2942-2944
[85] Bryant, D.W., Wong, W.K. and Mockler, T.C. (2009) QSRA: a quality-value
guided de novo short read assembler, BMC Bioinformatics, 10: 69
[86] Simpson, J.T. and Durbin, R. (2012) Efficient de novo assembly of large ge-
nomes using compressed data structures, Genome Research, 22: 549-556
[87] Boisvert, S., Laviolette, F. and Corbeil, J. (2010) Ray: Simultaneous Assembly
of Reads from Mix of High-Throughput Sequencing Technologies, Journal of
Computational Biology, 17, 11, 1519-1533
[88] Schmidt, B., Sinha, R., Smith, B.B. and Puglisi, S.J. (2009) A fast hybrid short
read fragment assembly algorithm, Bioinformatics, 25, 17, 2279-2280
[89] Boetzer, M., Henkel, C.V., Jansen, H.J., Butler, D. and Pirovano, W. (2011)
Scaffolding pre-assembled contigs using SSPACE, Bioinformatisc, 27: 578-579
146
[90] Zerbino, D.R., McEwen, G.K., Margulies, E.H. and Birney, E. (2009) Pebble
and rock band: heuristic resolution of repeats and scaffolding in the velvet
short-read de novo assembler, PLoS One, 4: e8407
[91] Dayarian, A., Michael, T.P. and Sengupta, A.M. (2010) SOPRA: scaffolding
algorithm for paired reads via statistical optimization, BMC Bioinformatics, 11:
345
[92] Koren, S., Miller, J.R., Walenz, B. and Sutton, G. (2010) An algorithm for au-
tomated closure during assembly, BMC Bioinformatics, 11: 457
[93] Tsai, I.J., Otto, T.D. and Berriman, M. (2010) Improving draft assemblies by
iterative mapping and assembly of short reads to eliminate gaps, Genome Biolo-
gy, 11: R41
[94] Chaisson, M.J., Brinza, D. and Pevzner, P.A. (2009) De novo fragment assem-
bly with short mate-paired reads: Does the read length matter? Genome Re-
search, 19: 336-346
[95] Earl, D., Bradnam, K., John, J.S., Darling, A., Lin, D., Fass, J., Yu, H.O.K.,
Buffalo, V., Zerbino, D.R., Diekhans, M., Nguyen, N., Ariyaratne, P.N., Sung,
W.K., Ning, Z., Haimel, M., Simpson, J.T., Fonseca, N.A., Birol, I., Docking,
T.R., Ho, I.Y., Rokhsar, D.S., Chikhi, R., Lavenier, D., Chapuis, G., Naquin, D.,
Maillet, N., Schatz, M.C., Kelley, D.R., Phillippy, A.M., Koren, S., Yang, S.P.,
Wu, W., Chou, W.C., Srivastava, A., Shaw, T.I., Ruby, J.G., Skewes-Cox, P.,
Betegon, M., Dimon, M.T., Solovyev, V., Seledtsov, I., Kosarev, P., Vorobyev,
D., Ramirez-Gonzalez, R., Leggett, R., Maclean, D., Xia, F., Luo, R., Li, Z.,
Xie, Y., Liu, B., Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F.J., Yin,
S., Sharpe, T., Hall, G., Kersey, P.J., Durbin, R., Jackman, S.D., Chapman, J.A.,
Huang, X., DeRisi, J.L., Caccamo, M., Li, Y., Jaffe, D.B., Green, R.E.,
Haussler, D., Korf, I. and Paten, B. (2011) Assemblathon 1: A competitive as-
147
sessment of de novo short read assembly methods, Genome Research, 21: 2224
– 2241.
[96] Parra, G., Bradnam, K., Ning, Z., Keane, T. and Korf, I. (2009) Assessing the
gene space in draft genomes, Nucleic Acids Research, 37: 289-297
[97] Schatz, M.C., Delcher, A.L. and Salzberg, S.L. (2 010) Assembly of large ge-
nomes using second-generation sequencing, Genome Research, 20: 1165-1173
[98] Bresler, M., Sheehan, S., Chan, A.H. and Song, Y.S. (2012) Telescoper: de no-
vo assembly of highly repetitive regions, Bioinformatics, 28: i311-i377.
[99] Pop, M. (2009) Genome assembly reborn: recent computational challenges,
Briefings in Bioinformatics, 10, 4, 354-366
[100] Myers, E.W. (2005) The fragment assembly string graph, Bioinformatics, 21:
ii79-ii85
[101] Metzker, M.L. (2010) Sequencing technologies - the next generation. Nat Rev
Genet, 11, 31-46.
[102] Miller, J.R., Koren, S. and Sutton, G. (2010) Assembly algorithms for next-
generation sequencing data. Genomics, 95, 315-327.
[103] El-Metwally, S., Hamza, T., Zakaria, M. and Helmy, M. (2013) Next-
generation sequence assembly: four stages of data processing and
computational challenges. PLoS Comput Biol, 9, e1003345.
[104] Ohno, S. (1972) So much "junk" DNA in our genome. Brookhaven Symp Biol,
23, 366-370.
148
[105] Doolittle, W.F. and Sapienza, C. (1980) Selfish genes, the phenotype
paradigm and genome evolution. Nature, 284, 601-603.
[106] Orgel, L.E. and Crick, F.H. (1980) Selfish DNA: the ultimate parasite. Nature,
284, 604-607.
[107] Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J.,
Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001) Initial
sequencing and analysis of the human genome. Nature, 409, 860-921.
[108] Rinn, J.L. and Chang, H.Y. (2012) Genome regulation by long noncoding
RNAs. Annu Rev Biochem, 81, 145-166.
[109] Guttman, M., Amit, I., Garber, M., French, C., Lin, M.F., Feldser, D., Huarte,
M., Zuk, O., Carey, B.W., Cassady, J.P. et al. (2009) Chromatin signature
reveals over a thousand highly conserved large non-coding RNAs in
mammals. Nature, 458, 223-227.
[110] Rollins, R.A., Haghighi, F., Edwards, J.R., Das, R., Zhang, M.Q., Ju, J. and
Bestor, T.H. (2006) Large-scale structure of genomic methylation patterns.
Genome Res, 16, 157-163.
[111] Smit, A.F. (1996) The origin of interspersed repeats in the human genome.
Curr Opin Genet Dev, 6, 743-748.
[112] Subramanian, S., Mishra, R.K. and Singh, L. (2003) Genome-wide analysis
of microsatellite repeats in humans: their abundance and density in specific
genomic regions. Genome Biol, 4, R13.
[113] Gemayel, R., Vinces, M.D., Legendre, M. and Verstrepen, K.J. (2010)
Variable tandem repeats accelerate evolution of coding and regulatory
sequences. Annu Rev Genet, 44, 445-477.
149
[114] O'Dushlaine, C.T., Edwards, R.J., Park, S.D. and Shields, D.C. (2005)
Tandem repeat copy-number variation in protein-coding regions of human
genes. Genome Biol, 6, R69.
[115] Mularoni, L., Guigo, R. and Alba, M.M. (2006) Mutation patterns of amino
acid tandem repeats in the human proteome. Genome Biol, 7, R33.
[116] McMurray, C.T. (2010) Mechanisms of trinucleotide repeat instability during
human development. Nat Rev Genet, 11, 786-799.
[117] MacDonald, M.E., et al. (1993) A novel gene containing a trinucleotide
repeat that is expanded and unstable on Huntington's disease chromosomes.
The Huntington's Disease Collaborative Research Group. Cell, 72, 971-983.
[118] La Spada, A.R., Wilson, E.M., Lubahn, D.B., Harding, A.E. and Fischbeck,
K.H. (1991) Androgen receptor gene mutations in X-linked spinal and bulbar
muscular atrophy. Nature, 352, 77-79.
[119] Hannan, A.J. (2010) Tandem repeat polymorphisms: modulators of disease
susceptibility and candidates for 'missing heritability'. Trends Genet, 26, 59-
65.
[120] Boland, C.R., Thibodeau, S.N., Hamilton, S.R., Sidransky, D., Eshleman, J.R.,
Burt, R.W., Meltzer, S.J., Rodriguez-Bigas, M.A., Fodde, R., Ranzani, G.N.
et al. (1998) A National Cancer Institute Workshop on Microsatellite
Instability for cancer detection and familial predisposition: development of
international criteria for the determination of microsatellite instability in
colorectal cancer. Cancer Res, 58, 5248-5257.
[121] Press, M.O., Carlson, K.D. and Queitsch, C. (2014) The overdue promise of
short tandem repeat variation for heritability. Trends Genet, 30, 504-512.
150
[122] Walsh, P.S., Fildes, N.J. and Reynolds, R. (1996) Sequence analysis and
characterization of stutter products at the tetranucleotide repeat locus vWA.
Nucleic Acids Res, 24, 2807-2812.
[123] Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan,
Q., Liu, Y. et al. (2012) SOAPdenovo2: an empirically improved memory-
efficient short-read de novo assembler. Gigascience, 1, 18.
[124] Gymrek, M., Golan, D., Rosset, S. and Erlich, Y. (2012) lobSTR: A short
tandem repeat profiler for personal genomes. Genome Res, 22, 1154-1162.
[125] Highnam, G., Franck, C., Martin, A., Stephens, C., Puthige, A. and Mittelman,
D. (2013) Accurate human microsatellite genotypes from high-throughput
resequencing data using informed error profiles. Nucleic Acids Res, 41, e32.
[126] Cao, M.D., Tasker, E., Willadsen, K., Imelfort, M., Vishwanathan, S.,
Sureshkumar, S., Balasubramanian, S. and Boden, M. (2014) Inferring short
tandem repeat variation from paired-end short reads. Nucleic Acids Res, 42,
e16.
[127] Noe, L. and Kucherov, G. (2005) YASS: enhancing the sensitivity of DNA
similarity search. Nucleic Acids Res, 33, W540-543.
[128] Koren, S., Treangen, T.J. and Pop, M. (2011) Bambus 2: scaffolding
metagenomes. Bioinformatics, 27, 2964-2971.
[129] Miller, J.R., Delcher, A.L., Koren, S., Venter, E., Walenz, B.P., Brownley, A.,
Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly
of pyrosequencing reads with mates. Bioinformatics, 24, 2818-2824.
151
[130] Benson, G. (1999) Tandem repeats finder: a program to analyze DNA
sequences. Nucleic Acids Res, 27, 573-580.
[131] Bao, Z. and Eddy, S.R. (2002) Automated de novo identification of repeat
sequence families in sequenced genomes. Genome Res, 12, 1269-1276.
[132] Koch, P., Platzer, M. and Downie, B.R. (2014) RepARK--de novo creation of
repeat libraries from whole-genome NGS reads. Nucleic Acids Res, 42, e80.
[133] Darzentas, N. (2010) Circoletto: visualizing sequence similarity with Circos.
Bioinformatics, 26, 2620-2621.
[134] Dear, S. and Standen, R. (1992) A standard file format for data from DNA
sequencing instruments, DNA Seq., 3(2): 107-10.
[135] Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1998) Base-calling of auto
mated sequencer traces using phred. I. Accuracy assessment, Genome
Research, 8 (3): 186-194.
[136] Ewing, B. and Green, P (1998) Base-calling of automated sequencer traces us
ing phred. II. Error probabilities, Genome Research, 8 (3): 186-194.
[137] Richterich, P. (1998) Estimation of errors in raw DNA sequences: a valida-
tion study, Genome Research, 8 (3): 251-259.
[138] Schneeberger, K., Ossowski, S., Ott, F., Klein, J.D., Wang, X., Lanz, C.,
Smith, L.M., Cao, J., Joffrey, F., Warthmann, N., Henz, S.R., Huson, D.H.
and Weigel, D. (2011) Reference-guided assembly of four diverse Arabidop-
sis thaliana ge nomes, PNAS, vol. 108 no. 25.
[139] Pelak, K., Shianna, K.V., Ge, D., Maia, J.M., Zhu, M., Smith, J.P., Cirulli,
E.T., Fellay, J., Dickson, S.P., Gumbs, C.E., Heinzen, E.L., Need, A.C., Ruz-
152
zo, E.K., Singh, A., Cambell, C.Y., Hong, L.K., Lornsen, K.A., McKenzie,
A.M., Sobreira, N,L,M., Hoover-Fong, J.E., Milner, J.D., Ottman, R., Haynes,
B.F., Goedert, J.J. and Goldstein, D.B. (2010) The characterization of twenty
sequenced human genomes. PLoS Genet 6:e1001111.
[140] Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M., and Yorke, J. A. (2004)
Reducing storage requirements for biological sequence comparison. Bioinfo
matics 20, 3363–3369.