UNIVERSIDADE DE LISBOA
FACULDADE DE CIÊNCIAS
DEPARTAMENTO DE INFORMÁTICA
Development of a Scalable, Precise, and
High-Coverage Genomics Mapping Tool for NGS
Natacha Alexandra Pinheiro Leitão
Supervised by Professor Doutor Francisco José Moreira Couto
and Professor Doutor João Carlos Antunes Leitão
MASTER'S DEGREE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
Specialisation in Bioinformatics
Dissertation
2015
"Science, my lad, is made up of mistakes, but they are mistakes which it is useful
to make, because they lead little by little to the truth."
- Jules Verne in "A Journey to the Center of the Earth"
Acknowledgements

First of all, I must thank Professor Doutor Francisco Couto, who not only agreed to supervise my master's work, but also inspired me to take risks, to put my capacity for learning and adaptation to the test, and to develop my sense of research. To Doutor João Leitão, whose co-supervision was fundamental to the completion of this project, I am grateful for his companionship and infinite patience. And, of course, I am very grateful to both for the motivation, confidence and availability with which they guided me through this journey which, although long and (at times) frustrating, taught me so much. I believe I will remember much of what they told me, and of what I learned from them, for the rest of my life.

I thank my family for their unconditional love, support and pride in me, for believing in my success regardless of my choices, and for having allowed and motivated me to get this far and to keep going further. I especially thank: my mother, for every tear she dried, for the hugs full of affection, above all in the moments when my confidence failed me (almost completely), and for the space and time she gave me; my sister Sofia, for putting up with me from near and far, for taking me out to unwind and have fun (as only she can), for helping me chase my goals, and for being "THE" older sister; my father, who even from a distance was present, for all the conversations full of affection, and for having taught me that "the machine is always right" (I grew up hearing this, and it helped me a great deal in this challenge); my grandmother Celeste, for her visit and for her company on my trip to Porto to present my work, and to whom I apologise for not having been more available; my godfather, for his affection and concern, and for the enlightening conversations about my past, present and future options and choices. And I could not fail to be grateful for the company of my cats, Milo and Gôdo, for not giving up on my affection, so often sacrificed (lately) to my unavailability and deprived of the pampering they deserve.

To my friends: I am sorry, I will find another drama to pester you with, but thank you, from the bottom of my heart, for helping me get through this one. I am especially grateful: to Cristina, for all her patience and time, for the lectures I needed to hear, and for loving me enough to keep putting up with me despite my "storms in a teacup"; to Flávia, for her company throughout the master's, inside and outside the faculty, and for all the words of support and motivation during my existential crises (whatever their cause); to Lara, for being my "rubber ducky", for her company during breaks filled with despair or with conversations about everything and nothing, for also having gone to Braga, and for letting me drag her along on walks; to Bruno, for helping me with my programming doubts, for the many "just finish it!" I heard so often, and for enduring my endless chatter by proxy; to Ana Cláudia, for patiently listening to me explain the project (and its problems), for the times she forced me out of my bubble to clear my head, and for all the affection she still has for me; to Mariana, for her friendship and support, which never wavered even when time was short, and for reading my essay-length text messages; to Raquel, for the walks and the excuses to escape the laboratory and my problems; to Ana Marta, for letting me take refuge in the C8 laboratory (and wash the glassware), for listening to me ramble while she worked, and for the conversations about cats (or, as we see them, our non-human children); to Rafael, for the mutual support in our shared despair; to Cíntia, for her availability, patience and help during my existential and self-confidence crises while writing; to Filipa, for her admiration and for motivating me, practically from the first day of our friendship, to finish the project because she wanted to attend my master's defence; and to Hugo and Ioana, for clarifying my doubts about NGS data analysis.

Finally, I thank my colleagues at XLDB, Cátia Pesquita, Hugo Bastos and João Ferreira, for welcoming me; Pedro Gonçalves and D. Sandra Crespo for their availability and kindness; my colleagues in laboratory 6.3.30 for their unexpected conversations, which so often broke the monotony; and the members of the Computer Systems group at NOVA-LINCS for making me feel so welcome on every visit to their department.

N. P. L.
Sintra, 25 September 2015.
Summary

Deoxyribonucleic acid (DNA) is one of the biological macromolecules best known to society, and it remains a major target of research. In the early 1990s, the Human Genome Project was launched with the goal of sequencing the information contained in DNA. Thirteen years later, and fifty years after Watson and Crick revealed the double-helix structure of DNA, the first human genome sequence was presented almost in its entirety; the Human Genome Project had come to an end, but not without changing the biological sciences and biomedical research. The development of sequencing technologies and the availability of genomic sequences of model organisms, beyond the human one, turned resequencing into a widely used method for reading the information stored in the genome. Nowadays, next-generation sequencing (NGS) technologies allow the fast, low-cost production of billions of short raw DNA fragments, usually referred to as reads, and an important step in the analysis of these data is aligning the reads to a reference genome sequence to determine where they belong, that is, mapping them.

The process of mapping this vast quantity of short DNA fragments, each a few hundred bases long, to a genomic sequence that is often very long (for example, the human genome has over 3 billion base pairs) is computationally expensive. Moreover, in this process it is essential to distinguish between technical sequencing errors and naturally occurring genetic variations (which sometimes lead to disease) in the sampled subject. To meet this challenge, many tools have been developed using different algorithmic approaches, some of which include the quality information associated with each sequenced base, which reports the probability of that base being wrong.

One of the methods used in mapping involves searching the reference genomic sequence for a subsequence of a read; then, the whole read is aligned against the corresponding region of the genome. In this work, this matching is based on a hash table, a data structure that associates search keys with values, which stores short subsequences of the genome and their corresponding positions in the sequence, working as an index. Several algorithms for creating the search keys (subsequences) for each read were implemented in Java; the main idea is that by associating more than one key with each read we increase the chance of finding the location it belongs to, and consequently of mapping the whole dataset.

Accordingly, our read mapping solution is based on the modular programming paradigm, in which each module is responsible for one part of a series of tasks, two of which stand out: search key creation and alignment. In the creation of search keys, our algorithms take into account the similarity between DNA bases and/or the quality values associated with the bases that compose the read; starting from a subsequence, swapping bases with the remaining ones generates new search keys. We also implemented a method from the literature that divides the read into equal, overlapping pieces.

For the alignment, three versions of the Needleman-Wunsch method were implemented, a dynamic programming algorithm designed for the global alignment of biological sequences, which accounts for insertions and deletions of bases in the sample genome relative to the reference. The alignment between the two sequences is given a score that measures their similarity; thus, in a simple implementation there are only two situations: either the base of the read matches the base of the genome, or it does not and a penalty is applied. When we resort to a base similarity matrix, finding the same base in the alignment of the sequences yields the maximum score, structurally similar bases yield a lower score, and no match at all results in a penalty. Finally, we implemented a version based on one from the literature, which includes the probability of each of the four bases being the correct one in the score computed against the genomic sequence, and in which insertions and deletions (which lead to a mismatch, that is, a gap between the sequences) carry a heavier penalty.

Since we have several algorithms for key creation and for alignment, one advantage of our modular approach is being able to experiment with different combinations of them. The possible combinations were tested with artificial datasets, in which the reads were taken from known positions of a reference sequence, built with real quality values and with simulation of the most common technical sequencing errors. The evaluation of the results covered execution time (scalability), whether all reads of the dataset were mapped (coverage), and whether all reads were mapped to the correct location of the reference genome (precision). Regarding the last parameter, we also considered whether a read was mapped to more than one position, and whether it was mapped to one or more possible locations with a relevant alignment score, that is, an alignment resulting from a match above 85%. Considering several mapping positions for a read is an important aspect: on the one hand, the number of DNA fragments repeated throughout the genomes of several species is a problem; on the other hand, some NGS protocols depend on the number of reads mapped to a location (such as ChIP-Seq). However, incorrect mapping can lead to errors in the subsequent steps of data analysis, such as the false detection of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs). Some tools were built with precision in mind, returning the best location for each read and discarding the remaining candidates. Others, however, focused on the detection of SNPs and single nucleotide variations (SNVs), take the multiple mapping locations into account, with the set of bases mapped to a given position conferring a degree of certainty.

Finally, from the evaluation of our read mapping prototype we concluded that scalability must be improved so that the tool can be applied to real datasets of considerably larger dimension than those tested. Since we used artificial data, coverage was, as expected, total: all reads were mapped to the reference sequence. The search keys corresponding to overlapping pieces of the reads led to perfect precision (100% of the reads of the simulated datasets were mapped to their location of origin in the reference sequence) and to more reads finding their best position, up to about 94% of a dataset. The version of the Needleman-Wunsch alignment method enriched with a base similarity matrix leads more reads to discover other possible locations, reflected in an increase of up to 7%, by accepting nucleotide variations in the alignment.

Real reads of Escherichia coli UTI89 were mapped to its genome, which allowed us to confirm our observations on scalability. However, despite the results obtained with artificial data, with this dataset only the search keys created from overlapping pieces made it possible for the respective reads to find the best and other possible locations in the genome. These results improved when that algorithm was combined with the version of the Needleman-Wunsch alignment method enriched with a base similarity matrix, leading 28% of the reads to find their best position and 11% to find other possible positions. Furthermore, the larger the number of keys associated with each read, the larger the number of mapped reads, resulting in a coverage of 100%.

Future work should include improving scalability (possibly with cloud computing solutions), saving the mapping results to a SAM-format file, and adapting the tool to paired-end reads, whose mapping requires a maximum distance between the mates and is therefore more reliable. Additionally, the modular nature of our prototype allows experimenting with other algorithms for the tasks of creating search keys and aligning the sequences, and extending the tool with other functions specific to the NGS application that produced the data (for example, Bisulfite-seq).

The implementation code is available in a public repository1.

Keywords

DNA; NGS technologies; Algorithms; Read mapping
1https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
Abstract
Mapping is a computationally expensive process, because it involves aligning a large amount of reads, each a few hundred bases long, to a long reference genome (e.g., the human genome has over 3 billion base pairs). Moreover, a major challenge is to distinguish technical sequencing errors from biological variations that may occur in the sample. The work presented in this thesis aims to face the mapping challenges by developing a tool that explores and enhances hash-based approaches: it increases the search space over the reference genome by generating multiple keys for each read, taking into account quality information and/or biological constraints in the alignment. These key-generating algorithms were combined with different read alignment strategies based on the Needleman-Wunsch method in a read mapping pipeline.

Finally, we evaluated our prototype with simulated datasets regarding scalability — the time required for execution —, coverage — the percentage of reads that are effectively mapped — and precision — mapping reads to the correct location in the reference genome. Although much work remains in terms of scalability, all the algorithmic combinations led to perfect coverage of the simulated datasets. As for precision, we observed that generating multiple keys by dividing the reads into overlapping pieces is the best approach, leading to 100% of the reads being mapped at their original location. On the other hand, relying on a base similarity matrix to perform the alignment led to more reads discovering other possible locations, resulting in a 7% increase; this is a particularly interesting result when dealing with real datasets because of the repetitive DNA sequences and genetic variations that may occur within the genome. We also mapped real reads of Escherichia coli UTI89 to its genome sequence, which allowed us to confirm the observations about scalability and to realise that this algorithmic combination is better suited to finding the best and other possible locations for the reads within the genome, as shown by the 28% and 11% of reads obtained for each task, respectively. Moreover, by assigning more than one key to each read we improved the coverage to 100%.
Keywords
DNA; NGS technologies; Algorithms; Read mapping
Contents
List of Figures

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Contributions and Results
1.4 Overview

2 Related Work
2.1 From DNA Discovery to the Human Genome Sequence
2.2 Next-Generation Sequencing (NGS) Technologies
2.2.1 Comparison Between NGS Platforms
2.2.2 Errors and Biases
2.2.3 Third Generation Sequencing Technologies
2.2.4 The FASTQ File Format
2.3 Survey of Read Mapping Algorithms
2.3.1 Algorithms based on Hash Tables
2.3.2 Algorithms based on Burrows-Wheeler Transform (BWT)
2.3.3 Best-mapper vs All-mapper
2.4 Genomics meets Cloud Computing
3 Read Mapping Pipeline
3.1 Architecture
3.2 Implementation
3.2.1 Multithreading
3.2.2 Genome
3.2.3 Read
3.2.4 Probabilities
3.2.5 Exploder
3.2.6 Aligner
3.2.7 Combiner
3.2.8 Abstract Classes Instantiation

4 Results and Discussion
4.1 Scalability
4.2 Coverage
4.3 Precision
4.4 Escherichia coli UTI89

5 Conclusions and Future Work
5.1 Future Work

Bibliography
List of Figures
1.1 Cost per Raw Megabase of DNA Sequence.
2.1 Pairing of nucleotide bases.
2.2 Workflow of the Sanger sequencing method versus second-generation sequencing.
2.3 Extract from a file in FASTQ format.
3.1 Read Mapping Pipeline scheme.
3.2 Example of a .properties file content.
3.3 Sliding window.
3.4 Scheme of the algorithm for the best key explosion.
3.5 Definition of transition and transversion.
3.6 Similarity Matrices.
4.1 Runtime vs Number of Reads (NW). Results from Machine 1.
4.2 Runtime vs Number of Reads (NW). Results from Machine 2.
4.3 Runtime vs Number of Reads (NW plus SM). Results from Machine 1.
4.4 Runtime vs Number of Reads (NW plus SM). Results from Machine 2.
4.5 Runtime vs Number of Reads (GNUMAP-based NW). Results from Machine 1.
4.6 Runtime vs Number of Reads (GNUMAP-based NW). Results from Machine 2.
4.7 Runtime vs Exploder and Aligner Combination (1000 reads). Results from Machine 1.
4.8 Runtime vs Exploder and Aligner Combination (1000 reads). Results from Machine 2.
4.9 Runtime vs Exploder and Aligner Combination (2000 reads). Results from Machine 1.
4.10 Runtime vs Exploder and Aligner Combination (2000 reads). Results from Machine 2.
4.11 Runtime vs Exploder and Aligner Combination (3000 reads). Results from Machine 1.
4.12 Runtime vs Exploder and Aligner Combination (3000 reads). Results from Machine 2.
4.13 Mapping Results for 1000 reads.
4.14 Mapping Results for 2000 reads.
4.15 Mapping Results for 3000 reads.
4.16 Rate of Reads with other Possible Locations.
4.17 Rate of Reads with a Best Location found.
4.18 Rate of Incorrectly Mapped Reads.
4.19 Runtime vs Exploder and Aligner Combination (E. coli UTI89).
4.20 Mapping Results for E. coli UTI89.
Chapter 1
Introduction
The beginning of the 21st century was marked by the "essentially complete" human genome sequence (Collins et al., 2004), which triggered a rapid evolution of sequencing technologies (McPherson, 2014). This brought many challenges to bioinformatics, mostly due to the availability of an increasing amount of data at decreasing costs (Figure 1.1). From software development for de novo assembly or sequence alignment to the design of new data structures (Ferragina and Mishra, 2014), not to mention the solutions offered by cloud computing and big data technologies (O'Driscoll et al., 2013), today's bioinformaticians have a lot to explore, and to improve, before the future arrives.
One feature inherent to next-generation sequencing (NGS) technologies is the fast production of billions of raw short contiguous DNA fragments, usually called reads. With the availability of model organisms' genome sequences, particularly the human genome sequence, an important step of NGS data analysis is the mapping process, i.e., aligning each read to a known reference genome so as to determine its location.
Figure 1.1: Cost per Raw Megabase of DNA Sequence. The cost to determine one megabase (a million bases, Mb) of raw, unassembled sequence data. Values from 2001 through October 2007 represent the cost of generating DNA sequence using Sanger-based chemistries and capillary-based instruments. Beginning in January 2008, the data represent the cost of generating DNA sequence using next-generation sequencing (NGS) platforms. A hypothetical line reflecting Moore's Law is shown for comparison, since technology improvements that keep pace with Moore's Law projections are considered to be doing very well. Data from Wetterstrand (2015).
1.1 Motivation
Mapping reads of a few hundred base pairs (bp) is a computationally expensive process, since reference sequences may span billions of bp; for instance, the human genome has over 3 billion bp. Moreover, repetitive DNA sequences, which are common and abundant in the genomes of many species, lead to the mapping of a single read to multiple locations, creating technical challenges that may result in errors and biases in downstream analysis. These imprecise results may, in particular, lead to false inferences of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) (Treangen and Salzberg, 2012).
On the other hand, despite the undoubted impact of NGS technologies, these platforms produce a vast amount of data that demands storage solutions. Additionally, the fact that each platform differs in features such as read length, data format, and sequencing method affects the methodologies that should be employed in their analysis (Zhang et al., 2011).

Thus, many mappers — i.e., read mapping software — have been developed, relying on different approaches. However, many of these solutions have limitations related to scalability — the time required to execute the mapping — when based on the Burrows-Wheeler transform, or to memory footprint when based on a hashing method (Lee et al., 2014; Hatem et al., 2013). Furthermore, a mapping tool has to take into account coverage — the percentage of reads that are effectively mapped — and precision — mapping reads to the correct location in the reference genome — so as to obtain the best depth — the number of reads covering a given locus of the genome.
Due mostly to the limitations of current technologies, which generate reads with sequencing errors (e.g., base miscalls), a major challenge is the ability to distinguish between technical errors and biological variations present in the sequenced sample. Hence, if every read is mapped, and each of them is correctly mapped to a location, we will have more certainty in the consensus sequence of the sample, which is of extreme importance when detecting genetic variants (like SNPs or single nucleotide variations (SNVs)) relative to the reference genome (O'Rawe et al., 2013).
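The per-base reliability that informs this distinction is usually expressed as a Phred quality score, encoded per base in FASTQ files (Section 2.2.4). As a brief illustration, a hypothetical helper, not code from the thesis, converting a quality character under the common Phred+33 encoding into the probability that the base call is wrong, P = 10^(-Q/10):

```java
// Hypothetical helper (not the thesis implementation): decodes a
// Sanger/Illumina 1.8+ FASTQ quality character (Phred+33 encoding)
// into the probability that the base call is wrong.
public class PhredQuality {
    // Phred score Q is the character's ASCII code minus the offset 33.
    static int phred(char c) { return c - 33; }

    // Error probability P = 10^(-Q/10).
    static double errorProbability(char c) { return Math.pow(10, -phred(c) / 10.0); }

    public static void main(String[] args) {
        System.out.println(phred('I'));            // 'I' (ASCII 73) encodes Q40
        System.out.println(errorProbability('I')); // Q40 means 1 error in 10,000 calls
    }
}
```

A base with a high Phred value that disagrees with the reference is thus more plausibly a biological variation than a sequencing error.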
1.2 Objectives
The work presented in this thesis aims at developing a mapping tool with high coverage and precision by exploring and enhancing hash-based approaches: the search space over the reference genome is increased by generating multiple keys for each read, taking into account quality information and/or biological constraints. Therefore, despite existing sequencing errors, a read will generate several keys, which translate directly into multiple locations in the reference genome to be searched, significantly improving the chances of finding the right location. The mapping becomes definitive after a positive alignment of the read with the genome, using algorithms based on the Needleman-Wunsch method (Needleman and Wunsch, 1970). This strategy, however, aims at finding the right balance between the number of locations that are effectively searched and the precision and coverage achieved by the mapping solution.
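For reference, the alignment step can be sketched as a minimal Needleman-Wunsch global alignment scorer. This is an illustrative sketch with arbitrary match/mismatch/gap parameters, not the scoring scheme actually used by the tool:

```java
// Minimal Needleman-Wunsch global alignment scoring sketch.
// MATCH, MISMATCH and GAP are illustrative values, not the tool's parameters.
public class NeedlemanWunsch {
    static final int MATCH = 1, MISMATCH = -1, GAP = -2;

    // Fills the dynamic-programming matrix m, where m[i][j] holds the best
    // score for aligning the first i bases of a with the first j bases of b,
    // and returns the optimal global alignment score.
    public static int score(String a, String b) {
        int[][] m = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) m[i][0] = i * GAP; // all-gap prefix of b
        for (int j = 1; j <= b.length(); j++) m[0][j] = j * GAP; // all-gap prefix of a
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = m[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                // Best of: align both bases, gap in b, or gap in a.
                m[i][j] = Math.max(diag, Math.max(m[i - 1][j] + GAP, m[i][j - 1] + GAP));
            }
        }
        return m[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(score("GATTACA", "GATTACA")); // prints 7 (all matches)
        System.out.println(score("GATTACA", "GACTACA")); // prints 5 (one mismatch)
    }
}
```

The variants discussed later replace the fixed MATCH/MISMATCH constants with a base similarity matrix, or with per-base probabilities as in GNUMAP.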
Another goal of this work is to develop a user-friendly tool, where the best results are obtained with fewer parameters required to be set by the user, while remaining independent of the sequencing technology platform.
1.3 Contributions and Results
The approach implemented, coded in Java, takes advantage of modular programming to enable,
in a simple way, the user to plug different algorithms responsible for generating the keys used in
the sequencing process, as well as the read alignment strategy. This allows to study the influence
of such algorithms in the mapping and to further extend the tool with additional algorithms to
pursue the best combination of the keys generation and read alignment algorithms. The source
code is available in a public repository1.
In the end, this work contributes:

• A read mapping pipeline in which the reference genome is hashed and multiple search keys for each read are used to find the candidate genomic locations;

• Four algorithms to generate multiple search keys for one read, which take into account the Phred quality values and/or nitrogenous base similarity;

• A simple implementation of the Needleman-Wunsch method (Needleman and Wunsch, 1970) to align two nucleotide sequences, and another in which a similarity matrix is used to score the matches between the sequences;

• A Java version of the sliding window algorithm to retrieve search keys from a read, and of the variant of the Needleman-Wunsch method implemented in GNUMAP (Clement et al., 2010);

• A tool that allows combining these different algorithms to generate keys and to align a read to a reference sequence.
1https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
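To make the hashed-reference, multiple-key idea concrete, the following is a minimal, hypothetical sketch (illustrative class and method names, not the pipeline's actual ones) of indexing a genome's k-mers in a hash table and probing it with overlapping search keys taken from a read:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not the thesis code): hash every k-mer of the
// reference to its start positions, then probe the table with several
// overlapping keys from a read to collect candidate mapping locations.
public class MultiKeyLookup {
    // Builds the index: each k-mer of the genome mapped to its start positions.
    static Map<String, List<Integer>> index(String genome, int k) {
        Map<String, List<Integer>> table = new HashMap<>();
        for (int i = 0; i + k <= genome.length(); i++) {
            table.computeIfAbsent(genome.substring(i, i + k), s -> new ArrayList<>()).add(i);
        }
        return table;
    }

    // Sliding window: overlapping k-mers of the read, each used as a search key.
    static List<String> keys(String read, int k, int step) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i += step) out.add(read.substring(i, i + k));
        return out;
    }

    // Candidate read start positions: each key hit in the table, shifted back
    // by the key's offset inside the read.
    static Set<Integer> candidates(Map<String, List<Integer>> table, String read, int k, int step) {
        Set<Integer> hits = new LinkedHashSet<>();
        int offset = 0;
        for (String key : keys(read, k, step)) {
            for (int pos : table.getOrDefault(key, List.of())) hits.add(pos - offset);
            offset += step;
        }
        return hits;
    }

    public static void main(String[] args) {
        String genome = "ACGTACGTTAGC";
        // Read with one simulated sequencing error in its last base
        // (ACGTTAGG vs the reference's ACGTTAGC at position 4): the
        // error-free keys still hit, so the true location is recovered.
        System.out.println(candidates(index(genome, 4), "ACGTTAGG", 4, 2)); // prints [0, 4]
    }
}
```

This is exactly why generating several keys per read helps: even when one key is corrupted by a sequencing error, the others still reach the hash table, and the surviving candidates are then confirmed (or discarded) by the alignment step.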
An evaluation of the prototype was made with simulated datasets, and we observed that:

• It did not succeed regarding scalability, although the GNUMAP-based alignment algorithm required less time to execute no matter which method we used to generate the search keys;

• As expected, all the combinations led to a 100% coverage of the datasets, i.e., every single read was mapped to the reference sequence;

• By combining the GNUMAP-based algorithms we obtained a 100% precision, with the datasets entirely mapped to their original positions within the reference sequence, and more reads finding their best position, about 94%;

• However, the implementation of the Needleman-Wunsch method enriched with a similarity matrix led to more reads discovering other possible locations, reflected in a 7% increase. This is a particularly interesting result when dealing with real datasets because of the repetitive DNA sequences within the genome, but it also has advantages in finding SNPs and SNVs.
Real reads of Escherichia coli UTI89 were mapped to its genome sequence, allowing us to confirm the observations about scalability. Despite the results with simulated data, the prototype was only able to find the best and other possible locations when using the GNUMAP-based algorithm to create keys, obtaining better results when combined with the version of the Needleman-Wunsch method enriched with a similarity matrix, as inferred from the 28% and 11% of reads obtained for each task, respectively. Also, by assigning more than one search key to a read we improved the coverage up to 100%.
1.4 Overview
The next chapter introduces the historical path which ultimately brought us to this challenge.
Afterwards, we describe the NGS methodology and we refer to the main features of the most
commonly used platforms, as well as their errors and biases. The FASTQ file format, that plays
a significant role in our approach, is also presented. To conclude Chapter 2, we provide a brief
5
1. Introduction
review of some mapping tools currently available and cloud computing solutions that may be
used to improve our tool scalability. In Chapter 3, our approach is explained. First, we describe
its architecture and then the major implementation details are presented — where we include
the search key generation algorithms developed and the alignment methods supported by our
pipeline. The evaluation of the prototype is discussed in Chapter 4, in which we also explain
how the simulated FASTQ files were obtained and present the results for the mapping of real
reads from E. coli UTI89. The final chapter provides conclusions and discusses the aspects that
we believe must be taken into account in future improvements to our tool.
Chapter 2
Related Work
In this chapter, a historical background is presented to introduce some fundamental concepts of
genomics, the early beginnings of genome sequencing, and the Human Genome Project (HGP)
and its consequences. Then, differences between the most widely used Next-Generation
Sequencing (NGS) technologies are noted, as their inherent characteristics, namely sequencing
errors, also pose a challenge to mapping reads; the FASTQ file format, which the tool
presented in this thesis uses as input, is presented as well. Additionally, this chapter focuses on
mapping reads, by presenting a brief survey of algorithms for NGS data. Finally, cloud computing
solutions are mentioned since they can be used to improve the scalability of our prototype.
2.1 From DNA Discovery to the Human Genome Sequence
For centuries, farming techniques have been used to breed crops and animals with particu-
lar traits; however, it was only in the 19th century that Gregor Mendel published the results of
his investigation with peas and described how living organisms passed traits to their offspring
(Mendel, 1866).
In early 1869, Friedrich Miescher, while trying to understand the chemical basis of life,
discovered a new class of biological molecules in purified nuclei and called it “Nuclein” (Dahm,
2010). Six decades later, Phoebus Levene described the different types of nucleic acids,
ribonucleic acid (RNA) and deoxyribonucleic acid (DNA), and defined DNA as a sequence of units –
nucleotides – composed of a phosphate group, a deoxyribose sugar and one of four nitrogenous
bases: adenine, thymine, cytosine, and guanine (Levene and London, 1929). Meanwhile, Frederick
Griffith, in experiments with Streptococcus pneumoniae, determined that there must be a
genetic factor capable of transforming the bacteria (Griffith, 1928); this "transforming factor" was
demonstrated by Oswald Avery and his colleagues to be DNA (Avery et al., 1944). Finally,
Hershey and Chase confirmed DNA as the genetic material responsible for the heredity of
traits (Hershey and Chase, 1952).
From 1948 to 1952, Erwin Chargaff published a series of papers in which he concluded that
there is an adenine for every thymine, and a cytosine for every guanine in every living organ-
ism (Cohen and Portugal, 1974). These findings contributed to the DNA structure proposed by
James Watson and Francis Crick, based on the X-ray diffraction images of DNA obtained by
Maurice Wilkins and Rosalind Franklin (Watson and Crick, 1953a,b): the DNA is a double helix
in which two helical chains are hydrogen-bonded by complementary base pairs, adenine with
thymine and cytosine with guanine (Figure 2.1). Wilkins, Watson and Crick received the Nobel
Prize in Physiology or Medicine in 1962 for this discovery (Nobelprize.org, 2015b). No other
macromolecule in history has had its image so widespread in our society; it even received the
title “The Mona Lisa of Modern Science” (Kemp, 2003).
Figure 2.1: Pairing of nucleotide bases. Hydrogen bonds are shown dotted. Adapted from the work by Watson and Crick (1953b).
Then, in 1958, Francis Crick declared the "Central dogma of molecular biology" to explain
the transfer of the information contained in DNA to proteins (Crick et al., 1970); three years
later, he and his colleagues published their genetic experiments that, together with other works,
allowed them to observe the degeneracy of the genetic code: since an amino acid is coded by
a triplet (a group of three nucleotides), there are 64 possible codons to represent only 20 amino
acids (Crick et al., 1961). Afterwards, Nirenberg's team was able to relate 45 out of the 64
triplets with their respective amino acids and predict the remaining nucleotide sequences
(Nirenberg et al., 1965).
An important step in "reading" the content of a DNA sequence was made by Frederick
Sanger and colleagues when they determined the DNA sequence for the genome of bacterio-
phageΦX174 (Sanger et al., 1977a). Soon after, Allan Maxam and Walter Gilbert reported an ap-
proach to sequence DNA wherein terminally labelled DNA fragments were subjected to chem-
ical cleavage specific to each base and the reaction products resolved by polyacrylamide gel
electrophoresis (Maxam and Gilbert, 1977). Then, in the same year, Sanger described a new
sequencing method, applied to the genome of bacteriophage ΦX174, using DNA polymerase and
chain-terminating dideoxynucleotide analogs, thus causing base-specific termination of a newly
synthesised chain (Sanger et al., 1977b). This method revealed itself to be less laborious than
Maxam's. Both techniques led to half of the 1980 Nobel Prize in Chemistry being jointly awarded
to Sanger and Gilbert "for their contributions concerning the determination of base sequences
in nucleic acids" (Nobelprize.org, 2015a). With the cost of computer components beginning
to fall rapidly, allowing laboratories to have their own computers, and DNA sequencing
becoming a faster procedure, computer programs arose as a solution to handle and analyse data
produced by sequencing experiments (Staden, 1979). Sanger's method was adapted to 'shotgun'
sequencing, in which the DNA sequence assembly of overlapping smaller sub-sequences is
performed by computer software (Anderson, 1981). Further improvements to the Sanger sequencing
technique led to the adoption of fluorescent dyes, enabling computer-based automatic
base identification (Smith et al., 1986; Prober et al., 1987). Another example of the aid of
informatics in biology at this time is the FASTA program for protein and DNA sequence similarity
analysis and database searching (Pearson and Lipman, 1988); nowadays, FASTA is known as
the default text-based format for biological sequences.
When Robert Sinsheimer, then chancellor of the University of California in Santa Cruz,
proposed the possibility of sequencing the human genome in 1985, many thought his idea was
premature or even crazy, due to the demand for resources; however, in 1986, Charles DeLisi of
the U.S. Department of Energy (DOE) decided to fund research for genome sequencing and
mapping. Two years later, a special committee of the U.S. National Research Council of the U.S.
National Academy of Sciences recommended that the Human Genome Project be initiated, with
a deadline of 15 years and funding of about $200 million a year. In 1990, with James Watson
leading the National Institutes of Health (NIH) part of the now joint NIH-DOE project (Collins
et al., 2003), the Human Genome Project (HGP) started (Watson, 1990). This was the first large-
scale biology project, one that changed biology and the biomedical sciences: an international
endeavour that counted on the Sanger Centre (funded by the Wellcome Trust) and was assisted by
the private sector. The HGP promoted the development of new sequencing technologies, with
its need for high-throughput generation of biological data at low cost, which was boosted by
the advent of capillary sequencing machines. Its research on the legal, ethical, and social impact
of the knowledge being gathered, together with the collection of ever more biological data to be
analysed, annotated, stored, and made publicly accessible in user-friendly databases, created a
clear need for interdisciplinarity in genomics research.
Although the human genome was the flagship of the project, it also assembled the genomic
sequences of E. coli, S. cerevisae, C. elegans and D. melanogaster, and whole-genome
drafts of several other organisms, including the mouse and the rat, which opened the door to
Comparative Genomics (Collins et al., 1998, 2003). Thus, by February 2001, when the International Human
Genome Sequencing Consortium (Lander et al., 2001) and Celera Genomics (a private project
started in 1998) (Venter et al., 2001) reported the first draft of the human genome, the landscape
of biological and biomedical research had already started to change. The HGP successfully
ended two years earlier than initially planned (Collins et al., 2003), just in time to celebrate the
50th anniversary of the discovery of the DNA structure; the following year was marked by almost
99% of the euchromatic genome being sequenced to high accuracy (Collins et al., 2004). Nevertheless,
the understanding of the information encoded in the human genome was very limited, which
led to the launch of the Encyclopedia of DNA Elements (ENCODE) Project (Encodeproject.org,
2015) in September 2003, in which an international consortium, organized by the National
Human Genome Research Institute (NHGRI), received the task of identifying all the functional
elements encoded in the human genome sequence. There is still much to understand; however,
the results of the ENCODE project, combined with other large genomic data sets, may elucidate
the genetic and epigenetic factors responsible for the development and progression of human
diseases (Frazer, 2012), for example.
Since the Sanger sequencing method remained expensive despite having been heavily
refined and improved, the NHGRI initiated the “Advanced Sequencing Technology Development
Projects” in 2004 to motivate the development of low-cost sequencing, which led to next-generation
sequencing (NGS) technologies starting to become available. Although these high-throughput
technologies produce shorter reads, i.e., shorter synthesised DNA fragments, when compared
to the Sanger method, their parallelised sequencing process produces thousands of bases
per second at significantly reduced cost (Pettersson et al., 2009). NGS technologies are
improving biomedical investigation with clinical implications, such as cancer treatment (Ross and
Cronin, 2011; Bohlander, 2013; Offit, 2014) and infectious disease management (Pak and Kasarskis,
2015), while being widely used in many biological fields. The HGP's promise of change for biology,
biomedical research and health care (Collins et al., 1998) is fulfilled, with more to come
(Green et al., 2011).
"The ever quickening advances of science made possible by the success of the Human Genome
Project will also soon let us see the essences of mental disease. Only after we understand them at
the genetic level can we rationally seek out appropriate therapies for such illnesses as
schizophrenia and bipolar disease."
- James D. Watson (The New York Times, 2007)
2.2 Next-Generation Sequencing (NGS) Technologies
The automated Sanger method is considered a ’first-generation’ technology, in which the DNA
to be sequenced can be prepared by being randomly fragmented — sequencing library — and
then cloned to a plasmid vector and used to transform E. coli — for shotgun de novo sequen-
cing — or for PCR (Polymerase Chain Reaction) amplification carried out with primers that flank
Figure 2.2: Work flow of the Sanger sequencing method (a) versus second-generation sequencing (b). Adapted from the paper by Shendure and Ji (2008).
the target — for targeted resequencing. Both approaches output an amplified template: clonal
copies of the single plasmid insert within the bacterial colony (as depicted in Figure 2.2 (a)) or
PCR amplicons within a single reaction volume. The sequencing biochemistry takes place in
a ‘cycle sequencing’ reaction, within a microliter-scale volume, generating a ladder of ddNTP-
terminated, dye-labelled products, which are subjected to high-resolution electrophoretic sep-
aration of the single-stranded, end-labeled extension products in a capillary-based polymer gel;
finally, as fluorescently labelled fragments of discrete sizes pass a detector, the four-channel
emission spectrum is used to generate a sequencing trace, and software translates these traces
into DNA sequence while generating error probabilities for each called base (Shendure and Ji,
2008).
’Second-generation’ technologies is a term used to refer to multiple implementations of ’cyclic-array
sequencing’ and, although these approaches differ in biochemistry and array generation,
their work flows are conceptually similar (Figure 2.2 (b)). In comparison to Sanger sequencing,
these new technologies have the advantage of in vitro construction of a sequencing library,
followed by in vitro clonal amplification to generate sequencing features. Also, array-based
sequencing enables a much higher degree of parallelism than conventional capillary-based
sequencing; and, since its features are immobilized on a planar surface, they can be enzymatically
manipulated by a single reagent volume, leading to a drop in the effective reagent
volume (Shendure and Ji, 2008). Combined, these differences result in the cheap production of
an enormous volume of data, albeit with shorter reads.
2.2.1 Comparison Between NGS Platforms
Although there are a few commercially available platforms, Illumina, Roche 454 Sequencing,
and Applied Biosystems SOLiD have dominated the market (Zhang et al., 2011), being responsible for
a vast amount of the data produced by NGS technologies. Nowadays, Illumina stands out in the NGS
industry, and Roche announced the shutdown of its 454 operations by mid-2016 (McPherson,
2014). The reviews by Shendure and Ji (2008), Metzker (2010), and Liu et al. (2012b) explain the
details inherent to each sequencing method. The following discusses the fundamental aspects
of these methods to support a comparison between the three platforms 1:
• Illumina (Illumina, Inc., 2015) platforms rely on bridge PCR amplification to form clusters
with clonal DNA fragments; these fragments have free ends to which a universal sequen-
cing primer can be hybridised to initiate the sequencing reaction. Sequencing by syn-
thesis is the method adopted, wherein DNA synthesis is terminated by reversible termin-
ators following the incorporation of one of four modified nucleotides — each bearing one
of four fluorescent labels — by DNA polymerase. With sequencer options adapted to key
applications, Illumina systems have an output range from 20-39 Gb to 1.6-1.8 Tb with a
1 To compare the sequencers’ output and read lengths, the following metric is used: 1 base pair (bp); 1 000 000 bases = 1 megabase (Mb); 1 000 000 000 bases = 1 gigabase (Gb).
run time that ranges from 15 to 40 hours or 1 to 6 days. Currently, the maximum read
length ranges between 2 x 125 and 2 x 150 bp, depending on the Illumina model
employed.
• Roche 454 Sequencing (Roche Diagnostics Corporation, 2015) platforms use single stran-
ded DNA fragments that are captured by beads and emulsion PCR for clonal amplification.
The beads are deposited into individual wells where the sequencing is performed by the
pyrosequencing method; here, the amount of pyrophosphate released matches the amount of
incorporated nucleotide and promotes a chemical reaction that generates visible light. Currently, the GS
FLX+ System can be used with two sequencing kits: one produces reads with lengths up to
1000 bp, with a typical throughput of 700 Mb within 23 hours, and the other has a typical
throughput of 450 Mb within 10 hours of run time and a read length of up to 600
bp.
• Applied Biosystems SOLiD (Sequencing by Oligo Ligation Detection) (Thermo Fisher Sci-
entific Inc., 2015) sequencers also rely on emulsion PCR and adopted the technology of
two-base sequencing based on sequencing by ligation, an approach in which DNA poly-
merase is replaced by DNA ligase, as each sequencing cycle introduces a partially degener-
ate population of fluorescently labeled octamers. However, the 5500 W Genetic Analyzer
sequencer replaced the beads with direct amplification on FlowChip; depending on the
library used, read length can be 75 bp (fragment), 2 x 50 bp (mate-paired) and 50 bp x 50
bp (paired-end) with a of the throughput approximately 80 Gb to 160 Gb.
Targeted at clinical applications and small labs, Ion Torrent Systems (later acquired by
Life Technologies) launched the Personal Genome Machine (PGM), wherein DNA fragments
with specific adapter sequences are linked to surface beads (known as Ion Sphere Particles)
and then clonally amplified by emulsion PCR; proton release signals the incorporation of
nucleotides during synthesis. For the same market, Illumina developed the MiSeq. These two
platforms are similar in terms of utility and ease of work flow; however, the PGM has a higher
sequencing error rate (Quail et al., 2012). Roche also has a benchtop version of the 454 Sequencing
System: the GS Junior System.
2.2.2 Errors and Biases
Although all the different approaches introduced rely on a complex interplay of chemistry, hardware,
and optical sensors, they differ in other mechanical details, which affect the types of
sequencing errors and biases produced by each type of platform. At the end of each sequencing
pipeline is a piece of software that analyses the sensor data to predict the individual bases; this is
referred to as base-calling.
Solexa/Illumina platforms have been reported to have error rates that increase along the
read, in which G to T and A to C conversions are among the most frequent base substitution
errors (Dohm et al., 2008), and wrong base-calls are frequently preceded by the base G, showing a
GC bias in these platforms (Bravo and Irizarry, 2010; Minoche et al., 2011). Incorrect prediction
of the length of homopolymers (consecutive runs of the same base) leads to the insertion and
deletion errors associated with the Roche 454 platform (Ledergerber and Dessimoz, 2011). Since
all bases of a homopolymer are included in a single cycle, its length has to be inferred from the
signal intensity, thus, quality scores do not provide a measure that a base at a given position is
correct, but merely indicate that homopolymer length has been called correctly (Dohm et al.,
2008). The Ion Torrent PGM sequencer also presents limitations in sequencing homopolymers,
leading to a large amount of indel errors, and an AT bias (Quail et al., 2012). Finally, SOLiD
machines, which implement the sequencing-by-ligation method, are incapable of sequencing through
palindromic regions (Huang et al., 2012).
Software tools that aim to correct errors, such as Fiona (Schulz et al., 2014), have emerged
as solutions to improve downstream analysis (Yang et al., 2013).
2.2.3 Third Generation Sequencing Technologies
Second-generation sequencing technologies are commonly known as the next generation, but
a third generation has arisen with two main characteristics: PCR is not required before sequencing,
meaning a shorter DNA preparation time, and the signal is captured in real
time, i.e., the signal is monitored during the enzymatic reaction of adding nucleotides to the
complementary strand. The single-molecule real-time (SMRT) method, developed by Pacific
Biosciences, and Nanopore sequencing are approaches that belong to this new generation of
sequencing technologies (Liu et al., 2012b).
2.2.4 The FASTQ File Format
The sequencing technologies, such as Illumina and 454, produce a text-based output in which
the DNA fragments — i.e., reads — are represented by sequences with the letters A, C, G, T and
N; the first four letters represent nucleotide bases that can be present in a genome (Adenine,
Cytosine, Guanine, and Thymine respectively), and, since the sequencing reading process is not
perfect, in some cases the sequencer prefers to return a “not known” signal — hence the letter
N — instead of returning an incorrect value. These reads are known as base (or letter) space
reads, to distinguish them from the colour space reads produced by SOLiD platforms.
@HWI-ST745_0097:7:1101:1005:1000#0/1
TTCTTCATACAGGGAAGCCTGTGCTTTGTACTAATTTCTTCATTACAAGATAAAAGTCT
+HWI-ST745_0097:7:1101:1005:1000#0/1
<D=<D===<C<=<<=<EA.=<C<=B:<=<===<<C<=C==B;<=<=;=C=FC5';FB5!
@HWI-ST745_0097:7:1101:1006:1000#0/1
CGCGCCAGAATGAAAAACAGAGTTCAAATTTTAAATGGACTACATCCAATGTTAAATAT
+HWI-ST745_0097:7:1101:1006:1000#0/1
=>5C?+=862>6;=@7=C=;;8<=82=87:5C=<1FB4&=98C<<C<C=:<=::;EA3<
@HWI-ST745_0097:7:1101:1007:1000#0/1
AAATGGACTACATCCAATGTTAAATATAAAAAACAAAAAGATGTAAATTTTACTGTCAC
+HWI-ST745_0097:7:1101:1007:1000#0/1
<=<<=<<=<B:<=EA.<B:=C<<==<=<=<<=<<;B;===B;B:=B:B;<<==B:=<=D
Figure 2.3: Extract from a file in FASTQ format. File produced by the ArtificialFastqGenerator (Frampton and Houlston, 2012).
The FASTQ file (Figure 2.3) format is the de facto common format for sequencing data. It
provides a simple extension of the FASTA format, which is the ability to store a numeric score
associated with each nucleotide base in a sequence. Thus, a FASTQ file consists of three different
sub-sources: the headers (identifiers), sequence bases, and quality scores. The quality score for
a called base is defined in terms of the estimated probability of error (Pe):

QPhred = −10 × log10(Pe)
Phred scores are the de facto standard representation for sequence base qualities. In the
FASTQ format, Phred qualities, whose values range from 0 to 93, are encoded as ASCII characters
with codes between 33 and 126 (corresponding to printable characters), which gives a very
broad range of error probabilities, from 1.0 (a wrong base) to 10−9.3 (an extremely accurate base)
(Cock et al., 2010).
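As a minimal illustration of this encoding (a sketch covering only the Sanger/standard variant, with its ASCII offset of 33; the helper names are ours, not part of any tool discussed here), the conversion between quality characters, Phred scores, and error probabilities can be written as:

```python
# Sketch of Sanger-standard FASTQ quality decoding (ASCII offset 33).

def phred_from_char(c: str) -> int:
    """Phred quality score encoded by one printable ASCII character."""
    return ord(c) - 33

def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(Pe) to recover the estimated error probability."""
    return 10 ** (-q / 10)

# '!' (ASCII 33) encodes Q = 0, i.e. Pe = 1.0 (a wrong base);
# '~' (ASCII 126) encodes Q = 93, i.e. Pe = 10 ** -9.3.
scores = [phred_from_char(c) for c in "<D=<D"]
```

The Solexa/Illumina variants mentioned below differ only in the offset and score definition, so the same sketch adapts to them by changing these two functions.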
As illustrated in Figure 2.3, the FASTQ file format represents each read with four lines, where:
1. the first line starts with the ’@’ character, followed by the record identifier and additional
information (such as length or paired-end read information); similar to the header of the
FASTA file format, it is a free-format field with no length limit or format restriction;
2. the second line holds the nucleotide base sequence, without white spaces; the use of
upper case is conventional (although not mandatory);
3. the third line begins with the character ’+’ and is optionally followed by the header from line
1; it only serves to signal the end of the sequence and the start of the next line;
4. the fourth and last line contains the ASCII-encoded quality scores and must contain as
many symbols as there are letters in line 2.
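A minimal sketch of a parser for the four-line records just described (our own illustrative code, assuming well-formed input with no line wrapping inside records):

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from FASTQ text lines."""
    it = iter(lines)
    for header in it:
        sequence = next(it).strip()
        separator = next(it)              # line 3: starts with '+'
        quality = next(it).strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):  # rule 4: one score per base
            raise ValueError("quality string length must match sequence")
        yield header[1:].strip(), sequence, quality
```

Feeding it the lines of Figure 2.3, for example, would yield the identifier HWI-ST745_0097:7:1101:1005:1000#0/1 together with its sequence and quality strings.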
Because of its simplicity, FASTQ has become widely used as a simple interchange file
format between tools. Solexa/Illumina has created its own versions of the FASTQ format, in which
a different range of Phred scores is used (Cock et al., 2010); however, the different formats
can easily be converted between one another using Open Bioinformatics Foundation (O|B|F, 2015) tools
(BioJava, 2015; Biopython, 2015; BioRuby, 2015; BioPerl, 2015; EMBOSS, 2015). On another note,
next-generation sequence reads are typically available online at the Sequence Read Archive
(SRA), which already provides tools to convert the available data to the FASTQ format (NCBI, 2015).
2.3 Survey of Read Mapping Algorithms
The emergence of NGS platforms enabled the production of billions of short reads with their
massively parallelised sequencing methods. Meanwhile, the Human Genome Project
established reference sequences for the human genome and some model organisms, such as
E. coli, S. cerevisae, mouse and rat (Collins et al., 1998, 2003), enabling resequencing using
short reads. Hence, NGS technologies have broadened the applicability spectrum of
genomic sequencing, with finding the true location of a read within a genome being a crucial step
in many projects, one whose result affects the downstream analysis. Today, an investigator has a
fair number of mappers — i.e., software to map the reads against a reference genome — available,
ranging from popular ones, like Bowtie (Langmead et al., 2009b), with the advantage
of being widely used and constantly updated 2, to recent ones that aim to outperform the
existing tools with a new approach, such as Arioc (Wilton et al., 2015). For instance, the works of
Holtgrewe et al. (2011) and Hatem et al. (2013) aim to help the user choose the best tool for
his needs.
The mapping process, i.e., aligning a read to a reference genome to find its true location,
is, from the informatics point of view, a string matching problem. Algorithms to match strings
were proposed long before the advent of NGS technologies (Baeza-Yates and Perleberg,
1992); however, although reads and genomes are simple strings constructed from the letters A, C,
G, T and N, the challenge lies in distinguishing between technical sequencing errors and genetic
variation within the sample. Thus, read mapping becomes an approximate string matching problem,
where the search for the read within the reference genome must allow some mismatches and
gaps between the two (Reinert et al., 2015), while at the same time efficiently managing large
amounts of data as well as a large search space in the form of a wide reference sequence. The
advances in sequencing technology have stimulated software development, with many approaches
arising from the beginning (Li and Homer, 2010; Fonseca et al., 2012). However, most
of the fast alignment algorithms build auxiliary data structures — the indices — for the reads
or the reference sequence to find the genomic positions for each read, and we can group the
mapping tools based on the method used to build the index: hash tables or Burrows-Wheeler
2Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
Transform (BWT) (Burrows and Wheeler, 1994).
2.3.1 Algorithms based on Hash Tables
All hash table based algorithms essentially follow the same ’seed and extend’ paradigm established
by BLAST (Altschul et al., 1990). This method allows BLAST to find similar sequences not by
comparing the sequences in their entirety, but rather by locating short matches between the
two sequences — the seeds. After this first match, it extends and joins the seeds, first without
gaps, and then refines them with an improved Smith–Waterman alignment (Smith and Waterman,
1981; Gotoh, 1982). Finally, it outputs the statistically significant local alignments as the final
results. However, the algorithms that are relevant to our work focus on mapping a set of short
query sequences — the reads — against a long reference genome of the same species, and for
these the spaced seed — a seed that allows internal mismatches, whose number of matching
positions is its weight — is a popular approach (Li and Homer, 2010). The detection of seeds usually
follows one of two methods: index the reads and scan through the reference genome, or index the
reference genome and align each read.
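The second strategy can be sketched in a few lines (a didactic simplification of our own, not the implementation of any particular tool): build a hash table from the k-mers of the reference, then derive candidate read positions from exact seed hits.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int):
    """Hash table mapping every length-k substring of the reference
    to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_starts(read: str, index, k: int):
    """Candidate alignment start positions for a read: for each exact
    seed hit, subtract the seed's offset inside the read."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            starts.add(pos - offset)
    return starts
```

Real mappers replace the exact lookup with spaced or adaptive seeds and verify each candidate position with an alignment algorithm such as Smith-Waterman.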
Index the reads and scan through the reference genome:
• MAQ (Li et al., 2008a), which uses the sequencing quality scores during mapping, splits
the reads to create adaptive seeds; to speed up the alignment, it only considers positions
that have two or fewer mismatches in the first 28 bp (default parameters). MAQ relies on
an ungapped alignment, but for the small fraction of unmapped reads it applies the
Smith-Waterman gapped alignment (Smith and Waterman, 1981).
• RMAP (Smith et al., 2008, 2009) also introduced quality scores into the mapping, but it
creates its spaced seeds using the pigeonhole principle (Baeza-Yates and Perleberg, 1992):
the reads are cut into k+1 pieces, allowing for at most k mismatches in a mapping, which
means any mapping must have at least one seed with no mismatches. RMAP does not
consider insertions or deletions (indels), so its strategy for handling indels is to extend
initial seed matches using a Smith-Waterman-style alignment. SeqMap (Jiang and Wong,
2008) follows the same pigeonhole principle to hash the reads, and since it splits the reads
and/or the genome into several parts, it can be used in parallel on large-scale data sets to
speed up the mapping process.
• RazerS (Weese et al., 2009) arose as a solution based on a q-gram counting strategy, allowing
for gaps within read subsequences of size q — the index keys — and searching for multiple
matches before the extension step. RazerS 3 (Weese et al., 2012) is an improved RazerS
able to map longer reads; it supports shared-memory parallelism and adds a second read
index based on the pigeonhole principle. To extend the matches, they rely on the Hamming
distance (Hamming, 1950) and on the edit distance algorithm from Hyyrö (2003).
• SHRiMP (Rumble et al., 2009) introduces a specialized algorithm for mapping colour space
reads from SOLiD sequencers, but it also maps base space reads from Illumina/Solexa. It
likewise relies on the q-gram counting strategy to find matches between the reads and the genome,
which are extended using the Smith-Waterman local alignment algorithm, implemented
using specialized “vector” instructions that are part of modern CPU instruction sets and,
hence, are efficient.
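The pigeonhole idea behind RMAP's and SeqMap's seeds can be illustrated with a short sketch (a hypothetical helper of our own, not RMAP's actual code): splitting a read into k + 1 pieces guarantees that, under at most k mismatches, at least one piece matches the reference exactly.

```python
def pigeonhole_pieces(read: str, k: int):
    """Split a read into k + 1 roughly equal pieces. If a mapping has at
    most k mismatches, at least one piece is an exact match (pigeonhole)."""
    n, pieces = len(read), k + 1
    bounds = [i * n // pieces for i in range(pieces + 1)]
    return [read[bounds[i]:bounds[i + 1]] for i in range(pieces)]
```

Each piece can then be looked up exactly in the reference index, and only the surviving candidate positions need full verification.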
Index the reference genome and align each read:
• SOAP (Li et al., 2008b), specifically designed for detecting and genotyping single nucle-
otide polymorphisms (SNPs), manages great amounts of NGS data by supporting multi-
threaded parallel computing and records the reference sequence and hash index tables
in memory. The GNUMAP (Clement et al., 2010) algorithm incorporates the base quality
scores into mapping analysis using a probabilistic variant of the Needleman-Wunsch al-
gorithm (Needleman and Wunsch, 1970) to accurately map reads with lower confidence
values; this tool creates overlapping contiguous k-mers — k-sized sequences — from the
genome sequence to build the index and splits the reads into a set of overlapping k-mers
to look up the index. Both tools were first designed for Illumina/Solexa data, but receive a
FASTQ file as input.
• SHRiMP2 (David et al., 2011) is an updated version of SHRiMP that switched to a genome
index, resulting in a dramatic speed increase, and allowed the use of multithreaded
computation. Also, to speed up the alignment, before starting the Smith-Waterman algorithm,
SHRiMP2 checks whether an identical region has already been aligned, so it can reuse the score.
This version supports Illumina/Solexa, Roche/454 and AB/SOLiD reads.
• mrFAST (Alkan et al., 2009) and mrsFAST (Hach et al., 2010) were both developed by
leveraging the same method, which creates a collision-free hash table to index k-mers from the
genome, interrogates the first, middle and last k-mers of each read in the hash table to
place initial ungapped seeds, and extends the seeds with a rapid version of the edit distance
(Levenshtein, 1966); however, the former supports gaps and mismatches while the latter
supports only mismatches, so as to lower its execution time. mrsFAST-Ultra (Hach et al.,
2014) improves the method of mrsFAST by compacting the index and adding parallelisation
and SNP-awareness features.
• Hobbes (Ahmadi et al., 2012) is based on generating overlapping substrings of length
q — q-grams — of the reference sequence, and constructs an inverted index of those q-gram
positions. The extension of the seeds uses the Hamming distance (Hamming, 1950)
and an implementation of the edit distance by Myers (1999). Hobbes2 (Kim
et al., 2014) is built on top of Hobbes, improving its performance in all aspects and scaling
well in a multithreaded environment. The update included an additional prefix q-gram
instead of bit vectors, reducing the memory consumption.
• MOSAIK (Lee et al., 2014) is a tool with the ability to map data from all major ’second’ and
’third’ generation sequencing technologies; it relies on an improved Smith-Waterman algorithm
(Gotoh, 1982) to align a read to a local region of the genome. MOSAIK creates overlapping
contiguous k-mers from the genome sequence to build a hash table. The reads are split
into a set of overlapping k-mers to query the stored reference hash table and retrieve the
genomic positions of each k-mer; a modified AVL tree (Adelson-Velskii and Landis, 1963)
is employed to handle and cluster the nearby positions to form a k-mer region.
• Adaptive seeds are an alternative to fixed-length seeds, such as the spaced seeds, as they
have their length extended until the number of matches in the target sequence is less than
or equal to a frequency threshold. First proposed by Kiełbasa et al. (2011), in a BLAST
variation, the adaptive seeds are used by AMAS (Hieu Tran and Chen, 2015) to speed up
the mapping process while preserving sensitivity and identifying all possible locations for
each read being mapped.
Recent approaches adapted the ’seed and extend’ method to parallel implementations based on specific hardware, like field-programmable gate arrays (FPGAs) (Chen et al., 2013) or graphics processing units (GPUs) (e.g., Masher (Abu-Doleh et al., 2013) and Arioc (Wilton et al., 2015)).
2.3.2 Algorithms based on Burrows-Wheeler Transform (BWT)
The Burrows-Wheeler Transform (BWT) is a data compression algorithm (Burrows and Wheeler,
1994) that was combined with a suffix array (Manber and Myers, 1993) — a sorted array of all suf-
fixes of a string — to create the FM-index (Ferragina and Manzini, 2000). Algorithms that transform the genome into an FM-index reduce the inexact matching problem to an exact matching one: they find exact matches with the index and then build inexact alignments supported by those exact matches. An advantage of this approach is that alignment to multiple identical copies of a subsequence in the reference needs to be done only once, whereas with a typical hash table index an alignment must be performed for each copy. Moreover, finding exact matches using backward search on an FM-index can be done in constant time per character (Li and Homer, 2010). However, despite the improvements in performance and its small memory footprint, building an FM-index takes significantly longer than building a hash table index (which in turn requires a large amount of memory to index wide genomes, like the human genome) (Fonseca et al., 2012; Hatem et al., 2013; Lee et al., 2014).
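The backward search just described can be illustrated with a toy sketch over a '$'-terminated text; the class and method names are illustrative, and this is in no way the implementation used by the aligners cited below (a real FM-index precomputes the rank values that this sketch scans for):

```java
import java.util.*;

// Toy FM-index: BWT from sorted rotations, plus backward search.
public class FmIndexSketch {
    final String bwt;
    final Map<Character, Integer> lessThan = new HashMap<>(); // C[c]

    FmIndexSketch(String text) {
        int n = text.length();
        Integer[] rot = new Integer[n];
        for (int i = 0; i < n; i++) rot[i] = i;
        // Sort all rotations (a stand-in for suffix array construction).
        Arrays.sort(rot, Comparator.comparing(
                i -> text.substring(i) + text.substring(0, i)));
        StringBuilder last = new StringBuilder();
        for (int r : rot) last.append(text.charAt((r + n - 1) % n));
        bwt = last.toString();
        // C[c] = number of characters in the text smaller than c.
        char[] sorted = text.toCharArray();
        Arrays.sort(sorted);
        for (int i = 0; i < n; i++) lessThan.putIfAbsent(sorted[i], i);
    }

    // Count of c in bwt[0..end); a real index precomputes these ranks.
    int occ(char c, int end) {
        int count = 0;
        for (int i = 0; i < end; i++) if (bwt.charAt(i) == c) count++;
        return count;
    }

    // Backward search: one C/occ update per pattern character.
    int count(String pattern) {
        int lo = 0, hi = bwt.length();
        for (int i = pattern.length() - 1; i >= 0; i--) {
            char c = pattern.charAt(i);
            if (!lessThan.containsKey(c)) return 0;
            lo = lessThan.get(c) + occ(c, lo);
            hi = lessThan.get(c) + occ(c, hi);
            if (lo >= hi) return 0;
        }
        return hi - lo;
    }

    public static void main(String[] args) {
        FmIndexSketch fm = new FmIndexSketch("ACGTACGT$");
        System.out.println(fm.count("ACG")); // 2
        System.out.println(fm.count("GTA")); // 1
    }
}
```

Note that the pattern is consumed from its last character to its first, which is why the technique is called backward search.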
Popular BWT-based aligners are:
• Bowtie (Langmead et al., 2009b), which creates indices small enough to be distributed
over the internet and easily accessible. Bowtie does not simply adopt the exact matching
algorithm to search the FM-index, because exact matching does not allow for sequencing
errors or genetic variations. So, it introduces a quality-aware backtracking algorithm that
allows mismatches and favours high-quality alignments. It employs a ’double indexing’,
a strategy to avoid excessive backtracking. Bowtie 2 (Langmead and Salzberg, 2012) ex-
tends the method applied in Bowtie to allow gapped alignment by dividing the algorithm
between an ungapped seed-finding stage and a gapped extension stage, that uses dynamic
programming. Bowtie 2 relies on the efficiency of single-instruction multiple-data (SIMD)
parallel processing to accelerate the dynamic programming.
• Burrows-Wheeler Alignment tool (BWA) (Li and Durbin, 2009) emerged with an algorithm similar to Bowtie's, but with a smaller search space, adapted to map both base space reads, e.g., from Illumina sequencers, and colour space reads from SOLiD machines. BWA-SW
(Li and Durbin, 2010) adds a Smith-Waterman-like dynamic programming mechanism to
BWA, so it can align long sequences up to 1000 base-pairs against a large sequence data-
base with a few gigabytes of memory. In a way, BWA-SW follows the ’seed and extend’
paradigm by finding seeds between two FM-indices, relying on dynamic programming,
and it extends a seed when it has few occurrences in the reference sequence; mismatches and gaps are allowed in the seeds. BWA-MEM (Li, 2013), implemented as a component of BWA, also follows the ’seed and extend’ paradigm; however, it initially
seeds an alignment with supermaximal exact matches using an algorithm from Li (2012),
which essentially finds at each query position the longest exact match covering the pos-
ition. While extending a seed, BWA-MEM tries to keep track of the best extension score
reaching the end of the query sequence, as a strategy to automatically choose between
local and end-to-end alignment.
• SOAP2 (Li et al., 2009b) is an improvement of SOAP (Li et al., 2008b) where the BWT com-
pressed index is used instead of the seed algorithm for indexing the reference sequence
in the main memory; a hash table is built to accelerate searching the location of a read in
the BWT reference index and determine an exact match. SOAP3 (Liu et al., 2012a) is an
optimised version of SOAP2, that achieves a significant improvement in speed by adapt-
ing the BWT index to the graphic processing unit (GPU). SOAP3-dp (Luo et al., 2013) is the
enhanced version of SOAP3 that takes advantage of the GPU-based approach to perform
dynamic programming for aligning a read with a candidate region in the genome; a modified Smith-Waterman algorithm is implemented, reporting alignments with indels and gaps.
• CUSHAW (Liu et al., 2012c) exploits the compute unified device architecture (CUDA) to
parallelise and accelerate an algorithm based on BWT that resorts to a FM-index. At the
time of the article publication, CUSHAW did not allow insertions and deletions; thus, the
search for inexact matches was transformed into a search for exact matches of all permutations of the possible bases at every position of a short read. By default, CUSHAW
supports a maximal read length of 128 (can be configured up to 256). CUSHAW2 (Liu and
Schmidt, 2012) follows the ’seed and extend’ approach, using memory efficient versions of
BWT and FM-index to generate seeds for each read; these seeds are based on maximal ex-
act matches (MEM) — exact matches that cannot be extended in either direction without
allowing a mismatch. CUSHAW2 aims to map longer reads, using the seeds to find gapped
alignments and by employing vectorization and multithreading to achieve fast execution
speed on standard multi-core CPUs. The Smith-Waterman algorithm is implemented to
compute the optimal local alignment scores. CUSHAW3 (Liu et al., 2014) supports both
base space and colour space reads, and it was developed to improve alignment sensitivity
and accuracy of CUSHAW2. It relies on a hybrid seeding approach to improve alignment
quality that creates MEM seeds based on BWT and FM-index, exact match k-mer seeds,
and variable-length seeds at different phases of the alignment pipeline. However, the hy-
brid seeding approach improves the alignment sensitivity and accuracy at the cost of a
significant loss of processing speed.
• Masai (Siragusa et al., 2013) first constructs a conceptual suffix tree of the reference genome, stores it on disk and reuses it for each read mapping job; then, at mapping time, the strategy to create the seeds is chosen according to the reference genome and the specified error rate. Each seed reported by a multiple backtracking algorithm is extended at
both ends by a banded version of the Myers bit-vector algorithm (Myers, 1999) presented
in RazerS 3 (Weese et al., 2012).
2.3.3 Best-mapper vs All-mapper
A best-mapper prioritizes candidate locations, and returns one or a few best mapping locations
for each read, mainly to achieve an optimal combination of speed, accuracy, and memory efficiency; BWT-based algorithms, such as Bowtie (Langmead et al., 2009b), Bowtie 2 (Langmead and Salzberg, 2012) and the versions of BWA (Li and Durbin, 2009, 2010; Li, 2013), apply an exact match search to achieve that optimal combination. The hash-table based MAQ (Li et al.,
2008a) and SOAP (Li et al., 2008b) are also best-mappers. MAQ always reports a single alignment, choosing a best position randomly if a read can be aligned equally well to multiple positions; and SOAP reports the best hit of each read, the one with the minimal number of mismatches or the smaller gap. In case of equal best hits, the user can instruct the program to report all of them, randomly report one, or disregard them all.
However, for some NGS applications an all-mapping task is essential, e.g. prediction of genomic variants, identification of protein binding motifs located in repeat regions, or isoform expression quantification (Alkan et al., 2009; Hach et al., 2010; Newkirk et al., 2011). Although best-mappers
may have an option to report all mappings, since their algorithms are designed around finding a single best match, they might not perform as well as mappers specialised in identifying as many matches as possible, if not all, within a reasonable time — the all-mappers. Most all-mappers follow the
’seed and extend’ paradigm, in which locations reported by the seeds of a read are used as candidates for extending the alignment to the rest of the read. Some well-regarded all-mapping tools
are mrFAST (Alkan et al., 2009) and mrsFAST (Hach et al., 2010), RazerS 3 (Weese et al., 2012),
Hobbes (Ahmadi et al., 2012), Hobbes2 (Kim et al., 2014), Masai (Siragusa et al., 2013) and AMAS
(Hieu Tran and Chen, 2015). On the other hand, when requested, MOSAIK (Lee et al., 2014) also outputs all possible mapping locations for every read in a separate output file, behaving simultaneously as a best-mapper and an all-mapper.
2.4 Genomics meets Cloud Computing
Handling large amounts of data is a challenge well known in informatics, brought about by the Internet and the natural evolution and massification of technology. To deal with the massive growth in the number of websites and in the information available on the Internet, Google developed the MapReduce system to process huge quantities of data efficiently and in a timely manner. This programming model and system allows work to be distributed among large numbers of servers and carried out in parallel; soon after, an open source project implementing the Google MapReduce system emerged: the Apache™ Hadoop® framework (The Apache Software Foundation, 2015b). The
parallel data processing system of MapReduce excels at exhaustive processing — e.g., executing
algorithms that must examine every single record in a file in order to compute a result (Olson,
2010).
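The map-and-reduce model itself can be sketched without Hadoop, e.g. with Java parallel streams; the record type and the counting task here are illustrative only, chosen to keep the example self-contained:

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce model: the 'map' phase emits a key per
// record and the 'reduce' phase aggregates per key, both running in parallel.
public class MapReduceSketch {
    static Map<String, Long> countReads(List<String> reads) {
        return reads.parallelStream()
                .collect(Collectors.groupingByConcurrent(r -> r, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("ACGT", "ACGA", "TTGA", "ACGT");
        System.out.println(countReads(reads).get("ACGT")); // 2
    }
}
```

Hadoop applies the same split-then-aggregate idea across machines rather than across the cores of a single one.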
Cloud computing provides a scalable and cost efficient solution to manage large amounts
of data. It relies on a pay-per-use model that provides on-demand network access to a shared
platform of configurable computing resources, e.g. servers, storage, and services, which can
be rapidly provisioned and released with minimal management effort or service provider interaction. So, when it became more expensive to store, process, and analyse genomic data than to generate it, genomic algorithms started to leverage Hadoop (O’Driscoll et al., 2013), leading to the development of solutions such as Crossbow (Langmead et al., 2009a), a pipeline for single
nucleotide polymorphisms (SNPs) calling, CloudAligner (Nguyen et al., 2011) for sequence map-
ping, BioPig (Nordberg et al., 2013), a toolkit for sequence analysis, and CloudDOE (Chung et al.,
2014), a software to deploy a Hadoop cloud specifically thought for bioinformatics applications.
Apache Spark™ (The Apache Software Foundation, 2015a) is another MapReduce-based cluster computing framework, which supports applications with working sets. Spark can outperform Hadoop by 10x in iterative machine learning jobs and can be used interactively to scan
a 39 GB dataset with sub-second latency (Zaharia et al., 2010). SparkSeq (Wiewiórka et al., 2014),
SparkSW (Zhao et al., 2015), which relies on the Smith-Waterman (SW) algorithm (Smith and Waterman, 1981) to align the sequences, and eXpress-D (Roberts et al., 2013), which targets the
problem of reads mapped to multiple locations, are Spark-based tools for NGS data analysis.
Therefore, cloud computing and big data technologies have a future within biological sci-
ences and biomedical research, enabling users to rapidly interrogate the characteristically vast
datasets produced by NGS platforms. For instance, the work by Onsongo et al. (2014) demon-
strates how NGS data analysis paired with cloud computing can be safely and reasonably used
in a clinical molecular diagnostics laboratory.
Chapter 3
Read Mapping Pipeline
As discussed in the previous chapter, several genomic read mapping tools have been proposed
in the past that resort to different strategies. In this chapter, we introduce the architecture of
our proposed solution, which is based on a pipeline for read mapping. We start by providing
the high level view of the architecture and then we explain the most interesting implementation
details of our prototype, as well as our proposed algorithms for generating search keys and for
read alignment in relation to a reference genome. The source code for the implementation is available in a public repository [1].
Following the ’seed and extend’ strategy, our pipeline, first, creates an index from the ref-
erence sequence: a hash table in which a k-sized subsequence is the search key of each entry,
and the value is a list of the genomic positions where the subsequence can be found. Then, for
each read a subsequence of size k is retrieved serving as a key to search within the reference
sequence. When a hit is found — a seed — the whole read is aligned with the genomic sequence
— it is extended. What we propose here is to expand the search space by assigning more than
one search key for each read, i.e., to increase the number of seeds, and we developed a few al-
gorithms to do it. The extend part has an algorithm of its own to perform the alignment between
a read and a genomic subsequence, which can also be changed.
[1] https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
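The ’seed and extend’ flow just described can be sketched as follows; the class, method and variable names are illustrative and do not correspond to the actual modules of the pipeline:

```java
import java.util.*;

// Minimal 'seed and extend' sketch: hash the genome's k-mers, seed with the
// read's first k bases, extend by counting mismatches at each seed position.
public class SeedExtendSketch {
    // Build the index: each k-sized subsequence -> list of genomic positions.
    static Map<String, List<Integer>> index(String genome, int k) {
        Map<String, List<Integer>> idx = new HashMap<>();
        for (int i = 0; i + k <= genome.length(); i++)
            idx.computeIfAbsent(genome.substring(i, i + k), s -> new ArrayList<>()).add(i);
        return idx;
    }

    // Seed with the read's first k bases, then 'extend' the whole read.
    static List<Integer> map(String genome, Map<String, List<Integer>> idx,
                             String read, int k, int maxMismatches) {
        List<Integer> hits = new ArrayList<>();
        for (int pos : idx.getOrDefault(read.substring(0, k), Collections.emptyList())) {
            if (pos + read.length() > genome.length()) continue;
            int mism = 0;
            for (int j = 0; j < read.length(); j++)
                if (read.charAt(j) != genome.charAt(pos + j)) mism++;
            if (mism <= maxMismatches) hits.add(pos);
        }
        return hits;
    }

    public static void main(String[] args) {
        String genome = "GCCTAAGCCTAAGCCTAAGCCT";
        Map<String, List<Integer>> idx = index(genome, 4);
        System.out.println(map(genome, idx, "AAGCCTAA", 4, 2)); // [4, 10]
    }
}
```

The pipeline replaces the single key of this sketch with several per read, and the mismatch count with a proper alignment algorithm, as explained in the following sections.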
3.1 Architecture
Our approach follows a modular programming paradigm, where each module is responsible for
one part of the pipeline, and allows exploring different combinations of the key generation and read alignment algorithms:
• Genome: reads a FASTA file with the reference sequence in text format to be hashed. The
hashed reference sequence is stored here, as well as the information required by some of
the following modules of the pipeline;
• Read: gives a functional sense to the information retrieved from a FASTQ file (Cock et al.,
2010) that corresponds to a single read, which is characterised by its identification header,
base sequence, i.e. the read itself, and the probabilistic values associated with the quality
of each base in the read;
• Probabilities: the quality scores for each base in the read are encoded in ASCII; here they are converted to probabilistic values. It also calculates the probability values for the remaining bases, meaning their probability of being the correct base at that position of the read, assuming that the one called is wrong.
Since one of our goals is to try different algorithms to create search keys from the reads and to
align them with the reference sequence (this step effectively controls the regions of the reference genome that are inspected for each read), we also have the modules:
• Exploder — which will receive the algorithms to generate multiple search keys for a read;
• Aligner — to process the alignment between the Read base sequence and a substring of
the Genome using different implementations of the Needleman-Wunsch method (Needleman and Wunsch, 1970).
Finally, we also consider two different ways to present the results — one based on a top rank and another which is GNUMAP-based (Clement et al., 2010). The Combiner module handles the interconnection between the components presented, being responsible for managing the flow of the pipeline (Figure 3.1).
Figure 3.1: Read Mapping Pipeline scheme. The arrows indicate how the different modules are connected. In white are represented the modules that can use different algorithms to execute their assigned tasks.
3.2 Implementation
Java SE-1.7 was used to implement the prototype of our solution, for which the user provides a FASTA file — the reference sequence in text format —, a FASTQ file — containing a set of reads — and a .properties file (Figure 3.2) through the command line. The latter file contains all the parameters our pipeline needs, and gives the advantage of not having to retype all the required parameter values at the console each time we run the tool, i.e., it keeps the command line less verbose.
A Class (Oracle Corporation, 2015n) named Start has the main method (Oracle Corpora-
tion, 2015p), wherein the Genome and the Read objects (Oracle Corporation, 2015q) are created
from the files provided. Using an instance of the class Properties (Oracle Corporation, 2015h) we
can load the following relevant parameters from the .properties file:
• the k value to be used by classes Genome, to hash the reference sequence, and Read, to
return a k-sized key from the read that will serve as a basis for an instance of Exploder in the generation of new keys;
• the names of the classes used to instantiate the Exploder and Aligner modules, as we allow the user to write their own mechanisms for these modules and run them in our pipeline
with ease (an interface has to be respected when implementing new approaches for these
components);
• the name of the class that will be used to present the results for the class Combiner; and,
• the location where the output files should be stored.
k=10
exploder=keys_exploder.Exploder_0
aligner=alignment.NW
comb_type = util.LLC_Comb
cuttof = 0.90
threshold = 0.90
top = 3
output_dir = /local
DEBUG=false

Figure 3.2: Example of the content of a .properties file. In the case of the algorithms, the user must pay attention to the names, since the code is divided into packages (Oracle Corporation, 2015o). Here, the subclass Exploder 0 is set to instantiate the Exploder and NW the Aligner; although both top and threshold have values assigned, the Combiner object will be created with LLC_Comb, which uses the display based on a top rank.
In addition, the .properties file has three parameters whose use depends on the class names given: the cutoff, which is only used by some instances of the class Exploder; the top, which is required to present the results based on a top rank; and the threshold, in case the user chooses the GNUMAP-based (Clement et al., 2010) way to sort the results. Hence, an object of the class LLCProperties is created with the mentioned Properties instance, allowing these parameters to be retrieved without passing them through the constructors of the classes Exploder and Combiner. This way, the values for the cutoff, the top and the threshold are only fetched when needed (Figure 3.2).
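Loading such a file boils down to the standard java.util.Properties mechanism; a minimal sketch follows, with the file content inlined so the example is self-contained (the pipeline reads the file given on the command line instead, and the class name here is illustrative):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Parse .properties content in the format shown in Figure 3.2.
public class PropertiesDemo {
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for in-memory input
        }
        return props;
    }

    public static void main(String[] args) {
        String content = "k=10\nexploder=keys_exploder.Exploder_0\ncuttof = 0.90\ntop = 3\n";
        Properties props = parse(content);
        int k = Integer.parseInt(props.getProperty("k"));
        double cutoff = Double.parseDouble(props.getProperty("cuttof"));
        System.out.println(k + " " + props.getProperty("exploder") + " " + cutoff);
        // 10 keys_exploder.Exploder_0 0.9
    }
}
```

Note that Properties.load skips the whitespace around the "=" sign, so "cuttof = 0.90" and "cuttof=0.90" are equivalent.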
3.2.1 Multithreading
Mapping billions of reads to a wide reference genome is a computationally heavy process, even
more if we intend to expand the search space within the reference genome. Resorting to multithreaded parallel computing, as Li et al. (2008b) did and others followed, will lighten the process and make it globally faster by distributing the work across various processors. Hence, the multithreading feature, supported by all current operating systems, is extensively used by our prototype, so that we can distribute the work among the processor cores available in a machine.
The class Start holds the main functionality that materialises the use of multithreading.
A thread pool is established using the newFixedThreadPool method (Oracle Corporation, 2015r)
and, to execute a new thread, the class ReadProcessing, which implements the interface Runnable (Oracle Corporation, 2015l), was created.
We have as many threads as the number of processors available to the Java virtual machine
(minus one, which we decided to keep free to minimize interference with other operative system
tasks), and an equal number of lists with Read objects and of FileWriter instances — to write the
output files — are also created and associated with each of the processing threads. This allows
threads to own their own resources and operate with minimal coordination, a key aspect to
ensure a good performance. Afterwards, each thread, executed with a new instance of the class
ReadProcessing, receives a list of Read objects, the Genome object, a FileWriter object and the
information retrieved from the .properties file. For each Read instance, the implemented run()
method creates an instance of the class Combiner to call its combine method to process the
mapping pipeline. The pipeline results are retrieved with the getResults method from Combiner
and an instance of the class StringBuffer (Oracle Corporation, 2015i) constructs a String from
them, which the FileWriter object writes in an output file.
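The scheme above — a fixed pool sized to the available cores, with each task owning its slice of the reads — can be sketched as follows; the per-read work is replaced by a simple counter, and the names are illustrative rather than the prototype's actual classes:

```java
import java.util.*;
import java.util.concurrent.*;

// Fixed thread pool where each task owns a disjoint (strided) slice of the
// reads, so the threads need no coordination while working.
public class ThreadPoolSketch {
    static int processAll(List<String> reads, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                results.add(pool.submit(() -> {
                    int processed = 0;
                    // This thread's slice: indices id, id + threads, ...
                    for (int i = id; i < reads.size(); i += threads) processed++;
                    return processed;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // One thread per core, minus one kept free, as in the prototype.
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        System.out.println(processAll(Arrays.asList("ACGT", "TTGA", "CCGA", "GGAT"), threads)); // 4
    }
}
```

Every read is processed exactly once regardless of the pool size, which is the property the partitioning of Read lists in the prototype also guarantees.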
3.2.2 Genome
The Genome class was created to manage the information retrieved from the FASTA file, which contains the genomic sequence in text format. First, the genomic reference sequence is assembled and an index is created, where each entry has a subsequence — the key — and a list
of genomic positions in which the key is found — the value. Therefore, an instance of this class
is constructed with the name of the FASTA file and the value of k. Then, a StringBuffer instance
creates a String from the sequence lines read from the FASTA file — i.e., genome. The genome
is then converted to a character array, so it can be iterated and directly accessed (for performance), and a HashMap (Oracle Corporation, 2015e) structure stores each k-sized sequence of bases — the key — and an ArrayList (Oracle Corporation, 2015b) with its genomic positions — the value.
Afterwards, the index will be scanned to retrieve the genomic locations that match a subsequence — a key — from the read, i.e., a seed is searched. Once a hit is found, for the extend part a genomic subsequence needs to be aligned with the read; hence, this class also implements the method genSeq, which, given the reference sequence as an array of characters, the size of the read and the list of genomic positions obtained from the seed search, returns an ArrayList containing a collection of SimpleEntry (Oracle Corporation, 2015a) instances. Because we need to know where the read was mapped (in case of a positive alignment), each SimpleEntry instance is composed of a subsequence of the genome to be aligned — a character array — and
its respective position — an Integer.
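The indexing and genSeq steps just described can be sketched as follows; this is an approximation for illustration only, not the actual Genome class, whose state and signatures differ:

```java
import java.util.*;

// Sketch of the Genome index plus a genSeq-like step that pairs each
// read-sized genomic subsequence with its position.
public class GenomeSketch {
    static Map<String, ArrayList<Integer>> index(char[] genome, int k) {
        Map<String, ArrayList<Integer>> idx = new HashMap<>();
        for (int i = 0; i + k <= genome.length; i++)
            idx.computeIfAbsent(new String(genome, i, k), s -> new ArrayList<>()).add(i);
        return idx;
    }

    // Pair each candidate region with its position, so a positive alignment
    // can be traced back to the genome.
    static List<AbstractMap.SimpleEntry<char[], Integer>> genSeq(
            char[] genome, int readLength, List<Integer> positions) {
        List<AbstractMap.SimpleEntry<char[], Integer>> regions = new ArrayList<>();
        for (int pos : positions)
            if (pos + readLength <= genome.length)
                regions.add(new AbstractMap.SimpleEntry<>(
                        Arrays.copyOfRange(genome, pos, pos + readLength), pos));
        return regions;
    }

    public static void main(String[] args) {
        char[] genome = "GCCTAAGCCTAAGCCTAA".toCharArray();
        Map<String, ArrayList<Integer>> idx = index(genome, 4);
        for (AbstractMap.SimpleEntry<char[], Integer> e : genSeq(genome, 6, idx.get("TAAG")))
            System.out.println(new String(e.getKey()) + " @ " + e.getValue());
        // TAAGCC @ 3
        // TAAGCC @ 9
    }
}
```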
3.2.3 Read
In the method that serves as the entry point to the class Start, the FASTQ file is read; each Read
object is created with the identification header, base sequence and line of scores for a read, and
with the value of k. The ASCII coded scores are used to create an array of Probabilities instances,
from which we get the probability associated with each of the four bases for each position of the
read.
The class Read implements the following methods:
• simpleKey — that returns a k-sized subsequence from the beginning of the read;
• bestKey — wherein the subsequence of k bases with the best score, obtained by a sliding
window algorithm (Clement et al., 2010) (Figure 3.3), is returned; and
• getSeqAlign — which returns the read as a character array to be used at the alignment by
an instance of the class Aligner.
The methods simpleKey and bestKey will be used by classes of the module Exploder.
Figure 3.3: Sliding window. A k-sized window moves one position in the sequence at a time to retrieve a subsequence with k bases (Clement et al., 2010).
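The bestKey sliding window can be sketched as follows, assuming the window score is the sum of the per-base probabilities of being correct; the actual scoring used in the pipeline and in GNUMAP may differ in detail, and the names are illustrative:

```java
// Slide a k-sized window over the read and keep the window whose bases have
// the highest summed probability of being correct.
public class BestKeySketch {
    static int bestKeyStart(double[] p, int k) {
        int bestStart = 0;
        double bestScore = -1;
        for (int i = 0; i + k <= p.length; i++) {
            double score = 0;
            for (int j = i; j < i + k; j++) score += p[j]; // window score
            if (score > bestScore) { bestScore = score; bestStart = i; }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String read = "ACGTACGTTG";
        // Per-base probabilities of being correct (decoded from FASTQ scores).
        double[] p = {0.2, 0.9, 0.9, 0.9, 0.9, 0.3, 0.4, 0.9, 0.9, 0.9};
        int start = bestKeyStart(p, 4);
        System.out.println(read.substring(start, start + 4) + " @ " + start); // CGTA @ 1
    }
}
```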
3.2.4 Probabilities
In the FASTQ file, the scores for each base called are ASCII coded and have values between 33 and 126 (Cock et al., 2010); we convert them to values between 0 and 93 (Phred scale) and then to probabilistic values using the Phred equation:

Q_Phred = −10 × log10(P_e)

which relates the quality score to the probability of error (P_e) for the base called. We want the probability of each base being correct (P), given by

P = 1 − 10^(−Q_Phred/10)

Since the FASTQ file only gives the quality for the base called, we assume the uncalled bases all have the same probability:

P_uncalled_base = (1 − P) / 3

In the case of the "unknown" character "N", each of the four bases has a 25% probability of being correct. Hence, a Probabilities object stores, for each of the four bases and "N", their probabilities of being correct.
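The conversion above can be sketched as follows, assuming Sanger-style FASTQ encoding (ASCII offset 33); the class and method names are illustrative, not those of the actual Probabilities class:

```java
// Phred conversion: ASCII quality character -> probability of a correct call.
public class PhredSketch {
    // Probability that the called base is correct, from its ASCII quality char.
    static double probCalledCorrect(char ascii) {
        int qPhred = ascii - 33;                  // Phred scale, 0..93
        return 1.0 - Math.pow(10.0, -qPhred / 10.0);
    }

    // Each of the three uncalled bases shares the remaining probability.
    static double probUncalled(char ascii) {
        return (1.0 - probCalledCorrect(ascii)) / 3.0;
    }

    public static void main(String[] args) {
        char q = '+';                             // ASCII 43 -> Q_Phred = 10
        System.out.println(probCalledCorrect(q)); // ~0.9
        System.out.println(probUncalled(q));      // ~0.0333
    }
}
```

For example, a quality character of '+' (ASCII 43) corresponds to Q_Phred = 10, i.e. a 1-in-10 chance of error, so the called base is correct with probability 0.9 and each uncalled base with probability 0.1/3.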
3.2.5 Exploder
After the management and storage of the information from the input files, the next step in our
pipeline is to generate the keys for the reads to scan the reference genome. In this work, we
propose to expand the search space by assigning more than one search key to each read (thus, ’exploding’ the number of keys from one to several), and we developed a series of algorithms to do it. An Abstract class (Oracle Corporation, 2015m) allows sharing the common code and parameters between the subclasses that implement these algorithms, and easily selecting one of them to create an Exploder object that performs the task.
Our series of exploding algorithms relies on a k-sized subsequence of the read as a basis to generate new keys. So, for a simple comparison, the algorithm from the subclass Exploder 00
calls the method simpleKey from class Read, to simply return the first k bases of the sequence
as a key retrieved from position zero. On the other hand, the subclass Exploder 0 returns the key computed by the method bestKey from class Read, along with its position in the read; the best
key is retrieved by a sliding window algorithm (Clement et al., 2010) and corresponds to the
subsequence of k bases with the best score.
The subclass Exploder 1 implements an algorithm where the best key serves as a template to generate new keys resorting to base permutation (Figure 3.4); the subclasses Exploder 2, 3 and 4 implement versions of that algorithm where biological constraints and/or quality values are used to reduce the number of keys generated. We also have the subclass GNUMAP, which implements the algorithm used by Clement et al. (2010) in their tool to create keys for the
reads.
The class Exploder needs a Read object and the value of k to be instantiated, and collects, in an ArrayList, SimpleEntry objects composed of a String — the base sequence generated — and an Integer — its position in the read. The different algorithms are coded in the abstract method explode, implemented in the referred subclasses, which takes the Read object, an index value, so the algorithm knows where to start, and two ArrayLists, one to record the temporary results and another for the final results. The executeExplosion method calls explode
with index zero and the required parameters; and explodeKeys returns the ArrayList with the final results. When needed, the cutoff value is retrieved from the .properties file using an instance of the LLCProperties class.

Figure 3.4: Scheme of the algorithm for the best key explosion (illustrated for the best key CTCACCCGTT). Each position of the best key has its base exchanged for each of the three remaining bases to generate three new keys. The keys created will expand the search space over the genome for each read. Since the key has a size of 10 bases, in this example 1 048 576 keys would be returned to be searched.
Exploder 1
Our approach followed the idea of expanding the reference sequence search space by taking
into account more than one key for each read. The algorithm implemented in this subclass follows the one depicted in Figure 3.4, where from each position of the best key three new keys are generated by a base exchange. In other words, since we have four nucleotide bases (A, C, G and T), for each position three new keys are created by exchanging the current base for each of the remaining three. The new keys generated will go through the same base permutation at the next position. Therefore, this algorithm generates new keys assuming every base called could be wrong, exploding the number of search keys from one to 4^k.
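The explosion can be sketched recursively, as below; the names are illustrative, and the real subclass works over SimpleEntry lists with read positions rather than plain strings:

```java
import java.util.*;

// Enumerate every key obtainable by substituting any subset of positions of
// a best key with the other three bases, i.e. all 4^k k-mers.
public class Exploder1Sketch {
    static final char[] BASES = {'A', 'C', 'G', 'T'};

    static void explode(char[] key, int index, List<String> out) {
        if (index == key.length) { out.add(new String(key)); return; }
        char original = key[index];
        for (char b : BASES) {           // the original base plus the three exchanges
            key[index] = b;
            explode(key, index + 1, out);
        }
        key[index] = original;           // restore before backtracking
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        explode("ACG".toCharArray(), 0, keys);
        System.out.println(keys.size()); // 64 = 4^3
    }
}
```

For a k of 10 this already yields over a million keys per read, which motivates the restricted variants below.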
Exploder 2
Scanning a wide reference genome in search of 4^k keys, even with parallel computation, requires great processing power. Therefore, to narrow the number of keys to search, the algorithm implemented follows a scheme similar to the one depicted in Figure 3.4, but the base permutation only occurs if the base probability (retrieved from the array of Probabilities objects) is lower than
the cutoff value. Thus, only positions in which the base called has a low probability of being
correct will generate three new keys using the base permutation.
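The quality gate can be sketched as follows; none of these identifiers come from the actual code, and the probability array stands in for the Probabilities objects described above:

```java
import java.util.ArrayList;
import java.util.List;

public class Exploder2Sketch {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Permutes a position only when the probability of the called base being
    // correct falls below the cutoff; confident positions keep their base.
    public static List<String> explode(String bestKey, double[] probs, double cutoff) {
        List<String> keys = new ArrayList<>();
        keys.add(bestKey);
        for (int pos = 0; pos < bestKey.length(); pos++) {
            if (probs[pos] >= cutoff) continue;          // confident call: no permutation
            List<String> expanded = new ArrayList<>();
            for (String key : keys) {
                for (char b : BASES) {
                    if (b == key.charAt(pos)) continue;  // three new keys per existing key
                    char[] chars = key.toCharArray();
                    chars[pos] = b;
                    expanded.add(new String(chars));
                }
            }
            keys.addAll(expanded);
        }
        return keys;
    }
}
```

With m low-confidence positions this produces 4^m keys instead of 4^k, which is the narrowing described above.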
Exploder 3
Another strategy to narrow the number of keys generated by the algorithm from Exploder 1
is to consider biological constraints when performing the base exchange. There are two types
of nucleotide base substitution (Figure 3.5): between two of the two-ring purines or two of the
one-ring pyrimidines (transition), and between one purine and one pyrimidine (transversion)
(Freese, 1959).
[Diagram: transitions link A and G (purines) and C and T (pyrimidines); transversions link a purine to a pyrimidine.]
Figure 3.5: Definition of transition and transversion. The nitrogenous bases are divided in two groups: pyrimidines, which include Cytosine (C) and Thymine (T), and purines, the double-ringed bases, which include Adenine (A) and Guanine (G).
Due to the degeneracy of the genetic code, a transition is more likely to encode the same
amino acid, while transversions have more pronounced effects. As one can see in Figure 3.5,
there are twice as many possible transversions as transitions; however, approximately two out
of three single nucleotide polymorphisms (SNPs) are transitions (Collins and Jukes, 1994).
Accordingly, the algorithm of this subclass follows the scheme of Figure 3.4 taking only
transitions into account, i.e., for each position of the best key a new key is created by exchanging
the current base for its molecularly similar base. The new keys generated go through the same
base permutation at the next position. Thus, 2^k keys are returned to be searched.
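A minimal sketch of the transition-only explosion, with names of our own choosing (A pairs with G, C pairs with T):

```java
import java.util.ArrayList;
import java.util.List;

public class Exploder3Sketch {

    // Transition partner: A <-> G (purines), C <-> T (pyrimidines).
    public static char transition(char base) {
        switch (base) {
            case 'A': return 'G';
            case 'G': return 'A';
            case 'C': return 'T';
            default:  return 'C';  // 'T'
        }
    }

    // Each position is either kept or swapped for its transition partner,
    // so a key of size k yields 2^k search keys.
    public static List<String> explode(String bestKey) {
        List<String> keys = new ArrayList<>();
        keys.add(bestKey);
        for (int pos = 0; pos < bestKey.length(); pos++) {
            List<String> expanded = new ArrayList<>();
            for (String key : keys) {
                char[] chars = key.toCharArray();
                chars[pos] = transition(chars[pos]);
                expanded.add(new String(chars));
            }
            keys.addAll(expanded);
        }
        return keys;
    }
}
```

For k = 10 this gives 2^10 = 1024 keys, a large reduction over the 4^k of Exploder 1.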
Exploder 4
The number of keys created by Exploder 3 can be decreased if we take the base quality scores
into account. This way, the algorithm implemented by the subclass Exploder 4 restricts the key
production seen in Exploder 3 by generating a new key only if the current position of the best
key has a base called with a probability lower than the cutoff value. This means that Exploder 4
creates 1/3 of the keys compared with Exploder 2, when the best key has bases with a probability
lower than the cutoff value.
GNUMAP
Finally, we use a GNUMAP-based algorithm to explode keys wherein a consensus sequence of
bases is created, meaning bases with a lower probability of being correct are switched for one
of the remaining bases that has a higher probability; this approach was designed for the files
produced in the Solexa/Illumina pipeline, where a _prb.txt file holds probabilities for each of
the four bases (Clement et al., 2010). A sliding window (Figure 3.3) goes through the consensus
string and, if the k-sized sequence does not contain a single "N", it is taken as a key together
with its position; otherwise, the k-sized window moves to the next position.
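The sliding-window step can be sketched in a few lines; the class and method names here are ours, and SimpleEntry mirrors the (key, position) pairs used elsewhere in the pipeline:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    // Slides a k-sized window over the consensus string; any window that
    // contains an 'N' is skipped, the others become (key, position) pairs.
    public static List<SimpleEntry<String, Integer>> keys(String consensus, int k) {
        List<SimpleEntry<String, Integer>> result = new ArrayList<>();
        for (int pos = 0; pos + k <= consensus.length(); pos++) {
            String window = consensus.substring(pos, pos + k);
            if (window.indexOf('N') < 0) {
                result.add(new SimpleEntry<>(window, pos));
            }
        }
        return result;
    }
}
```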
3.2.6 Aligner
Bioinformatics often resorts to dynamic programming to find an alignment between two
sequences; an example is the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), a
global sequence alignment algorithm. Although developed to align two full-length proteins, it
can also be applied to nucleotide sequences. This dynamic programming algorithm guarantees
to find the correct optimal alignment between two sequences of length n that are similar
across their entire lengths (as expected to occur between a read and the genomic subsequence
retrieved).
Since we aim to try three versions of the Needleman-Wunsch method that only differ in
two aspects (how an alignment score is calculated and the value attributed to a gap), we
also implemented the module Aligner as an abstract class. The class Aligner has the task of
aligning the base sequence of each Read (read), a character array obtained through the method
getSeqAlign, to a subsequence of the Genome (g_seq), a character array from the method
genSeq. Algorithms 1 to 3 present the pseudocode followed in our implementation of the
Needleman-Wunsch method, where a matrix is built for the alignment.
Algorithm 1 Initialise matrix.
    matrix ← [length(g_seq) + 1][length(read) + 1]
    for i = 0 to length(g_seq) do
        for j = 0 to length(read) do
            if i = 0 then
                matrix[i][j] ← -j
            else if j = 0 then
                matrix[i][j] ← -i
            else
                matrix[i][j] ← 0
            end if
        end for
    end for
Algorithm 2 Fill matrix.
    for i = 1 to length(g_seq) do
        for j = 1 to length(read) do
            Match ← matrix[i-1][j-1] + weight(i, j)
            Insertion ← matrix[i][j-1] + gap()
            Deletion ← matrix[i-1][j] + gap()
            matrix[i][j] ← max(Match, Deletion, Insertion)
        end for
    end for
In Algorithm 1, the first line (index 0) of the matrix represents the genomic sequence
and the first column (index 0) the read; the rest of the matrix is filled with scores according
to the equations present in Algorithm 2. At the end, Algorithm 3 deduces the best alignment
Algorithm 3 Compute Alignment.
    AlignmentGen ← ""
    AlignmentRead ← ""
    i ← length(g_seq)
    j ← length(read)
    while i > 0 and j > 0 do
        if matrix[i][j] = matrix[i-1][j-1] + weight(i, j) then
            AlignmentGen ← AlignmentGen + g_seq[i-1]
            AlignmentRead ← AlignmentRead + read[j-1]
            i ← i - 1
            j ← j - 1
        else if matrix[i][j] = matrix[i][j-1] + gap() then
            AlignmentGen ← AlignmentGen + "-"
            AlignmentRead ← AlignmentRead + read[j-1]
            j ← j - 1
        else
            AlignmentGen ← AlignmentGen + g_seq[i-1]
            AlignmentRead ← AlignmentRead + "-"
            i ← i - 1
        end if
    end while
    reverse(AlignmentGen)
    reverse(AlignmentRead)
tracing back the matrix starting from the last cell to be filled with the scores, i.e., the bottom-right
cell. From there it moves, according to the score values, in three possible directions: diagonally
(towards the top-left corner of the matrix), in which case the bases from the two sequences are
aligned; left, in which case we assume an insertion relative to the genome and a gap is introduced
in the genomic subsequence; or up, in which case we assume a deletion occurred and a gap is
introduced in the read sequence. Once the top-left cell is reached the alignment is complete,
and since the sequences are aligned backwards, the resulting strings must be reversed.
The following classes implement the abstract methods weight(i, j), which scores the aligned
characters, and gap(), the value added when a character aligns with a gap, used in Algorithms
2 and 3. Since in the matrix the values for the genomic sequence start at (1, 0) and for the
read at (0, 1), the matrix cell (i, j) corresponds to the alignment between the characters
g_seq[i-1] and read[j-1].
NW
Corresponds to a simple implementation of the Needleman-Wunsch algorithm in which gap()
returns the value -1 and weight(i, j) returns 1 when g_seq[i-1] matches read[j-1], or -1 in case
of a mismatch.
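A self-contained sketch of this scoring scheme (the class name is ours; the thesis implementation operates on Read and Genome objects rather than plain strings):

```java
public class SimpleNWSketch {

    static int weight(char a, char b) { return a == b ? 1 : -1; }  // match 1, mismatch -1
    static int gap() { return -1; }

    // Fills the Needleman-Wunsch matrix as in Algorithms 1 and 2 and
    // returns the optimal global alignment score.
    public static int score(String gSeq, String read) {
        int n = gSeq.length(), m = read.length();
        int[][] matrix = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) matrix[i][0] = -i;   // leading gaps in the read
        for (int j = 0; j <= m; j++) matrix[0][j] = -j;   // leading gaps in the genome
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int match = matrix[i - 1][j - 1]
                        + weight(gSeq.charAt(i - 1), read.charAt(j - 1));
                int insertion = matrix[i][j - 1] + gap();
                int deletion = matrix[i - 1][j] + gap();
                matrix[i][j] = Math.max(match, Math.max(insertion, deletion));
            }
        }
        return matrix[n][m];
    }
}
```

Tracing back through the same matrix, as in Algorithm 3, recovers the aligned strings.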
NW plus Similarity Matrix
In this class, the Needleman-Wunsch method is enriched with a similarity matrix (Figure 3.6
(a)), which means a perfect match has the value of 2 and a match between similar bases scores
1. Thus, if a transition (Figure 3.5) occurs in the alignment it is not dismissed as a mismatch
that lowers the alignment score, but contributes to the global score. In other words, this method
allows for mutations due to similar base exchange in the alignment.
The method gap() returns the value -1, and the method weight(i, j) returns a value following
the similarity matrix (Figure 3.6 (a)), where a perfect match has the value of 2 and a mismatch
-1. Because we take base similarity into account in the alignment, if an 'N' occurs either in the
reference sequence or in the read the similarity matrix returns zero.
GNUMAP
This version of the alignment method is based on GNUMAP (Clement et al., 2010), where
gap() returns the value of -4 and the weight is calculated taking into account the probabilities
of the four bases and a simple matrix (Figure 3.6 (b)) (mostly to ease the implementation, since
it follows the same score system of the first version, NW):

    weight(i, j) = Σ_{b ∈ {A, C, G, T}} P_b × cost(g_seq[i-1], b)

where cost(g_seq[i-1], b) is retrieved from the matrix.
(a)
         A    G    C    T
    A    2    1   -1   -1
    G    1    2   -1   -1
    C   -1   -1    2    1
    T   -1   -1    1    2

(b)
         A    G    C    T
    A    1   -1   -1   -1
    G   -1    1   -1   -1
    C   -1   -1    1   -1
    T   -1   -1   -1    1

Figure 3.6: Similarity Matrices. (a) Matrix used in the subclass "NW plus Similarity Matrix"; (b) matrix for the GNUMAP-based implementation of the Needleman-Wunsch method. In both cases, the implementation of the matrix returns zero when an "N" appears in the alignment.
From the Read object we have an array of Probabilities objects with the probabilities of each
of the four bases being the correct one. The method weight(i, j) sums all these probabilities,
each weighted by the alignment score retrieved from the matrix in Figure 3.6 (b). This means
this version of the alignment method does not simply dismiss a mismatch; it assumes all the
bases have a chance of being the right one.
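The probabilistic weight above reduces to a short loop. The sketch below uses our own names and a plain probability array in place of the Probabilities objects:

```java
public class ProbWeightSketch {

    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Cost from the matrix in Figure 3.6 (b): 1 for a match, -1 otherwise;
    // 0 when an 'N' is involved.
    static int cost(char genomeBase, char b) {
        if (genomeBase == 'N') return 0;
        return genomeBase == b ? 1 : -1;
    }

    // weight(i, j) = sum over b in {A, C, G, T} of P_b * cost(g_seq[i-1], b)
    public static double weight(char genomeBase, double[] probs /* P_A, P_C, P_G, P_T */) {
        double w = 0.0;
        for (int b = 0; b < BASES.length; b++) {
            w += probs[b] * cost(genomeBase, BASES[b]);
        }
        return w;
    }
}
```

A confident, correct call (e.g. P_A = 1 against genome base A) gives weight 1, matching the simple NW score, while uncertain calls produce intermediate values.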
3.2.7 Combiner
At the supporting class ReadProcessing, an instance of the class Combiner is created using the
Genome and Read objects and the k parameter. The class Combiner is responsible for managing
the connections between the components of the pipeline (Figure 3.1). To accomplish this we
implemented the method combine, which is invoked with the names of the classes that
implement the algorithms to be used to materialise the modules Exploder and Aligner; these
names are taken from the .properties file. Therefore, this method drives the mapping process
by calling the following methods, also from the class Combiner, to combine the algorithms:
• getKeys: receives the name of the algorithm for exploding keys (a String object), the
Read object and k; then an instance of the class Exploder is created to invoke its methods
executeExplosion and explodeKeys. Afterwards, an ArrayList of SimpleEntry<String,
Integer> is returned with the keys created and their respective positions in the read;
• keySearch: takes the ArrayList from the previous method and the Genome object. The
reference genome is searched for each key and the resulting hit positions are returned
in an ArrayList<Integer>. The position of each key within the read is taken into account
in the search, i.e., if the search key came from read position 5, the genomic position (p)
returned would be p - 5. A HashSet (Oracle Corporation, 2015f) is used to temporarily
store the genomic positions found, to avoid repeated genomic locations for the same
read;
• computeAlignment: with the alignment algorithm name, the Genome and the Read
objects and the ArrayList returned from keySearch, this method computes the alignment.
First, for each genomic position found, a SimpleEntry composed of a subsequence of the
genome to be aligned (a character array) and its respective position (an Integer) is
retrieved with the method genSeq from the class Genome; then an Aligner instance is
created with the given name, the key of the SimpleEntry<String, Integer> and the Read.
The alignment result is recorded in a new instance of AlignerResult, which is created with
the read header from the FASTQ file, the sequence obtained from the alignment, its
genomic position (the value of the SimpleEntry) and its score.
PositionScore is another supporting class created to manage the alignment results; it has
the same parameters as AlignerResult, but allows showing the results according to the position
found in the genome or the score obtained for the alignment. PositionScore is an abstract class
with the following concrete implementations:
• PositionScore_S: implements the methods to compare the scores obtained for the
alignment; and
• PositionScore_P: compares the results by the positions found in the genome.
Both rely on the interface Comparable<T> (Oracle Corporation, 2015k) in the implementation.
Two different implementations of the Combiner were created, each having a different
implementation of the results_display() abstract method:
• GNUMAP_Comb: organises the results considering the score processing method provided
by GNUMAP (Clement et al., 2010), wherein the scores are normalised and only the ones
greater than a given threshold value are displayed. The scores are sorted by the position of
the alignment, so the PositionScore objects were instantiated as PositionScore_P and
stored in a TreeSet (Oracle Corporation, 2015j). The threshold value is retrieved from
the .properties file using an instance of LLCProperties;
• LLC_Comb: first, the results are sorted by score and then the top scores are searched
and ordered by position in the genome. Both subclasses of PositionScore were used to
instantiate the objects, which were stored in TreeSet structures. The top value is
retrieved from the .properties file using an instance of LLCProperties.
Finally, the method getResults() returns the ArrayList of PositionScore objects sorted as described
above.
3.2.8 Abstract Classes Instantiation
To ease the creation and usability of new subclasses for the abstract classes Exploder, Aligner
and Combiner, the instantiation of their objects only requires the name of the chosen subclass
(and the respective arguments). Thus, the respective constructor, obtained with the Java call
"Class.forName(name).getConstructors()[0]" (Oracle Corporation, 2015c), and the class
Constructor<T> (Oracle Corporation, 2015d), with its method newInstance, dynamically
instantiate the class, provided that it exists, passing to its constructor the required arguments
(an array of Objects (Oracle Corporation, 2015g)).
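The mechanism reduces to a few lines of reflection. A minimal sketch (the class and method names are ours):

```java
import java.lang.reflect.Constructor;

public class DynamicInstantiationSketch {

    // Instantiates a class by name using its first public constructor,
    // passing the given arguments, as described above.
    public static Object instantiate(String className, Object[] args) throws Exception {
        Constructor<?> ctor = Class.forName(className).getConstructors()[0];
        return ctor.newInstance(args);
    }
}
```

Adding a new Exploder, Aligner or Combiner subclass then only requires placing its name in the .properties file; no call sites need to change.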
Chapter 4
Results and Discussion
In the previous chapter, the read mapping pipeline created in this thesis was introduced as
standing on the paradigm of modular programming. This feature enables plugging in the
different algorithms implemented to generate search keys from a read and to align that read to
a candidate region of the reference sequence.
Therefore, to create an Exploder object we have the algorithms implemented in the classes:
Exploder 00, which returns a k-sized subsequence from the beginning of a read; Exploder 0,
where the search key created corresponds to the best key, the k-sized subsequence of a read
with the best quality values; Exploders 1, 2, 3 and 4, which use the best key as a template to
generate new search keys relying on base permutation, taking into account base similarity
and/or quality values; and GNUMAP, where the search keys are created by dividing the reads
into overlapping k-sized subsequences, as in GNUMAP (Clement et al., 2010). As for the
alignment task, three versions of the Needleman-Wunsch method were implemented: a simple
one (NW); the NW plus SM, where the method is enriched with a base similarity matrix; and
the GNUMAP-based NW, which considers the probability of each base being the correct one in
the alignment (Clement et al., 2010).
We tested our read mapping pipeline to evaluate its performance regarding scalability
(the time required to execute the mapping as the number of reads to map grows), coverage
(the percentage of reads that are effectively mapped) and precision (mapping reads to the
correct location in the reference genome). In this chapter, we present the results obtained for
three simulated datasets and draw the most relevant observations. Additionally, we executed
the prototype for a strain of Escherichia coli with real data.
First of all, the simulated datasets were created resorting to the ArtificialFastqGenerator
(Frampton and Houlston, 2012), which takes a reference sequence as input and outputs artificial
FASTQ files and .readStartIndexes files, which provide the positions of the reference sequence from
where the reads were retrieved. With ArtificialFastqGenerator we can use real Phred base quality
scores from existing FASTQ files and simulate sequencing errors. Hence, from the sequence of
Mus musculus (house mouse) chromosome 19¹, which has over 61 mega-base-pairs, a FASTQ
file with 1 494 305 reads of 100 bases was generated with a coverage mean peak of 10, i.e.,
the peak coverage mean for a region of the sequence. In addition, the run SRR000868 from the
454 sequencing of the Escherichia coli UTI89 (Chen et al., 2006) genomic fragment library²,
which corresponds to a FASTQ file, was used to retrieve the real base quality scores, and
sequencing errors were simulated. Afterwards, from the FASTQ file created we made three
datasets of different sizes (1000, 2000 and 3000) of randomly selected reads; these datasets were the
input to test our read mapping tool. Note that we use artificially generated read sets as this is
the only alternative to allow us to compute the precision of the algorithm (as we effectively know
the correct position of each mapped read).
Since we want to compare the probabilistic variant of the Needleman-Wunsch algorithm,
created for GNUMAP (Clement et al., 2010) to accurately map reads with lower confidence val-
ues, with simpler versions of the alignment method, the reads were not preprocessed to remove
the ones with lower quality. For the tests, we set the cutoff value, required by the algorithms of
the Exploder, at 0.90 and used a top 3 to display the results. Exploder 1 was excluded because
it generates 4^k keys, where k equals 10 in our tests, which implies a huge processing power.
¹ Mus musculus chromosome 19 sequence: http://www.ebi.ac.uk/ena/data/view/CM001012
² Run SRR000868: http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR000868
4.1 Scalability
To see how each combination of algorithms for the Exploder and the Aligner components of
the pipeline perform in terms of scalability, we plotted two types of graphics: one that relates
runtime, in hours, with the number of reads in each dataset (Figures 4.1 to 4.6) and another
that compares the combinations for the three datasets (Figures 4.7 to 4.12). We executed our
read mapping tool on two machines with different features:
• Machine 1: has 63 AMD Opteron(TM) 6272 processors with 1400.000 MHz of speed and
63 Gb of memory;
• Machine 2: has 23 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz processors with 2400.398
MHz of speed and 62 Gb of memory.
Thus, although Machine 2 has fewer processor units than Machine 1, it requires less time to
execute our tool (Figures 4.1 to 4.12), as each processor is faster, which shows that the tool is
CPU intensive (and CPU bound).
From Figures 4.1 to 4.6 we can observe a clear relation between the number of reads
and the runtime; notwithstanding, the 2000 reads dataset required more time than expected
(when compared with the linear relation of the other datasets) to be processed when using
Exploder 2 to generate the search keys (Figures 4.9 and 4.10). The algorithm responsible for
this outlier relies on the quality scores, producing three new keys for each position below the
cutoff, so we may infer that some of the reads of this set had lower quality scores, resulting in
more keys created, which leads the algorithm to inspect more locations of the reference genome
and potentially align the read to more locations. Because we did not preprocess the reads to
discard the ones with lower scores, we believe this is the case.
On the other hand, the scalability for the algorithms Exploders 00, 0, 2 and 4 is very similar
across the datasets, with the exception explained above (Figures 4.1 to 4.12); this could mean
that the best key used in Exploders 0, 2 and 4 simply corresponds, in our case, to the first 10
bases (the simple key). However, Exploder 4 only creates one new key for each position below
the cutoff value, i.e., it generates one third of the keys when compared with Exploder 2,
reducing the search space within the reference sequence.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.1: Runtime vs Number of Reads (NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the simple implementation of the Needleman-Wunsch method for the alignment. Results from Machine 1.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.2: Runtime vs Number of Reads (NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the simple implementation of the Needleman-Wunsch method for the alignment. Results from Machine 2.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.3: Runtime vs Number of Reads (NW plus SM). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the Needleman-Wunsch method enriched with a Similarity Matrix for the alignment. Results from Machine 1.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.4: Runtime vs Number of Reads (NW plus SM). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the Needleman-Wunsch method enriched with a Similarity Matrix for the alignment. Results from Machine 2.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.5: Runtime vs Number of Reads (GNUMAP-based NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the GNUMAP-based Needleman-Wunsch method for the alignment. Results from Machine 1.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.6: Runtime vs Number of Reads (GNUMAP-based NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the GNUMAP-based Needleman-Wunsch method for the alignment. Results from Machine 2.
[Bar chart: Runtime (Hour) per Exploder for 1000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.7: Runtime vs Exploder and Aligner Combination (1000 reads). Time, in hours, required to execute the pipeline for 1000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 1.
[Bar chart: Runtime (Hour) per Exploder for 1000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.8: Runtime vs Exploder and Aligner Combination (1000 reads). Time, in hours, required to execute the pipeline for 1000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
[Bar chart: Runtime (Hour) per Exploder for 2000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.9: Runtime vs Exploder and Aligner Combination (2000 reads). Time, in hours, required to execute the pipeline for 2000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 1.
[Bar chart: Runtime (Hour) per Exploder for 2000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.10: Runtime vs Exploder and Aligner Combination (2000 reads). Time, in hours, required to execute the pipeline for 2000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
[Bar chart: Runtime (Hour) per Exploder for 3000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.11: Runtime vs Exploder and Aligner Combination (3000 reads). Time, in hours, required to execute the pipeline for 3000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 1.
[Bar chart: Runtime (Hour) per Exploder for 3000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.12: Runtime vs Exploder and Aligner Combination (3000 reads). Time, in hours, required to execute the pipeline for 3000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
The Exploder 3 algorithm produces 2^k keys; therefore, since k equals 10, for each read
this algorithm returns 1024 search keys, resulting in greater increases between datasets
(Figures 4.1 to 4.6) and in a higher runtime than with the other key exploding algorithms
(Figures 4.7 to 4.12). The GNUMAP-based exploding algorithm creates as many keys as the
size of the read; our reads have 100 bases, meaning we get 100 keys per read, leading to a
significantly lower runtime when compared with the previous algorithm. Because Exploder 2
required less time than the GNUMAP-based algorithm, one can assume fewer keys were
generated by the former for all the reads of the datasets, even for the one with 2000 reads,
suggesting that only a few reads were responsible for the runtime spikes observed (Figures 4.9
and 4.10).
Concerning the algorithms for the Aligner module, Figures 4.7 to 4.12 show that the
combinations with the GNUMAP-based Needleman-Wunsch method take considerably less
time to execute the mappings. Moreover, the implementation of the method with the similarity
matrix (NW plus SM) slightly increases the runtime when compared to the simple
implementation, which could be due to the access time to the matrix. Some results from
Machine 1 (Figures 4.7, 4.9 and 4.11) contradict this tendency; however, since the results from
Machine 2 (Figures 4.8, 4.10 and 4.12) follow it, we may assume these outliers have a technical
reason related to the machine used. On the other hand, only a few of the implemented methods
differ between the versions of the Needleman-Wunsch algorithm; thus, we need further tests to
understand what makes the GNUMAP-based NW require a lower runtime.
Overall, our tool needs to have its scalability improved. The human genome has over 3
giga-base-pairs and billions of reads are produced by NGS platforms from a single sample; if
our read sets take such times to have their reads mapped to a comparatively small reference
sequence, then our read mapping pipeline has a narrow spectrum of utility. However, since the
prototype is CPU bound, distributing the work between more machines, as in a cloud computing
platform, may be a solution.
4.2 Coverage
As expected, since we used simulated data, the three datasets obtained 100% coverage, i.e., all
the reads were mapped to their original sequence, for each combination of search key explosion
algorithms (Exploder) and variants of the Needleman-Wunsch method (Aligner).
4.3 Precision
Though precision concerns mapping the reads to their true positions in the genome, we
analysed the performance of the various algorithmic combinations for the Exploder and
Aligner components of our tool considering:
• Multiple Locations: when a read was mapped to more than one location;
• Possible Locations: from the multiple locations found for a read, if more than one results
from an alignment with a score higher than 0.85, we consider these "extra" locations as
possible locations for the read. This aspect has greater significance when dealing with real
data, for which we do not know where each read originated; due to repetitive DNA
sequences, this is a major challenge in NGS data analysis;
• Best Location: if a read is mapped to one genomic position with a score over 0.85, then it
is the best location found; at best, within our simulated datasets this corresponds to the
original location of the read in the reference sequence;
• Incorrectly Mapped: the position(s) returned do not match the original one within the
reference sequence.
The results obtained for these parameters are summarised in Figures 4.13 to 4.15. Some
observations can be drawn from these figures: first, approximately 100% of the reads were
mapped to Multiple Locations regardless of the combination of algorithms executed. A relevant
number of reads within each dataset was mapped to a Best Location. Despite the multiple
locations result, a small proportion of reads was mapped to other Possible Locations.
                       Exp. 00  Exp. 0  Exp. 2  Exp. 3  Exp. 4  GNUMAP
NW
  Best Location            893     893     928     886     909     908
  Incorrectly Mapped        69      69      31      43      51       0
  Multiple Locations      1000    1000    1000    1000    1000    1000
  Possible Locations        36      36      38      70      37      89
NW plus SM
  Best Location            886     886     921     869     903     880
  Incorrectly Mapped        69      69      31      43      51       0
  Multiple Locations      1000    1000    1000    1000    1000    1000
  Possible Locations        43      43      46      88      44     118
GNUMAP-based NW
  Best Location            898     898     934     905     916     931
  Incorrectly Mapped        69      69      31      43      51       0
  Multiple Locations      1000    1000    1000    1000    1000    1000
  Possible Locations        25      25      26      45      25      60

Figure 4.13: Mapping Results for 1000 reads. The number of reads for which more than one location was found is under Multiple Locations; of these, Possible Locations are the ones that scored over 0.85 in the alignment; if only one location with a score higher than 0.85 was found, it is the Best Location. Incorrectly Mapped are the reads that did not map to their original position in the reference sequence. The columns correspond to the search key generation algorithms (Exploders 00, 0, 2, 3 and 4 and the GNUMAP-based algorithm); each block corresponds to one of our variants of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW). Results for the 1000 simulated reads dataset.
                       Exp. 00  Exp. 0  Exp. 2  Exp. 3  Exp. 4  GNUMAP
NW
  Best Location           1785    1785    1848    1786    1803    1837
  Incorrectly Mapped       138     138      73      97     120       0
  Multiple Locations      1997    1997    1997    2000    1997    2000
  Possible Locations        76      76      76     117      76     160
NW plus SM
  Best Location           1766    1766    1830    1751    1784    1796
  Incorrectly Mapped       138     138      73      97     120       0
  Multiple Locations      1997    1997    1997    2000    1997    2000
  Possible Locations        96      96      96     154      96     203
GNUMAP-based NW
  Best Location           1804    1804    1865    1810    1820    1881
  Incorrectly Mapped       138     138      73      97     120       0
  Multiple Locations      1997    1997    1997    2000    1997    2000
  Possible Locations        49      49      49      81      49     104

Figure 4.14: Mapping Results for 2000 reads. The number of reads for which more than one location was found is under Multiple Locations; of these, Possible Locations are the ones that scored over 0.85 in the alignment; if only one location with a score higher than 0.85 was found, it is the Best Location. Incorrectly Mapped are the reads that did not map to their original position in the reference sequence. The columns correspond to the search key generation algorithms (Exploders 00, 0, 2, 3 and 4 and the GNUMAP-based algorithm); each block corresponds to one of our variants of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW). Results for the 2000 simulated reads dataset.
4. Results and Discussion
[Bar charts: Number of Reads per Exploder and Aligner combination, in three panels (NW, NW plus SM, GNUMAP-based NW); series: Best Location, Incorrectly Mapped, Multiple Locations, Possible Locations.]
Figure 4.15: Mapping Results for 3000 reads. Reads for which more than one location was found are counted under Multiple Locations; of these, Possible Locations are those that scored over 0.85 at the alignment. If only one location with a score higher than 0.85 was found, it is the Best Location. The Incorrectly Mapped reads are those that did not map to their original position in the reference sequence. The horizontal axis represents each combination of the search key generation algorithms (Exploders 00, 0, 2, 3, and 4, and the GNUMAP-based algorithm) with our variants of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW). Results for the 3000 simulated reads dataset.
To clearly see the influence of each combination on the mapping aspects, the results reported in Figures 4.16 and 4.17 were plotted as the ratio of reads within each dataset for which possible locations and a best location, respectively, were found. Finally, the number of Incorrectly Mapped reads appears to be related only to the algorithm chosen for the Exploder component of the read mapping pipeline, as it remains equal across the different implementations of the Needleman-Wunsch method. Therefore, to see the rate of incorrectly mapped reads for each dataset, we present the results for the exploding keys algorithms in Figure 4.18.
[Charts: % Possible Locations (0% to 15%, vertical axis) vs Number of Reads (1000, 2000, 3000), in three panels (NW, NW plus SM, GNUMAP-based NW); series: Exploders 00, 0, 2, 3, and 4, and GNUMAP.]
Figure 4.16: Rate of Reads with other Possible Locations (%). Reads mapped to more than one location with an alignment score over 0.85 have other Possible Locations, despite having just one original position in the reference sequence. These results represent the rate of possible locations found with each combination of the search key generation algorithms with our implementations of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW).
[Charts: % Best Location (85% to 95%, vertical axis) vs Number of Reads (1000, 2000, 3000), in three panels (NW, NW plus SM, GNUMAP-based NW); series: Exploders 00, 0, 2, 3, and 4, and GNUMAP.]
Figure 4.17: Rate of Reads with a Best Location found (%). A Best Location was found for reads mapped to one location with an alignment score over 0.85. These results represent the rate of best locations found with each combination of the search key generation algorithms with the implementations of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW).
Mapping results depend on the reads in a dataset; however, from Figure 4.16 we can observe that the best key used by Exploder 0 leads to the same number of possible locations as the simple key returned by Exploder 00. Moreover, even when Exploder 4 assigned more than one key to a read, the extra keys did not improve the search for other locations within the reference sequence. With Exploder 2 we had little contribution from the generated keys to finding other locations for the (simulated) reads. Yet, producing keys with Exploder 3 and the GNUMAP-based algorithm seems to increase the proportion of mapped reads that may belong to more than one place, with the latter algorithm being the greatest contributor to this result, with an increase of up to 7%.
Furthermore, the number of reads mapped to more than one position with a relevant score depends on the version of the Needleman-Wunsch method chosen for the Aligner (Figure 4.16). For each dataset, the number of possible locations found with the method enriched with the similarity matrix (NW plus SM) almost doubles relative to the GNUMAP-based NW, reaching 12% of reads with other possible locations (the simple implementation, NW, falls in between). Since the alignment with a base similarity matrix is more tolerant to base mutations, it makes sense that this strategy found more possible locations for the reads. This is a particularly interesting result when dealing with real data that may contain nucleotide variations.
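To illustrate why a similarity-based scorer tolerates mutations, consider that transitions (A↔G, C↔T) are the most frequent point mutations, so they can be penalised less than transversions. The sketch below is purely illustrative; the class name and score values are not the actual matrix used in our NW plus SM implementation.

```java
// Hypothetical base-similarity scorer (illustrative values, not the thesis's
// actual matrix): transitions keep half credit, transversions score zero,
// so reads carrying common point mutations retain a high alignment score.
public class BaseSimilarity {
    private static boolean isPurine(char b) { return b == 'A' || b == 'G'; }

    /** Returns 1.0 for a match, 0.5 for a transition, 0.0 for a transversion. */
    public static double score(char a, char b) {
        if (a == b) return 1.0;
        // Same class (purine/purine or pyrimidine/pyrimidine) means a transition.
        return (isPurine(a) == isPurine(b)) ? 0.5 : 0.0;
    }
}
```

With such a scorer, a read differing from the reference only by transitions still reaches a score above the 0.85 threshold more easily than under a plain match/mismatch scheme.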
Meanwhile, more reads found their best location (Figure 4.17) with the GNUMAP-based NW, especially when combined with the GNUMAP-based exploding keys algorithm; for instance, this combination resulted in 94% of the 3000 reads set finding their best location. NW and NW plus SM led to higher results when combined with Exploder 2, with increases of 4%, but when discovering the best position for the reads within a dataset they performed a bit worse than the GNUMAP-based NW, the NW plus SM implementation being the least recommended for this task.
Figure 4.17 allows us to reflect further on the effect of expanding the search space within the reference sequence by assigning multiple keys to the reads. For instance, with Exploders 00 and 0 we get the same results, confirming that our best key lies within the first k bases, i.e., our simple key. Exploder 2 increases the chance of discovering a best genomic position for a read, and we can observe in Figure 4.17 a decrease for the dataset with 2000 reads for this algorithm and for Exploder 4. These two algorithms rely on quality scores to create new keys; thus, as mentioned previously, we can infer that some reads of the 2000 reads dataset have lower quality scores within the first 10 (our value for k) bases. Hence the increase in runtime when resorting to Exploder 2 to explode the keys for this dataset, compared to Exploders 00, 0, and 4. The GNUMAP-based exploding keys algorithm only outperforms Exploder 2 when combined with the GNUMAP-based NW, and only for the 2000 and 3000 reads datasets. Further tests with different datasets may allow conclusions about the ability of Exploder 2 to find the best location for a read when resorting to the GNUMAP-based NW for the alignment.
Generating new keys by switching each base of the best key according to base similarity, as we did with Exploder 3, did not do much to find the best location for a read, unless it was combined with the GNUMAP-based NW, in which case it performed a little better than relying on the first 10 bases alone; this effect is most noticeable in the results from the 3000 reads dataset.
[Chart: % Incorrectly Mapped (0% to 8%, vertical axis) vs Number of Reads (1000, 2000, 3000); series: Exploders 00, 0, 2, 3, and 4, and GNUMAP.]
Figure 4.18: Rate of Incorrectly Mapped Reads (%). The Incorrectly Mapped reads are those that did not map to their original position in the reference sequence. Since the number of incorrectly mapped reads seems to be related to the search key generation algorithm used, these results represent the rate for each Exploder implementation.
Finally, from Figure 4.18 we can observe that producing keys by passing a sliding window along a read and retrieving k-sized subsequences as keys, as we did with the GNUMAP-based algorithm, results in 100% precision, i.e., all the reads were mapped to their original positions. With Exploder 2 the ratio of Incorrectly Mapped reads increases, followed by Exploder 3 and then by Exploder 4. With these three algorithms we had a 1% increase in the rate for the dataset with 2000 reads, supporting our observation that this dataset has a higher number of reads with lower quality scores within the first 10 bases. Therefore, in terms of precision, too many keys cause worse results. Using Exploders 00 and 0, which rely only on the first 10 bases to search within the reference sequence, resulted in more incorrectly mapped reads when compared with the previous algorithms.
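The sliding-window idea behind the GNUMAP-based Exploder can be sketched as follows; the class and method names are illustrative, not our actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the GNUMAP-based exploder: slide a window of size k along the
// read and keep every k-sized subsequence as a search key, yielding
// (read length - k + 1) keys per read.
public class SlidingWindowExploder {
    public static List<String> explode(String read, int k) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i++) {
            keys.add(read.substring(i, i + k));
        }
        return keys;
    }
}
```

Because every position of the read contributes a key, a sequencing error or low-quality stretch in the first k bases no longer prevents the read from reaching its true genomic location, which is consistent with the 100% precision observed above.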
4.4 Escherichia coli UTI89
From the real data from which we obtained the Phred quality scores, we selected 2000 random reads to map against the sample's complete genome3, a sequence longer than 5 mega-base-pairs.
[Charts: Runtime in hours (vertical axis) per Exploder (horizontal axis) for NW, NW plus SM, and GNUMAP-based NW, with a close-up in minutes for Exploders 00, 0, 2, and 4.]
Figure 4.19: Runtime vs Exploder and Aligner Combination (E. coli UTI89). Time, in hours, required to execute the pipeline for 2000 real reads of E. coli UTI89; a close-up in minutes is included to better show the time variation within some of the combinations. Each combination of the search key generation algorithms (Exploder, horizontal axis) with the versions of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
Figure 4.19 compares the time required by each algorithm combination to map the real data; only Machine 2 was used to execute the tool for this dataset. Two aspects of the real data have to be taken into account when comparing with the simulated data: the reference genome is smaller than the M. musculus chromosome 19 sequence, and the read length ranges between 100 and 300 bases. Thus, when combined with the simple version of the Needleman-Wunsch method (NW) or the one with the similarity matrix (NW plus SM), the Exploder algorithms that just returned the simple or the best key (Exploders 00 and 0) or relied on base quality scores (Exploders 2 and 4) required very little time to execute. Yet, when creating the search keys with Exploder 3, the time required to map the 2000 real reads was almost as long as with the 2000 simulated reads (Figure 4.10), because for every read it explodes one search key into 210. And the GNUMAP-based algorithm, which generates almost as many search keys as the read has bases, almost doubled the runtime compared with the same amount of simulated data. However, performing the alignment with the GNUMAP-based NW leads to a significantly lower runtime regardless of the exploding keys algorithm. Moreover, these last combinations also required less time to map the real reads than the simulated ones, which may be due to the size of the E. coli genomic sequence.
3Escherichia coli UTI89, complete genome: http://www.ncbi.nlm.nih.gov/nuccore/91209055?report=fasta
From the close-up presented in Figure 4.19 we observe that Exploders 2 and 4 required more time than Exploders 00 and 0 to execute when combined with NW and NW plus SM, meaning that some reads have lower quality scores and, consequently, additional keys were produced (since Exploder 2 generates more search keys than Exploder 4 from the same positions, the runtime gap between the two algorithms follows). In addition, this confirms our assumption that reads with lower quality explain the spike seen in Figures 4.2 to 4.6 concerning Exploder 2.
We also analysed the coverage obtained and other parameters related to mapping; however, since with real data the goal is to find the (unknown) true location of the reads within the genome, precision could not be evaluated (Figure 4.20). In terms of coverage, increasing the number of search keys improves the chance of mapping every read, as we can see with Exploder 3 and the GNUMAP-based algorithm, which led to a coverage of 100%. Despite the number of multiple positions found, the tool only succeeded in finding the best and other possible locations for the reads when the GNUMAP-based algorithm was chosen for the Exploder component. As seen before (Figure 4.16), we obtained more reads with other possible positions with the NW plus SM strategy for the Aligner, reaching 11% of the read set; in this case, however, this strategy also improved the proportion of reads finding a best location, to 28%. Since these results were obtained considering a base similarity matrix, reads with genetic variations (such as single nucleotide polymorphisms (SNPs)) did not have their alignment scores penalised and may have been mapped to their true locations.
[Bar charts: Number of Reads per Exploder and Aligner combination, in three panels (NW, NW plus SM, GNUMAP-based NW); series: Best Location, Multiple Locations, Possible Locations, Mapped.]
Figure 4.20: Mapping Results for E. coli UTI89. Reads for which more than one location was found are counted under Multiple Locations; of these, Possible Locations are those that scored over 0.85 at the alignment. If only one location with a score higher than 0.85 was found, it is the Best Location. Since this is a real dataset, we also show the number of reads Mapped to the E. coli UTI89 genome. The horizontal axis represents each combination of the search key generation algorithms (Exploders 00, 0, 2, 3, and 4, and the GNUMAP-based algorithm) with our variants of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, with a Similarity Matrix; and GNUMAP-based NW). Results for 2000 real reads.
Concerning the results presented in Figure 4.20, we assume the key size (k) plays an important role when only a subsequence of the read is used to create new keys. A relevant proportion of reads were mapped to multiple locations, but since the alignment scores did not exceed the 0.85 threshold, virtually no read was considered effectively mapped to best and
possible locations. On the other hand, using the entire read to create new keys, as performed by the GNUMAP-based strategy, instead of just permuting the bases within a k-sized subsequence, helps to find locations with relevant scores. However, these 2000 reads were randomly selected from a bigger dataset; until we improve our tool's scalability and test the algorithms with all the data, we cannot draw conclusions regarding performance with real data.
Chapter 5
Conclusions and Future Work
To overcome the challenges brought by the data produced with NGS technologies, we developed the read mapping tool presented in Chapter 3. Our approach follows the 'seed and extend' paradigm: the reference genome is hashed, and multiple search keys for each read are used to find the candidate genomic locations. To find these locations, we implemented four algorithms that generate multiple search keys for one read, taking into account the Phred quality values and/or nitrogenous base similarity, and one that splits the read into overlapping subsequences of equal size, as in GNUMAP (Clement et al., 2010). For the extension of the seeds, we implemented a simple version of the Needleman-Wunsch method (Needleman and Wunsch, 1970) to align the read with a region of the genome, a version where a similarity matrix is used to score the matches between the sequences, and the variant used in GNUMAP (Clement et al., 2010).
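The core of the extension step, global alignment by dynamic programming, can be sketched as follows. This is a minimal illustration of the simple Needleman-Wunsch variant; the scoring values (match +1, mismatch -1, gap -1) are illustrative and not necessarily those used in our Aligner.

```java
// Minimal Needleman-Wunsch sketch: cell (i, j) of the DP table holds the best
// global alignment score of the first i read bases against the first j
// reference bases. Scoring constants are illustrative.
public class NeedlemanWunsch {
    static final int MATCH = 1, MISMATCH = -1, GAP = -1;

    public static int align(String read, String ref) {
        int n = read.length(), m = ref.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) dp[i][0] = i * GAP; // read prefix vs. nothing
        for (int j = 1; j <= m; j++) dp[0][j] = j * GAP; // nothing vs. ref prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = dp[i - 1][j - 1]
                        + (read.charAt(i - 1) == ref.charAt(j - 1) ? MATCH : MISMATCH);
                // Best of: substitute/match, gap in reference, gap in read.
                dp[i][j] = Math.max(sub,
                        Math.max(dp[i - 1][j] + GAP, dp[i][j - 1] + GAP));
            }
        }
        return dp[n][m];
    }
}
```

The NW plus SM variant differs only in replacing the fixed MATCH/MISMATCH constants with a lookup in the base similarity matrix.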
Based on a pipeline, our tool follows the paradigm of modular programming, enabling various algorithmic combinations to be plugged in between the Exploder, the component of the pipeline responsible for creating the search keys for the reads, and the module in which the alignment between the read and the reference sequence occurs (the Aligner). The possible combinations were evaluated in terms of scalability, coverage, and precision with simulated datasets of sizes ranging between 1000 and 3000 reads from Mus musculus, allowing us to infer that our tool cannot keep up with the throughput currently obtained with NGS platforms. However, it performed as expected in terms of coverage, having mapped all the reads from the datasets. The precision of our tool is highly related to the exploding keys algorithm used, and it was flawless with the GNUMAP-based one, which explodes the keys by passing a sliding window along a read and retrieving k-sized subsequences as keys.
Moreover, despite generating more search keys, thus increasing the search space within the reference sequence, Exploder 3 does not lead to better results when finding the best position for a read, nor other possible positions, meaning it requires too high a runtime for the results produced. Despite our read mapping pipeline requiring more time to execute when it resorts to the NW plus SM implementation for the Aligner component, it allowed more possible positions to be found for the reads, especially if combined with the GNUMAP-based exploding keys algorithm, with a 6% increase. Regarding finding the best location to map a read, the combination of the GNUMAP-based implementation of the Needleman-Wunsch method, to align the nucleotide sequences, with the GNUMAP algorithm, for the Exploder module, performed best; for instance, it led 94% of the 3000 reads set to discover it. Still, further tests with other datasets of different sizes would provide more understanding of the effect of Exploder 2 on this task when using that same alignment method.
To sum up, given the various combinations, relying on the GNUMAP-based approach for the Aligner module yields the best results in terms of scalability; if we also generate the keys with the GNUMAP algorithm, we get more reads finding their best position and better precision. By combining this exploding algorithm with the NW plus SM at the Aligner, more possible locations are found for the reads within each dataset. However, mapping reads to multiple locations can lead to false detection of genetic variations, due to repetitive DNA fragments in the reference genome. On the other hand, if we map every read within a set and report their multiple locations, we will have more certainty in the consensus sequence of the sample and in finding SNPs, for example.
We also mapped 2000 real reads from Escherichia coli UTI89, which has a smaller genome than Mus musculus, confirming our observations on scalability, although the read size has a clear influence on the time required to create the search keys with the GNUMAP-based algorithm. However, the best and other genomic locations for the reads were only discovered when the GNUMAP algorithm was chosen for the Exploder. The best results were obtained when it was combined with NW plus SM for the alignment, resulting in 28% of the reads finding their best position and 11% being mapped to other possible locations. As for the coverage, increasing the number of search keys with Exploder 3 and the GNUMAP-based strategy improved the proportion of mapped reads to 100%.
Our tool was implemented in Java, so it can run on all major operating systems, e.g., Windows, Linux, and Mac OS, with a Java runtime environment installed, and the source code is available in a public repository1.
5.1 Future Work
Further studies comparing the implemented versions of the Needleman-Wunsch method may explain the lower runtime of the GNUMAP-based one. Future work must include improving the scalability of our tool and mapping all the reads of an NGS dataset to a complete mammalian genome, such as that of Mus musculus or the human, which may require exploring other alignment options. An advantage of our modular approach is the simple implementation of new algorithms to perform specific tasks, like the alignment, without compromising the rest of the pipeline. This way we can then try the alignment algorithm proposed by Chakraborty and Bandyopadhyay (2013), FOGSAA (Fast Optimal Global Sequence Alignment Algorithm), a tree-based algorithm that claims to obtain the same results as the Needleman-Wunsch method, but much faster, and/or see the effect of the adaptive seeds strategy from the work of Kiełbasa et al. (2011). Once we have the scalability issue under control, we can investigate strategies to map longer reads, such as those promised by third generation sequencing technologies (Wang et al., 2013). Cloud computing can also improve scalability by allowing executions to span an arbitrary number of machines; for this, the Apache™ frameworks Hadoop® (The Apache Software Foundation, 2015b) and Spark™ (The Apache Software Foundation, 2015a) are strong candidate solutions.
Another improvement that we foresee is to store the read alignments against reference sequences in the Sequence Alignment/Map (SAM) format, a generic alignment format that supports short and long reads (up to 128 Mbp) produced by different sequencing platforms (Li et al., 2009a). Today, various aligners2 that read FASTQ files and assign the sequences to a position in a reference sequence output this simple and flexible format. The current definition of the SAM format is at http://samtools.github.io/hts-specs/SAMv1.pdf.
1https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
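For illustration, a SAM alignment line carries 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below builds one such line for a single-end read; the class name and field values are illustrative, not part of our current tool.

```java
// Sketch of a minimal single-end SAM record: the 11 mandatory fields of the
// SAMv1 specification, joined with tabs. RNEXT/PNEXT/TLEN are "*"/0/0 for
// single-end reads. Values are illustrative.
public class SamRecord {
    public static String format(String qname, int flag, String rname, int pos,
                                int mapq, String cigar, String seq, String qual) {
        return String.join("\t", qname, String.valueOf(flag), rname,
                String.valueOf(pos), String.valueOf(mapq), cigar,
                "*", "0", "0", seq, qual);
    }
}
```

Emitting records in this shape would let our tool's output feed directly into downstream SAM/BAM utilities.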
Paired-end reads, i.e., reads sequenced from both ends of the same DNA fragment, can be produced by a variety of sequencing protocols, with a preparation specific to a given sequencing technology (Treangen and Salzberg, 2012). The mapping of these reads requires a maximum distance between them, adding a constraint when finding their genomic locations; consequently, a repetitive read will be reliably mapped if its pair can be mapped unambiguously. Moreover, paired-end alignments outperform single-end alignments in terms of both sensitivity and specificity (Li and Homer, 2010). Hence, adapting our tool to paired-end reads would be a valuable improvement. Likewise, analysing data generated by SOLiD sequencers, i.e., color-space reads, would be an important extension to fulfil our goal of creating a tool able to map reads from every NGS platform. In SOLiD platforms, overlapping pairs of letters are read and given digits ranging from 0 to 3 to encode the colour calls (base transitions) (Rumble et al., 2009); to record these reads with their quality information, Color Space FASTQ (CSFASTQ) files were created (Cock et al., 2010). The reads can be converted into bases, as presented in FASTQ files, but performing the mapping in color space has advantages regarding error detection.
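The two-base encoding can be sketched as follows: with the standard 2-bit base codes (A=0, C=1, G=2, T=3), each colour call is the XOR of the codes of the two overlapping bases, so identical bases always give colour 0. This is a simplified sketch that omits the known primer base prefixing real SOLiD reads.

```java
// Sketch of SOLiD two-base (color-space) encoding: each overlapping base
// pair maps to a color 0-3, computed as the XOR of the bases' 2-bit codes
// (A=0, C=1, G=2, T=3). A read of length n yields n-1 color calls.
public class ColorSpace {
    private static int code(char b) { return "ACGT".indexOf(b); }

    /** Encodes a base-space read into its (length - 1) color calls. */
    public static String encode(String read) {
        StringBuilder colors = new StringBuilder();
        for (int i = 1; i < read.length(); i++) {
            colors.append(code(read.charAt(i - 1)) ^ code(read.charAt(i)));
        }
        return colors.toString();
    }
}
```

Because each base participates in two colour calls, a single sequencing error perturbs two adjacent colours while a true SNP changes them consistently, which is what gives colour-space mapping its error-detection advantage.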
Furthermore, the scope of applications would be broadened by adding to our tool an algorithm to map reads from sequencing coupled to bisulfite conversion (Bisulfite-seq), enabling genome-wide measurement of DNA methylation (Kunde-Ramamoorthy et al., 2014). And, although we noted our concern with mapping a read to multiple locations due to repetitive genomic regions, the analysis of data from sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) relies on finding regions enriched with reads; thus, mappers that require a read to be uniquely placed will not be up to this task (Newkirk et al., 2011). Therefore, for future work we could consider extending our pipeline to data from different sequencing techniques.
2http://seqanswers.com/wiki/SAM
Bibliography
Abu-Doleh, A., Saule, E., Kaya, K., and Çatalyürek, Ü. V. (2013). Masher: Mapping Long(er) Reads
with Hash-based Genome Indexing on GPUs. In Proceedings of the International Conference
on Bioinformatics, Computational Biology and Biomedical Informatics, page 341. ACM.
Adelson-Velskii, M. and Landis, E. M. (1963). An algorithm for the organization of information.
Technical report, DTIC Document.
Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., and Xie, X. (2012). Hobbes: optimized
gram-based methods for efficient read alignment. Nucleic Acids Research, 40(6):e41.
Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman,
J. O., Baker, C., Malig, M., Mutlu, O., et al. (2009). Personalized copy number and segmental
duplication maps using next-generation sequencing. Nature Genetics, 41(10):1061–1067.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment
search tool. Journal of Molecular Biology, 215(3):403–410.
Anderson, S. (1981). Shotgun DNA sequencing using cloned DNase I-generated fragments. Nuc-
leic Acids Research, 9(13):3015–3027.
Avery, O. T., MacLeod, C. M., and McCarty, M. (1944). Studies on the chemical nature of the
substance inducing transformation of pneumococcal types induction of transformation by a
desoxyribonucleic acid fraction isolated from pneumococcus type III. The Journal of Experi-
mental Medicine, 79(2):137–158.
Baeza-Yates, R. A. and Perleberg, C. H. (1992). Fast and practical approximate string matching.
In Combinatorial Pattern Matching, pages 185–192. Springer.
BioJava (2015). BioJava:CookBook3:FASTQ. <http://biojava.org/wiki/BioJava:CookBook3:
FASTQ#Convert_between_FASTQ_variants> Accessed 12.August.2015.
BioPerl (2015). FASTQ sequence format. <http://www.bioperl.org/wiki/FASTQ_sequence_
format> Accessed 12.August.2015.
Biopython (2015). SeqIO. <http://biopython.org/wiki/SeqIO#File_Formats> Accessed 12.Au-
gust.2015.
BioRuby (2015). Module: Bio::Sequence::QualityScore::Converter. <http://www.rubydoc.
info/github/aunderwo/bioruby/Bio/Sequence/QualityScore/Converter> Accessed 12.Au-
gust.2015.
Bohlander, S. K. (2013). ABCs of genomics. ASH Education Program Book, 2013(1):316–323.
Bravo, H. C. and Irizarry, R. A. (2010). Model-based quality assessment and base-calling for
second-generation sequencing data. Biometrics, 66(3):665–674.
Burrows, M. and Wheeler, D. J. (1994). A block-sorting loss-less data compression algorithm.
SRC Research Report, 124.
Chakraborty, A. and Bandyopadhyay, S. (2013). FOGSAA: Fast optimal global sequence align-
ment algorithm. Scientific Reports, 3.
Chen, S. L., Hung, C.-S., Xu, J., Reigstad, C. S., Magrini, V., Sabo, A., Blasiar, D., Bieri, T., Meyer,
R. R., Ozersky, P., et al. (2006). Identification of genes subject to positive selection in uro-
pathogenic strains of Escherichia coli: a comparative genomics approach. Proceedings of the
National Academy of Sciences, 103(15):5977–5982.
Chen, Y., Schmidt, B., and Maskell, D. L. (2013). A hybrid short read mapping accelerator. BMC
Bioinformatics, 14(1):67.
Chung, W.-C., Chen, C.-C., Ho, J.-M., Lin, C.-Y., Hsu, W.-L., Wang, Y.-C., Lee, D., Lai, F., Huang,
C.-W., and Chang, Y.-J. (2014). CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds
and Analyzing High-Throughput Sequencing Data with MapReduce. PLoS ONE, 9(6).
Clement, N. L., Snell, Q., Clement, M. J., Hollenhorst, P. C., Purwar, J., Graves, B. J., Cairns, B. R.,
and Johnson, W. E. (2010). The GNUMAP algorithm: unbiased probabilistic mapping of oli-
gonucleotides from next-generation sequencing. Bioinformatics, 26(1):38–45.
Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. (2010). The Sanger FASTQ file
format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic
Acids Research, 38(6):1767–1771.
Cohen, J. S. and Portugal, H. (1974). The Search for the chemical structure of DNA. Connecticut
Medicine, 38:551–557.
Collins, D. W. and Jukes, T. H. (1994). Rates of transition and transversion in coding sequences
since the human-rodent divergence. Genomics, 20(3):386–396.
Collins, F. S., Lander, E. S., Rogers, J., and Waterson, R. H. (2004). Finishing the euchromatic
sequence of the human genome. Nature, 431(7011):931–945.
Collins, F. S., Morgan, M., and Patrinos, A. (2003). The Human Genome Project: lessons from
large-scale biology. Science, 300(5617):286–290.
Collins, F. S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., Walters, L., et al. (1998). New
goals for the US Human Genome Project: 1998-2003. Science, 282(5389):682–689.
Crick, F., Barnett, L., Brenner, S., and Watts-Tobin, R. (1961). General nature of the genetic code
for proteins. Nature, 192:1227.
Crick, F. et al. (1970). Central Dogma of Molecular Biology. Nature, 227(5258):561–563.
Dahm, R. (2010). From discovering to understanding. EMBO reports, 11(3):153–160.
David, M., Dzamba, M., Lister, D., Ilie, L., and Brudno, M. (2011). SHRiMP2: sensitive yet prac-
tical short read mapping. Bioinformatics, 27(7):1011–1012.
Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. (2008). Substantial biases in
ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research,
36(16):e105–e105.
EMBOSS (2015). Sequence Formats. <http://emboss.sourceforge.net/docs/themes/
SequenceFormats.html> Accessed 12.August.2015.
Encodeproject.org (2015). ENCODE: Encyclopedia of DNA Elements. <https://www.
encodeproject.org/> Accessed 16.August.2015.
Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. In Found-
ations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390–398.
IEEE.
Ferragina, P. and Mishra, B. B. (2014). Algorithms in stringomics (i): Pattern-matching against
"stringomes". bioRxiv, page 001669.
Fonseca, N. A., Rung, J., Brazma, A., and Marioni, J. C. (2012). Tools for mapping high-
throughput sequencing data. Bioinformatics, page bts605.
Frampton, M. and Houlston, R. (2012). Generation of Artificial FASTQ Files to Evaluate the
Performance of Next-Generation Sequencing Pipelines. PLoS ONE, 7(11).
Frazer, K. A. (2012). Decoding the human genome. Genome Research, 22(9):1599–1601.
Freese, E. (1959). The difference between spontaneous and base-analogue induced mutations
of phage T4. Proceedings of the National Academy of Sciences of the United States of America,
45(4):622.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecu-
lar Biology, 162(3):705–708.
Green, E. D., Guyer, M. S., Institute, N. H. G. R., et al. (2011). Charting a course for genomic
medicine from base pairs to bedside. Nature, 470(7333):204–213.
Griffith, F. (1928). The significance of pneumococcal types. Journal of Hygiene, 27(02):113–159.
Hach, F., Hormozdiari, F., Alkan, C., Hormozdiari, F., Birol, I., Eichler, E. E., and Sahinalp, S. C.
(2010). mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods,
7(8):576–577.
Hach, F., Sarrafi, I., Hormozdiari, F., Alkan, C., Eichler, E. E., and Sahinalp, S. C. (2014). mrsFAST-
Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic
Acids Research, page gku370.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System technical
journal, 29(2):147–160.
Hatem, A., Bozdag, D., Toland, A. E., and Çatalyürek, Ü. V. (2013). Benchmarking short sequence
mapping tools. BMC Bioinformatics, 14(1):184.
Hershey, A. D. and Chase, M. (1952). Independent functions of viral protein and nucleic acid in
growth of bacteriophage. The Journal of General Physiology, 36(1):39–56.
Hieu Tran, N. and Chen, X. (2015). AMAS: optimizing the partition and filtration of adaptive
seeds to speed up read mapping. arXiv preprint arXiv:1502.05041.
Holtgrewe, M., Emde, A.-K., Weese, D., and Reinert, K. (2011). A novel and well-defined bench-
marking method for second generation read mapping. BMC Bioinformatics, 12(1):210.
Huang, Y.-F., Chen, S.-C., Chiang, Y.-S., Chen, T.-H., and Chiu, K.-P. (2012). Palindromic se-
quence impedes sequencing-by-ligation mechanism. BMC Systems Biology, 6(Suppl 2):S10.
Hyyrö, H. (2003). A bit-vector algorithm for computing Levenshtein and Damerau edit dis-
tances. Nordic Journal of Computing, 10(1):29–39.
Illumina, Inc. (2015). Sequencing Platform Comparison Tool. <https://www.illumina.com/
systems/sequencing-platform-comparison.html> Accessed 20.August.2015.
Jiang, H. and Wong, W. H. (2008). SeqMap: mapping massive amount of oligonucleotides to the
genome. Bioinformatics, 24(20):2395–2396.
Kemp, M. (2003). The Mona Lisa of modern science. Nature, 421(6921):416–420.
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., and Frith, M. C. (2011). Adaptive seeds tame gen-
omic sequence comparison. Genome Research, 21(3):487–493.
Kim, J., Li, C., and Xie, X. (2014). Improving read mapping using additional prefix grams. BMC
Bioinformatics, 15(1):42.
Kunde-Ramamoorthy, G., Coarfa, C., Laritsky, E., Kessler, N. J., Harris, R. A., Xu, M., Chen, R.,
Shen, L., Milosavljevic, A., and Waterland, R. A. (2014). Comparison and quantitative verific-
ation of mapping algorithms for whole-genome bisulfite sequencing. Nucleic Acids Research,
42(6):e43–e43.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar,
K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human gen-
ome. Nature, 409(6822):860–921.
Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature
Methods, 9(4):357–359.
Langmead, B., Schatz, M. C., Lin, J., Pop, M., and Salzberg, S. L. (2009a). Searching for SNPs with
cloud computing. Genome Biology, 10(11):R134.
Langmead, B., Trapnell, C., Pop, M., Salzberg, S. L., et al. (2009b). Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25.
Ledergerber, C. and Dessimoz, C. (2011). Base-calling for next-generation sequencing plat-
forms. Briefings in Bioinformatics, page bbq077.
Lee, W.-P., Stromberg, M. P., Ward, A., Stewart, C., Garrison, E. P., and Marth, G. T. (2014). MO-
SAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Map-
ping. PLoS ONE, 9:e90581.
Levene, P. and London, E. (1929). The structure of thymonucleic acid. Journal of Biological
Chemistry, 83(3):793–802.
Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals.
In Soviet Physics Doklady, volume 10, page 707.
Li, H. (2012). Exploring single-sample SNP and INDEL calling with whole-genome de novo as-
sembly. Bioinformatics, 28(14):1838–1844.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv preprint arXiv:1303.3997.
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics, 25(14):1754–1760.
Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics, 26(5):589–595.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G.,
Durbin, R., et al. (2009a). The Sequence Alignment/Map format and SAMtools. Bioinformat-
ics, 25(16):2078–2079.
Li, H. and Homer, N. (2010). A survey of sequence alignment algorithms for next-generation
sequencing. Briefings in Bioinformatics, 11(5):473–483.
Li, H., Ruan, J., and Durbin, R. (2008a). Mapping short DNA sequencing reads and calling vari-
ants using mapping quality scores. Genome Research, 18(11):1851–1858.
Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008b). SOAP: short oligonucleotide alignment pro-
gram. Bioinformatics, 24(5):713–714.
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009b). SOAP2: an
improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967.
Liu, C.-M., Wong, T., Wu, E., Luo, R., Yiu, S.-M., Li, Y., Wang, B., Yu, C., Chu, X., Zhao, K., et al.
(2012a). SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics,
28(6):878–879.
Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L., and Law, M. (2012b). Comparison of
Next-Generation Sequencing Systems. Journal of Biomedicine and Biotechnology, 2012.
Liu, Y., Popp, B., and Schmidt, B. (2014). CUSHAW3: Sensitive and Accurate Base-Space and
Color-Space Short-Read Alignment with Hybrid Seeding. PLoS ONE, 9(1).
Liu, Y. and Schmidt, B. (2012). Long read alignment based on maximal exact match seeds. Bioin-
formatics, 28(18):i318–i324.
Liu, Y., Schmidt, B., and Maskell, D. L. (2012c). CUSHAW: a CUDA compatible short read aligner
to large genomes based on the Burrows–Wheeler transform. Bioinformatics, 28(14):1830–
1837.
Luo, R., Wong, T., Zhu, J., Liu, C., Zhu, X., Leung, F. C., et al. (2013). SOAP3-dp: Fast, Accurate
and Sensitive GPU-Based Short Read Aligner. PLoS ONE, 8(5):e65632.
Manber, U. and Myers, G. (1993). Suffix arrays: a new method for on-line string searches. SIAM
Journal on Computing, 22(5):935–948.
Maxam, A. M. and Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the
National Academy of Sciences, 74(2):560–564.
McPherson, J. D. (2014). A defining decade in DNA sequencing. Nature Methods, 11(10):1003–
1005.
Mendel, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen des naturforschenden
Vereines in Brünn 4: 3, 44.
Metzker, M. L. (2010). Sequencing technologies—the next generation. Nature Reviews Genetics,
11(1):31–46.
Minoche, A. E., Dohm, J. C., Himmelbauer, H., et al. (2011). Evaluation of genomic high-
throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems.
Genome Biology, 12(11):R112.
Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic
programming. Journal of the ACM (JACM), 46(3):395–415.
NCBI (2015). SRA Toolkit Documentation. <http://www.ncbi.nlm.nih.gov/Traces/sra/?view=
toolkit_doc&f=fastq-dump> Accessed 12.August.2015.
Needleman, S. B. and Wunsch, C. D. (1970). A General Method Applicable to the Search for Sim-
ilarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology, 48(3):443–
453.
Newkirk, D., Biesinger, J., Chon, A., Yokomori, K., and Xie, X. (2011). AREM: Aligning Short
Reads from ChIP-Sequencing by Expectation Maximization. In Research in Computational
Molecular Biology, pages 283–297. Springer.
Nguyen, T., Shi, W., and Ruden, D. (2011). CloudAligner: A fast and full-featured MapReduce
based tool for sequence mapping. BMC Research Notes, 4(1):171.
Nirenberg, M., Leder, P., Bernfield, M., Brimacombe, R., Trupin, J., Rottman, F., and O'Neal, C.
(1965). RNA codewords and protein synthesis, VII. On the general nature of the RNA code.
Proceedings of the National Academy of Sciences of the United States of America, 53(5):1161.
Nobelprize.org (2015a). The Nobel Prize in Chemistry 1980. <http://www.nobelprize.org/
nobel_prizes/chemistry/laureates/1980/> Accessed 16.August.2015.
Nobelprize.org (2015b). The Nobel Prize in Physiology or Medicine 1962. <http://www.
nobelprize.org/nobel_prizes/medicine/laureates/1962> Accessed 20.March.2015.
Nordberg, H., Bhatia, K., Wang, K., and Wang, Z. (2013). BioPig: a Hadoop-based analytic toolkit
for large-scale sequence data. Bioinformatics, page btt528.
O|B|F (2015). Open Bioinformatics Foundation. <http://www.open-bio.org/> Accessed
12.August.2015.
O’Driscoll, A., Daugelaite, J., and Sleator, R. D. (2013). ‘Big data’, Hadoop and cloud computing
in genomics. Journal of Biomedical Informatics, 46(5):774–781.
Offit, K. (2014). Decade in review – genomics: A decade of discovery in cancer genomics. Nature
Reviews Clinical Oncology, 11(11):632–634.
Olson, M. (2010). HADOOP: Scalable, Flexible Data Storage and Analysis. IQT Quarterly, 1(3):14–18.
Onsongo, G., Erdmann, J., Spears, M. D., Chilton, J., Beckman, K. B., Hauge, A., Yohe, S., Scho-
maker, M., Bower, M., Silverstein, K. A., et al. (2014). Implementation of Cloud based Next
Generation Sequencing data analysis in a clinical laboratory. BMC Research Notes, 7(1):314.
Oracle Corporation (2015a). Class AbstractMap.SimpleEntry<K,V>. <http://docs.oracle.com/
javase/7/docs/api/java/util/AbstractMap.SimpleEntry.html> Accessed 27.August.2015.
Oracle Corporation (2015b). Class ArrayList<E>. <https://docs.oracle.com/javase/7/docs/api/
java/util/ArrayList.html> Accessed 27.August.2015.
Oracle Corporation (2015c). Class Class<T>. <http://docs.oracle.com/javase/7/docs/api/java/
lang/Class.html> Accessed 5.September.2015.
Oracle Corporation (2015d). Class Constructor<T>. <https://docs.oracle.com/javase/7/docs/
api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)> Accessed
28.August.2015.
Oracle Corporation (2015e). Class HashMap<K,V>. <http://docs.oracle.com/javase/7/docs/
api/java/util/HashMap.html> Accessed 28.August.2015.
Oracle Corporation (2015f). Class HashSet<E>. <http://docs.oracle.com/javase/7/docs/api/
java/util/HashSet.html> Accessed 2.September.2015.
Oracle Corporation (2015g). Class Object. <http://docs.oracle.com/javase/7/docs/api/java/
lang/Object.html> Accessed 5.September.2015.
Oracle Corporation (2015h). Class Properties. <http://docs.oracle.com/javase/7/docs/api/
java/util/Properties.html> Accessed 4.September.2015.
Oracle Corporation (2015i). Class StringBuffer. <http://docs.oracle.com/javase/7/docs/api/
java/lang/StringBuffer.html> Accessed 28.August.2015.
Oracle Corporation (2015j). Class TreeSet<E>. <http://docs.oracle.com/javase/7/docs/api/
java/util/TreeSet.html> Accessed 5.September.2015.
Oracle Corporation (2015k). Interface Comparable<T>. <http://docs.oracle.com/javase/7/
docs/api/java/lang/Comparable.html> Accessed 5.September.2015.
Oracle Corporation (2015l). Interface Runnable. <https://docs.oracle.com/javase/7/docs/api/
java/lang/Runnable.html> Accessed 4.September.2015.
Oracle Corporation (2015m). The JavaTM Tutorials - Abstract Methods and Classes. <https://
docs.oracle.com/javase/tutorial/java/IandI/abstract.html> Accessed 17.August.2015.
Oracle Corporation (2015n). The JavaTM Tutorials - Classes. <https://docs.oracle.com/javase/
tutorial/java/javaOO/classes.html> Accessed 17.August.2015.
Oracle Corporation (2015o). The JavaTM Tutorials - Creating and Using Packages. <https://docs.
oracle.com/javase/tutorial/java/package/packages.html> Accessed 28.August.2015.
Oracle Corporation (2015p). The JavaTM Tutorials - Lesson: A Closer Look at the "Hello World!"
Application. <https://docs.oracle.com/javase/tutorial/getStarted/application/#MAIN> Ac-
cessed 28.August.2015.
Oracle Corporation (2015q). The JavaTM Tutorials - Objects. <https://docs.oracle.com/javase/
tutorial/java/javaOO/objects.html> Accessed 17.August.2015.
Oracle Corporation (2015r). The JavaTM Tutorials - Thread Pools. <http://docs.oracle.com/
javase/tutorial/essential/concurrency/pools.html> Accessed 4.September.2015.
O’Rawe, J., Jiang, T., Sun, G., Wu, Y., Wang, W., Hu, J., Bodily, P., Tian, L., Hakonarson, H., John-
son, W. E., et al. (2013). Low concordance of multiple variant-calling pipelines: practical im-
plications for exome and genome sequencing. Genome Medicine, 5(3):28.
Pak, T. and Kasarskis, A. (2015). How next-generation sequencing and multiscale data analysis
will transform infectious disease management. Clinical Infectious Diseases, page civ670.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison.
Proceedings of the National Academy of Sciences, 85(8):2444–2448.
Pettersson, E., Lundeberg, J., and Ahmadian, A. (2009). Generations of sequencing technologies.
Genomics, 93(2):105–111.
Prober, J. M., Trainor, G. L., Dam, R. J., Hobbs, F. W., Robertson, C. W., Zagursky, R. J., Cocuzza,
A. J., Jensen, M. A., and Baumeister, K. (1987). A system for rapid DNA sequencing with fluor-
escent chain-terminating dideoxynucleotides. Science, 238(4825):336–341.
Quail, M. A., Smith, M., Coupland, P., Otto, T. D., Harris, S. R., Connor, T. R., Bertoni, A., Swerdlow,
H. P., and Gu, Y. (2012). A tale of three next generation sequencing platforms: comparison of
Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics, 13(1):341.
Reinert, K., Langmead, B., Weese, D., and Evers, D. J. (2015). Alignment of Next-Generation
Sequencing Reads. Annual Review of Genomics and Human Genetics.
Roberts, A., Feng, H., and Pachter, L. (2013). Fragment assignment in the cloud with eXpress-D.
BMC Bioinformatics, 14(1):358.
Roche Diagnostics Corporation (2015). 454 Products. <http://454.com/products/index.asp>
Accessed 20.August.2015.
Ross, J. S. and Cronin, M. (2011). Whole Cancer Genome Sequencing by Next-Generation Meth-
ods. American Journal of Clinical Pathology, 136(4):527–539.
Rumble, S., Lacroute, P., Dalca, A., Fiume, M., Sidow, A., and Brudno, M. (2009). SHRiMP: accur-
ate mapping of short color-space reads. PLoS Computational Biology, 5(5):e1000386.
Sanger, F., Air, G., Barrell, B., Brown, N., Coulson, A., Fiddes, J., Hutchison, C., Slocombe,
P., and Smith, M. (1977a). Nucleotide sequence of bacteriophage ϕX174 DNA. Nature,
265(5596):687–695.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977b). DNA sequencing with chain-terminating
inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467.
Schulz, M. H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., and Richard, H. (2014).
Fiona: a parallel and automatic strategy for read error correction. Bioinformatics, 30(17):i356–
i363.
Shendure, J. and Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology,
26(10):1135–1145.
Siragusa, E., Weese, D., and Reinert, K. (2013). Fast and accurate read mapping with approximate
seeds and multiple backtracking. Nucleic Acids Research, 41(7):e78.
Smith, A. D., Chung, W.-Y., Hodges, E., Kendall, J., Hannon, G., Hicks, J., Xuan, Z., and
Zhang, M. Q. (2009). Updates to the RMAP short-read mapping software. Bioinformatics,
25(21):2841–2842.
Smith, A. D., Xuan, Z., and Zhang, M. Q. (2008). Using quality scores and longer reads improves
accuracy of Solexa read mapping. BMC Bioinformatics, 9(1):128.
Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Dodd, C., Connell, C. R., Heiner, C., Kent,
S. B., and Hood, L. E. (1986). Fluorescence detection in automated DNA sequence analysis.
Nature, 321(6071):674–679.
Smith, T. F. and Waterman, M. S. (1981). Identification of Common Molecular Subsequences.
Journal of Molecular Biology, 147(1):195–197.
Staden, R. (1979). A strategy of DNA sequencing employing computer programs. Nucleic Acids
Research, 6(7):2601–2610.
The Apache Software Foundation (2015a). Apache SparkTM. <http://spark.apache.org/> Ac-
cessed 17.September.2015.
The Apache Software Foundation (2015b). ApacheTM Hadoop®. <http://hadoop.apache.org/>
Accessed 17.September.2015.
The New York Times (2007). Statement by James D. Watson. <http://www.nytimes.com/2007/
10/25/science/26wattext.html?_r=0> Accessed 20.August.2015.
Thermo Fisher Scientific Inc. (2015). SOLiD® Next-Generation Sequencing. <https://www.
thermofisher.com/pt/en/home/life-science/sequencing/next-generation-sequencing/
solid-next-generation-sequencing.html> Accessed 20.August.2015.
Treangen, T. J. and Salzberg, S. L. (2012). Repetitive DNA and next-generation sequencing: com-
putational challenges and solutions. Nature Reviews Genetics, 13(1):36–46.
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell,
M., Evans, C. A., Holt, R. A., et al. (2001). The sequence of the human genome. Science,
291(5507):1304–1351.
Wang, Q., Xia, J., Jia, P., Pao, W., and Zhao, Z. (2013). Application of next generation sequencing
to human gene fusion detection: computational tools, features and perspectives. Briefings in
Bioinformatics, 14(4):506–519.
Watson, J. D. (1990). The Human Genome Project: Past, Present, and Future. Science,
248(4951):44–49.
Watson, J. D. and Crick, F. H. C. (1953a). Molecular Structure of Nucleic Acids. Nature,
171(4356):737–738.
Watson, J. D. and Crick, F. H. C. (1953b). The structure of DNA. In Cold Spring Harbor Symposia
on Quantitative Biology, volume 18, pages 123–131. Cold Spring Harbor Laboratory Press.
Weese, D., Emde, A.-K., Rausch, T., Döring, A., and Reinert, K. (2009). RazerS—fast read mapping
with sensitivity control. Genome Research, 19(9):1646–1654.
Weese, D., Holtgrewe, M., and Reinert, K. (2012). RazerS 3: faster, fully sensitive read mapping.
Bioinformatics, 28(20):2592–2599.
Wetterstrand, K. A. (2015). DNA Sequencing Costs: Data from the NHGRI Genome Sequencing
Program (GSP). <www.genome.gov/sequencingcosts> Accessed 15.July.2015.
Wiewiórka, M. S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., and Okoniewski,
M. J. (2014). SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data
analysis with nucleotide precision. Bioinformatics, 30(18):2652–2653.
Wilton, R., Budavari, T., Langmead, B., Wheelan, S. J., Salzberg, S. L., and Szalay, A. S. (2015).
Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-
extend search space. PeerJ, 3:e808.
Yang, X., Chockalingam, S. P., and Aluru, S. (2013). A survey of error-correction methods for
next-generation sequencing. Briefings in Bioinformatics, 14(1):56–66.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark: cluster
computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in
cloud computing, volume 10, page 10.
Zhang, J., Chiodini, R., Badr, A., and Zhang, G. (2011). The impact of next-generation sequencing
on genomics. Journal of Genetics and Genomics, 38(3):95–109.
Zhao, G., Ling, C., and Sun, D. (2015). SparkSW: Scalable Distributed Computing System for
Large-Scale Biological Sequence Alignment. In Cluster, Cloud and Grid Computing (CCGrid),
2015 15th IEEE/ACM International Symposium on, pages 845–852. IEEE.