UNIVERSIDADE DE LISBOA
FACULDADE DE CIÊNCIAS
DEPARTAMENTO DE INFORMÁTICA
Development of a Scalable, Precise, and
High-Coverage Genomics Mapping Tool for NGS
Natacha Alexandra Pinheiro Leitão
Supervised by Professor Doutor Francisco José Moreira Couto
and Professor Doutor João Carlos Antunes Leitão
MASTER'S DEGREE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY
Specialisation in Bioinformatics
Dissertation
2015
"Science, my lad, is made up of mistakes, but they are mistakes which it is useful
to make, because they lead little by little to the truth."
- Jules Verne in "A Journey to the Center of the Earth"
Acknowledgements

First of all, I must thank Professor Doutor Francisco Couto, who not only agreed to supervise my master's work, but also inspired me to take risks, to put my capacity for learning and adaptation to the test, and to develop my sense of research. To Doutor João Leitão, whose co-supervision was fundamental to the completion of this project, I am grateful for his companionship and infinite patience. And, of course, I am very grateful to both for the motivation, confidence and availability with which they guided me through this journey which, although long and (at times) frustrating, taught me so much. I believe I will remember much of what they told me, and of what I learned from them, for the rest of my life.

I thank my family for their unconditional love, support and pride in me, for believing in my success regardless of my choices, and for having allowed and motivated me to get this far and to keep going further. I especially thank: my mother, for every tear she dried, for the hugs full of affection, above all in the moments when my confidence failed me (almost completely), and for the space and time she gave me; my sister Sofia, for putting up with me from near and far, for taking me out to unwind and have fun (as only she can), for helping me chase my goals, and for being "THE" older sister; my father, who even from a distance was present, for all the conversations full of affection, and for having taught me that "the machine is always right" (I grew up hearing this, and it helped me a great deal in this challenge); my grandmother Celeste, for her visit and for her company on my trip to Porto to present my work, and to whom I apologise for not having been more available; my godfather, for his affection and concern, and for the enlightening conversations about my past, present and future options and choices. And I could not fail to be grateful for the company of my cats, Milo and Gôdo, for not giving up on my affection, so often sacrificed (lately) to my unavailability and deprived of the pampering they deserve.

To my friends: I am sorry, I will find another drama to pester you with, but thank you, from the bottom of my heart, for helping me get through this one. I am especially grateful: to Cristina, for all her patience and time, for the lectures I needed to hear, and for loving me enough to keep putting up with me despite my "storms in a teacup"; to Flávia, for her company throughout the master's, inside and outside the faculty, and for all the words of support and motivation during my existential crises (whatever their cause); to Lara, for being my "rubber ducky", for her company during breaks filled with despair or with conversations about everything and nothing, for also having gone to Braga, and for letting me drag her along on walks; to Bruno, for helping me with my programming doubts, for the many "just finish it!" I heard so often, and for enduring my endless chatter by proxy; to Ana Cláudia, for patiently listening to me explain the project (and its problems), for the times she forced me out of my bubble to clear my head, and for all the affection she still has for me; to Mariana, for her friendship and support, which never wavered even when time was short, and for reading my essay-length text messages; to Raquel, for the walks and the excuses to escape the laboratory and my problems; to Ana Marta, for letting me take refuge in the C8 laboratory (and wash the glassware), for listening to me ramble while she worked, and for the conversations about cats (or, as we see them, our non-human children); to Rafael, for the mutual support in our shared despair; to Cíntia, for her availability, patience and help during my existential and self-confidence crises while writing; to Filipa, for her admiration and for motivating me, practically from the first day of our friendship, to finish the project because she wanted to attend my master's defence; and to Hugo and Ioana, for clarifying my doubts about NGS data analysis.

Finally, I thank my colleagues at XLDB, Cátia Pesquita, Hugo Bastos and João Ferreira, for welcoming me; Pedro Gonçalves and D. Sandra Crespo for their availability and kindness; my colleagues in laboratory 6.3.30 for their unexpected conversations, which so often broke the monotony; and the members of the Computer Systems group at NOVA-LINCS for making me feel so welcome on every visit to their department.

N. P. L.
Sintra, 25 September 2015.
Summary

Deoxyribonucleic acid (DNA) is one of the biological macromolecules best known to society, and it remains a major target of research. In the early 1990s, the Human Genome Project was launched with the goal of sequencing the information contained in DNA. Thirteen years later, and fifty years after Watson and Crick revealed the double-helix structure of DNA, the first human genome sequence was presented almost in its entirety; the Human Genome Project had come to an end, but not without changing the biological sciences and biomedical research. The development of sequencing technologies and the availability of genomic sequences of model organisms, beyond the human one, turned resequencing into a widely used method for reading the information stored in the genome. Nowadays, next-generation sequencing (NGS) technologies allow the fast, low-cost production of billions of short raw DNA fragments, usually referred to as reads, and an important step in the analysis of these data is aligning the reads to a reference genome sequence to determine where they belong, that is, mapping them.

The process of mapping this vast quantity of short DNA fragments, each a few hundred bases long, to a genomic sequence that is often very long (for example, the human genome has over 3 billion base pairs) is computationally expensive. Moreover, in this process it is essential to distinguish between technical sequencing errors and naturally occurring genetic variations (which sometimes lead to disease) in the sampled subject. To meet this challenge, many tools have been developed using different algorithmic approaches, some of which include the quality information associated with each sequenced base, which reports the probability of that base being wrong.

One of the methods used in mapping involves searching the reference genomic sequence for a subsequence of a read; then, the whole read is aligned against the corresponding region of the genome. In this work, this matching is based on a hash table, a data structure that associates search keys with values, which stores short subsequences of the genome and their corresponding positions in the sequence, working as an index. Several algorithms for creating the search keys (subsequences) for each read were implemented in Java; the main idea is that by associating more than one key with each read we increase the chance of finding the location it belongs to, and consequently of mapping the whole dataset.

Accordingly, our read mapping solution is based on the modular programming paradigm, in which each module is responsible for one part of a series of tasks, two of which stand out: search key creation and alignment. In the creation of search keys, our algorithms take into account the similarity between DNA bases and/or the quality values associated with the bases that compose the read; starting from a subsequence, swapping bases with the remaining ones generates new search keys. We also implemented a method from the literature that divides the read into equal, overlapping pieces.

For the alignment, three versions of the Needleman-Wunsch method were implemented, a dynamic programming algorithm designed for the global alignment of biological sequences, which accounts for insertions and deletions of bases in the sample genome relative to the reference. The alignment between the two sequences is given a score that measures their similarity; thus, in a simple implementation there are only two situations: either the base of the read matches the base of the genome, or it does not and a penalty is applied. When we resort to a base similarity matrix, finding the same base in the alignment of the sequences yields the maximum score, structurally similar bases yield a lower score, and no match at all results in a penalty. Finally, we implemented a version based on one from the literature, which includes the probability of each of the four bases being the correct one in the score computed against the genomic sequence, and in which insertions and deletions (which lead to a mismatch, that is, a gap between the sequences) carry a heavier penalty.

Since we have several algorithms for key creation and for alignment, one advantage of our modular approach is being able to experiment with different combinations of them. The possible combinations were tested with artificial datasets, in which the reads were taken from known positions of a reference sequence, built with real quality values and with simulation of the most common technical sequencing errors. The evaluation of the results covered execution time (scalability), whether all reads of the dataset were mapped (coverage), and whether all reads were mapped to the correct location of the reference genome (precision). Regarding the last parameter, we also considered whether a read was mapped to more than one position, and whether it was mapped to one or more possible locations with a relevant alignment score, that is, an alignment resulting from a match above 85%. Considering several mapping positions for a read is an important aspect: on the one hand, the number of DNA fragments repeated throughout the genomes of several species is a problem; on the other hand, some NGS protocols depend on the number of reads mapped to a location (such as ChIP-Seq). However, incorrect mapping can lead to errors in the subsequent steps of data analysis, such as the false detection of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs). Some tools were built with precision in mind, returning the best location for each read and discarding the remaining candidates. Others, however, focused on the detection of SNPs and single nucleotide variations (SNVs), take the multiple mapping locations into account, with the set of bases mapped to a given position conferring a degree of certainty.

Finally, from the evaluation of our read mapping prototype we concluded that scalability must be improved so that the tool can be applied to real datasets of considerably larger dimension than those tested. Since we used artificial data, coverage was, as expected, total: all reads were mapped to the reference sequence. The search keys corresponding to overlapping pieces of the reads led to perfect precision (100% of the reads of the simulated datasets were mapped to their location of origin in the reference sequence) and to more reads finding their best position, up to about 94% of a dataset. The version of the Needleman-Wunsch alignment method enriched with a base similarity matrix leads more reads to discover other possible locations, reflected in an increase of up to 7%, by accepting nucleotide variations in the alignment.

Real reads of Escherichia coli UTI89 were mapped to its genome, which allowed us to confirm our observations on scalability. However, despite the results obtained with artificial data, with this dataset only the search keys created from overlapping pieces made it possible for the respective reads to find the best and other possible locations in the genome. These results improved when that algorithm was combined with the version of the Needleman-Wunsch alignment method enriched with a base similarity matrix, leading 28% of the reads to find their best position and 11% to find other possible positions. Furthermore, the larger the number of keys associated with each read, the larger the number of mapped reads, resulting in a coverage of 100%.

Future work should include improving scalability (possibly with cloud computing solutions), saving the mapping results to a SAM-format file, and adapting the tool to paired-end reads, whose mapping requires a maximum distance between the mates and is therefore more reliable. Additionally, the modular nature of our prototype allows experimenting with other algorithms for the tasks of creating search keys and aligning the sequences, and extending the tool with other functions specific to the NGS application that produced the data (for example, Bisulfite-seq).

The implementation code is available in a public repository1.

Keywords

DNA; NGS technologies; Algorithms; Read mapping
1https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
Abstract
Mapping is a computationally expensive process, because it involves aligning a large amount of reads, each a few hundred bases long, to a long reference genome (e.g., the human genome has over 3 billion base pairs). Moreover, a major challenge is to distinguish technical sequencing errors from biological variations that may occur in the sample. The work presented in this thesis aims to face the mapping challenges by developing a tool that explores and enhances hash-based approaches: it increases the search space over the reference genome by generating multiple keys for each read, taking into account quality information and/or biological constraints in the alignment. These key-generating algorithms were combined with different read alignment strategies based on the Needleman-Wunsch method in a read mapping pipeline.

Finally, we evaluated our prototype with simulated datasets regarding scalability — the time required for execution —, coverage — the percentage of reads that are effectively mapped — and precision — mapping reads to the correct location in the reference genome. Although much work remains in terms of scalability, all the algorithmic combinations led to perfect coverage of the simulated datasets. As for precision, we observed that generating multiple keys by dividing the reads into overlapping pieces is the best approach, leading to 100% of the reads being mapped at their original location. On the other hand, relying on a base similarity matrix to perform the alignment led to more reads discovering other possible locations, resulting in a 7% increase; this is a particularly interesting result when dealing with real datasets because of the repetitive DNA sequences and genetic variations that may occur within the genome. We also mapped real reads of Escherichia coli UTI89 to its genome sequence, which allowed us to confirm the observations about scalability and to realise that this algorithmic combination is better suited to finding the best and other possible locations for the reads within the genome, as shown by the 28% and 11% of reads obtained for each task, respectively. Moreover, by assigning more than one key to each read we improved the coverage to 100%.
Keywords
DNA; NGS technologies; Algorithms; Read mapping
Contents
List of Figures

1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Contributions and Results
1.4 Overview

2 Related Work
2.1 From DNA Discovery to the Human Genome Sequence
2.2 Next-Generation Sequencing (NGS) Technologies
2.2.1 Comparison Between NGS Platforms
2.2.2 Errors and Biases
2.2.3 Third Generation Sequencing Technologies
2.2.4 The FASTQ File Format
2.3 Survey of Read Mapping Algorithms
2.3.1 Algorithms based on Hash Tables
2.3.2 Algorithms based on Burrows-Wheeler Transform (BWT)
2.3.3 Best-mapper vs All-mapper
2.4 Genomics meets Cloud Computing
3 Read Mapping Pipeline
3.1 Architecture
3.2 Implementation
3.2.1 Multithreading
3.2.2 Genome
3.2.3 Read
3.2.4 Probabilities
3.2.5 Exploder
3.2.6 Aligner
3.2.7 Combiner
3.2.8 Abstract Classes Instantiation

4 Results and Discussion
4.1 Scalability
4.2 Coverage
4.3 Precision
4.4 Escherichia coli UTI89

5 Conclusions and Future Work
5.1 Future Work

Bibliography
List of Figures
1.1 Cost per Raw Megabase of DNA Sequence.
2.1 Pairing of nucleotide bases.
2.2 Workflow of the Sanger sequencing method versus second-generation sequencing.
2.3 Extract from a file in FASTQ format.
3.1 Read Mapping Pipeline scheme.
3.2 Example of a .properties file content.
3.3 Sliding window.
3.4 Scheme of the algorithm for the best key explosion.
3.5 Definition of transition and transversion.
3.6 Similarity Matrices.
4.1 Runtime vs Number of Reads (NW). Results from Machine 1.
4.2 Runtime vs Number of Reads (NW). Results from Machine 2.
4.3 Runtime vs Number of Reads (NW plus SM). Results from Machine 1.
4.4 Runtime vs Number of Reads (NW plus SM). Results from Machine 2.
4.5 Runtime vs Number of Reads (GNUMAP-based NW). Results from Machine 1.
4.6 Runtime vs Number of Reads (GNUMAP-based NW). Results from Machine 2.
4.7 Runtime vs Exploder and Aligner Combination (1000 reads). Results from Machine 1.
4.8 Runtime vs Exploder and Aligner Combination (1000 reads). Results from Machine 2.
4.9 Runtime vs Exploder and Aligner Combination (2000 reads). Results from Machine 1.
4.10 Runtime vs Exploder and Aligner Combination (2000 reads). Results from Machine 2.
4.11 Runtime vs Exploder and Aligner Combination (3000 reads). Results from Machine 1.
4.12 Runtime vs Exploder and Aligner Combination (3000 reads). Results from Machine 2.
4.13 Mapping Results for 1000 reads.
4.14 Mapping Results for 2000 reads.
4.15 Mapping Results for 3000 reads.
4.16 Rate of Reads with other Possible Locations.
4.17 Rate of Reads with a Best Location found.
4.18 Rate of Incorrectly Mapped Reads.
4.19 Runtime vs Exploder and Aligner Combination (E. coli UTI89).
4.20 Mapping Results for E. coli UTI89.
Chapter 1
Introduction
The beginning of the 21st century was marked by the "essentially complete" human genome sequence (Collins et al., 2004), which triggered a rapid evolution of sequencing technologies (McPherson, 2014). This brought many challenges to bioinformatics, mostly due to the availability of an increasing amount of data at decreasing costs (Figure 1.1). From software development for de novo assembly or sequence alignment to the design of new data structures (Ferragina and Mishra, 2014), not to mention the solutions offered by cloud computing and big data technologies (O'Driscoll et al., 2013), today's bioinformaticians have a lot to explore, and to improve, before the future arrives.
One feature inherent to next-generation sequencing (NGS) technologies is the fast production of billions of raw short contiguous DNA fragments, usually called reads. With the availability of model organisms' genome sequences, particularly the human genome sequence, an important step of NGS data analysis is the mapping process, i.e., aligning each read to a known reference genome so as to determine its location.
Figure 1.1: Cost per Raw Megabase of DNA Sequence. The cost to determine one megabase (a million bases, Mb) of raw, unassembled sequence data. Values from 2001 through October 2007 represent the cost of generating DNA sequence using Sanger-based chemistries and capillary-based instruments. Beginning in January 2008, the data represent the cost of generating DNA sequence using next-generation sequencing (NGS) platforms. A hypothetical line reflecting Moore's Law is shown for comparison, since technology improvements that keep pace with Moore's Law projections are considered to be doing very well. Data from Wetterstrand (2015).
1.1 Motivation
Mapping reads of a few hundred base pairs (bp) is a computationally expensive process, since reference sequences may span billions of bp; for instance, the human genome has over 3 billion bp. Moreover, repetitive DNA sequences, which are common and abundant in the genomes of many species, lead to the mapping of a single read to multiple locations, creating technical challenges that may result in errors and biases in downstream analysis. These imprecise results may, in particular, lead to false inferences of single nucleotide polymorphisms (SNPs) and copy number variants (CNVs) (Treangen and Salzberg, 2012).
On the other hand, despite the undoubted impact of NGS technologies, these platforms produce a vast amount of data that demands storage solutions. Additionally, the fact that each platform differs in features such as read length, data format, and sequencing method affects the methodologies that should be employed in their analysis (Zhang et al., 2011).

Thus, many mappers — i.e., read mapping software — have been developed, relying on different approaches. However, many of these solutions have limitations related to scalability — the time required to execute the mapping — when based on the Burrows-Wheeler transform, or to memory footprint when based on a hashing method (Lee et al., 2014; Hatem et al., 2013). Furthermore, a mapping tool has to take into account coverage — the percentage of reads that are effectively mapped — and precision — mapping reads to the correct location in the reference genome — so as to obtain the best depth — the number of reads covering a given locus of the genome.
Due mostly to the limitations of current technologies, which generate reads with sequencing errors (e.g., base miscalls), a major challenge is the ability to distinguish between technical errors and biological variations present in the sequenced sample. Hence, if every read is mapped, and each of them is correctly mapped to a location, we will have more certainty in the consensus sequence of the sample, which is of extreme importance when detecting genetic variants (like SNPs or single nucleotide variations (SNVs)) relative to the reference genome (O'Rawe et al., 2013).
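The per-base reliability that informs this distinction is usually expressed as a Phred quality score, encoded per base in FASTQ files (Section 2.2.4). As a brief illustration, a hypothetical helper, not code from the thesis, converting a quality character under the common Phred+33 encoding into the probability that the base call is wrong, P = 10^(-Q/10):

```java
// Hypothetical helper (not the thesis implementation): decodes a
// Sanger/Illumina 1.8+ FASTQ quality character (Phred+33 encoding)
// into the probability that the base call is wrong.
public class PhredQuality {
    // Phred score Q is the character's ASCII code minus the offset 33.
    static int phred(char c) { return c - 33; }

    // Error probability P = 10^(-Q/10).
    static double errorProbability(char c) { return Math.pow(10, -phred(c) / 10.0); }

    public static void main(String[] args) {
        System.out.println(phred('I'));            // 'I' (ASCII 73) encodes Q40
        System.out.println(errorProbability('I')); // Q40 means 1 error in 10,000 calls
    }
}
```

A base with a high Phred value that disagrees with the reference is thus more plausibly a biological variation than a sequencing error.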
1.2 Objectives
The work presented in this thesis aims at developing a mapping tool with high coverage and precision by exploring and enhancing hash-based approaches: the search space over the reference genome is increased by generating multiple keys for each read, taking into account quality information and/or biological constraints. Therefore, despite existing sequencing errors, a read will generate several keys, which translate directly into multiple locations in the reference genome to be searched, significantly improving the chances of finding the right location. The mapping becomes definitive after a positive alignment of the read with the genome, using algorithms based on the Needleman-Wunsch method (Needleman and Wunsch, 1970). This strategy, however, aims at finding the right balance between the number of locations that are effectively searched and the precision and coverage achieved by the mapping solution.
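For reference, the alignment step can be sketched as a minimal Needleman-Wunsch global alignment scorer. This is an illustrative sketch with arbitrary match/mismatch/gap parameters, not the scoring scheme actually used by the tool:

```java
// Minimal Needleman-Wunsch global alignment scoring sketch.
// MATCH, MISMATCH and GAP are illustrative values, not the tool's parameters.
public class NeedlemanWunsch {
    static final int MATCH = 1, MISMATCH = -1, GAP = -2;

    // Fills the dynamic-programming matrix m, where m[i][j] holds the best
    // score for aligning the first i bases of a with the first j bases of b,
    // and returns the optimal global alignment score.
    public static int score(String a, String b) {
        int[][] m = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) m[i][0] = i * GAP; // all-gap prefix of b
        for (int j = 1; j <= b.length(); j++) m[0][j] = j * GAP; // all-gap prefix of a
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = m[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                // Best of: align both bases, gap in b, or gap in a.
                m[i][j] = Math.max(diag, Math.max(m[i - 1][j] + GAP, m[i][j - 1] + GAP));
            }
        }
        return m[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(score("GATTACA", "GATTACA")); // prints 7 (all matches)
        System.out.println(score("GATTACA", "GACTACA")); // prints 5 (one mismatch)
    }
}
```

The variants discussed later replace the fixed MATCH/MISMATCH constants with a base similarity matrix, or with per-base probabilities as in GNUMAP.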
Another goal of this work is to develop a user-friendly tool, where the best results are obtained with fewer parameters required to be set by the user, while remaining independent of the sequencing technology platform.
1.3 Contributions and Results
The approach implemented, coded in Java, takes advantage of modular programming to enable,
in a simple way, the user to plug different algorithms responsible for generating the keys used in
the sequencing process, as well as the read alignment strategy. This allows to study the influence
of such algorithms in the mapping and to further extend the tool with additional algorithms to
pursue the best combination of the keys generation and read alignment algorithms. The source
code is available in a public repository1.
In the end, this work contributes:

• A read mapping pipeline in which the reference genome is hashed and multiple search keys for each read are used to find the candidate genomic locations;

• Four algorithms to generate multiple search keys for one read, which take into account the Phred quality values and/or nitrogenous base similarity;

• A simple implementation of the Needleman-Wunsch method (Needleman and Wunsch, 1970) to align two nucleotide sequences, and another in which a similarity matrix is used to score the matches between the sequences;

• A Java version of the sliding window algorithm to retrieve search keys from a read, and of the variant of the Needleman-Wunsch method implemented in GNUMAP (Clement et al., 2010);

• A tool that allows combining these different algorithms to generate keys and to align a read to a reference sequence.
1https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
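To make the hashed-reference, multiple-key idea concrete, the following is a minimal, hypothetical sketch (illustrative class and method names, not the pipeline's actual ones) of indexing a genome's k-mers in a hash table and probing it with overlapping search keys taken from a read:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not the thesis code): hash every k-mer of the
// reference to its start positions, then probe the table with several
// overlapping keys from a read to collect candidate mapping locations.
public class MultiKeyLookup {
    // Builds the index: each k-mer of the genome mapped to its start positions.
    static Map<String, List<Integer>> index(String genome, int k) {
        Map<String, List<Integer>> table = new HashMap<>();
        for (int i = 0; i + k <= genome.length(); i++) {
            table.computeIfAbsent(genome.substring(i, i + k), s -> new ArrayList<>()).add(i);
        }
        return table;
    }

    // Sliding window: overlapping k-mers of the read, each used as a search key.
    static List<String> keys(String read, int k, int step) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i += step) out.add(read.substring(i, i + k));
        return out;
    }

    // Candidate read start positions: each key hit in the table, shifted back
    // by the key's offset inside the read.
    static Set<Integer> candidates(Map<String, List<Integer>> table, String read, int k, int step) {
        Set<Integer> hits = new LinkedHashSet<>();
        int offset = 0;
        for (String key : keys(read, k, step)) {
            for (int pos : table.getOrDefault(key, List.of())) hits.add(pos - offset);
            offset += step;
        }
        return hits;
    }

    public static void main(String[] args) {
        String genome = "ACGTACGTTAGC";
        // Read with one simulated sequencing error in its last base
        // (ACGTTAGG vs the reference's ACGTTAGC at position 4): the
        // error-free keys still hit, so the true location is recovered.
        System.out.println(candidates(index(genome, 4), "ACGTTAGG", 4, 2)); // prints [0, 4]
    }
}
```

This is exactly why generating several keys per read helps: even when one key is corrupted by a sequencing error, the others still reach the hash table, and the surviving candidates are then confirmed (or discarded) by the alignment step.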
An evaluation of the prototype was made with simulated datasets, and we observed that:

• It did not succeed regarding scalability, although the GNUMAP-based alignment algorithm required less time to execute no matter which method we used to generate the search keys;

• As expected, all the combinations led to a 100% coverage of the datasets, i.e., every single read was mapped to the reference sequence;

• By combining the GNUMAP-based algorithms we obtained a 100% precision, with the datasets entirely mapped to their original positions within the reference sequence, and more reads finding their best position, about 94%;

• However, the implementation of the Needleman-Wunsch method enriched with a similarity matrix led to more reads discovering other possible locations, reflected in a 7% increase. This is a particularly interesting result when dealing with real datasets because of the repetitive DNA sequences within the genome, but it also has advantages in finding SNPs and SNVs.
Real reads of Escherichia coli UTI89 were mapped to its genome sequence, allowing us to confirm the observations about scalability. Despite the results with simulated data, the prototype was only able to find the best and other possible locations when using the GNUMAP-based algorithm to create keys, obtaining better results when combined with the version of the Needleman-Wunsch method enriched with a similarity matrix, as inferred from the 28% and 11% of reads obtained for each task, respectively. Also, by assigning more than one search key to a read we improved the coverage up to 100%.
1.4 Overview
The next chapter introduces the historical path which ultimately brought us to this challenge.
Afterwards, we describe the NGS methodology and we refer to the main features of the most
commonly used platforms, as well as their errors and biases. The FASTQ file format, that plays
a significant role in our approach, is also presented. To conclude Chapter 2, we provide a brief
5
1. Introduction
review of some mapping tools currently available and cloud computing solutions that may be
used to improve our tool scalability. In Chapter 3, our approach is explained. First, we describe
its architecture and then the major implementation details are presented — where we include
the search key generation algorithms developed and the alignment methods supported by our
pipeline. The evaluation of the prototype is discussed in Chapter 4, in which we also explain
how the simulated FASTQ files were obtained and present the results for the mapping of real
reads from E. coli UTI89. The final chapter provides conclusions and discusses the aspects that
we believe must be taken into account in future improvements to our tool.
Chapter 2
Related Work
In this chapter, a historical background is presented to introduce some fundamental concepts of
genomics, the early beginnings of genome sequencing, and the Human Genome Project (HGP)
and its consequences. Then, differences between the most widely used Next-Generation
Sequencing (NGS) technologies are noted, as their inherent characteristics, namely sequencing
errors, also pose a challenge to mapping reads; the FASTQ file format, which the tool
presented in this thesis uses as input, is presented as well. Additionally, this chapter focuses on
mapping reads, by presenting a brief survey of algorithms for NGS data. Finally, cloud computing
solutions are mentioned since they can be used to improve the scalability of our prototype.
2.1 From DNA Discovery to the Human Genome Sequence
For centuries, farming techniques have been used to breed crops and animals with particu-
lar traits; however, it was only in the 19th century that Gregor Mendel published the results of
his investigation with peas and described how living organisms passed traits to their offspring
(Mendel, 1866).
In early 1869, Friedrich Miescher, while trying to understand the chemical basis of life,
discovered a new class of biological molecules in purified nuclei and called it “Nuclein” (Dahm,
2010). Six decades later, Phoebus Levene described the different types of nucleic acids,
ribonucleic acid (RNA) and deoxyribonucleic acid (DNA), and defined DNA as a sequence of units –
nucleotides – composed of a phosphate group, a deoxyribose sugar and one of four nitrogenous
bases: adenine, thymine, cytosine, and guanine (Levene and London, 1929). Meanwhile, Frederick
Griffith, in experiments with Streptococcus pneumoniae, determined that there must be a
genetic factor capable of transforming the bacteria (Griffith, 1928); this "transforming factor" was
demonstrated by Oswald Avery and his colleagues to be DNA (Avery et al., 1944). Finally,
Hershey and Chase confirmed DNA as the genetic material responsible for the heredity of
traits (Hershey and Chase, 1952).
From 1948 to 1952, Erwin Chargaff published a series of papers in which he concluded that
there is an adenine for every thymine, and a cytosine for every guanine in every living organ-
ism (Cohen and Portugal, 1974). These findings contributed to the DNA structure proposed by
James Watson and Francis Crick, based on the X-ray diffraction images of DNA obtained by
Maurice Wilkins and Rosalind Franklin (Watson and Crick, 1953a,b): the DNA is a double helix
in which two helical chains are hydrogen-bonded by complementary base pairs, adenine with
thymine and cytosine with guanine (Figure 2.1). Wilkins, Watson and Crick received the Nobel
Prize in Physiology or Medicine in 1962 for this discovery (Nobelprize.org, 2015b). No other
macromolecule in history has had its image so widespread in our society; it even received the
title “The Mona Lisa of Modern Science” (Kemp, 2003).
Figure 2.1: Pairing of nucleotide bases. Hydrogen bonds are shown dotted. Adapted from the work by Watson and Crick (1953b).
Then, in 1958, Francis Crick declared the "Central dogma of molecular biology" to explain
the transfer of the information contained in DNA to proteins (Crick et al., 1970); three years
later, he and his colleagues published their genetic experiments that, together with other works,
allowed them to observe the degeneracy of the genetic code: since an amino acid is coded by
a triplet (a group of three nucleotides), there are 64 possible codons to represent only 20 amino
acids (Crick et al., 1961). Afterwards, Nirenberg's team was able to relate 45 out of the 64
triplets with their respective amino acids and predict the remaining nucleotide sequences
(Nirenberg et al., 1965).
An important step in "reading" the content of a DNA sequence was made by Frederick
Sanger and colleagues when they determined the DNA sequence for the genome of bacterio-
phageΦX174 (Sanger et al., 1977a). Soon after, Allan Maxam and Walter Gilbert reported an ap-
proach to sequence DNA wherein terminally labelled DNA fragments were subjected to chem-
ical cleavage specific to each base and the reaction products resolved by polyacrylamide gel
electrophoresis (Maxam and Gilbert, 1977). Then, in the same year, Sanger described a new
sequencing method, applied to the genome of bacteriophage ΦX174, using DNA polymerase and
chain-terminating dideoxynucleotide analogs, thus causing base-specific termination of a newly
synthesised chain (Sanger et al., 1977b). This method revealed itself to be less laborious than
Maxam's. Both techniques led to half of the 1980 Nobel Prize in Chemistry being jointly awarded
to Sanger and Gilbert "for their contributions concerning the determination of base sequences
in nucleic acids" (Nobelprize.org, 2015a). With the cost of computer components beginning
to fall rapidly, allowing laboratories to have their own computers, and DNA sequencing
becoming a faster procedure, computer programs arose as a solution to handle and analyse data
produced by sequencing experiments (Staden, 1979). Sanger's method was adapted to 'shotgun'
sequencing, in which the DNA sequence assembly of overlapping smaller sub-sequences is
performed by computer software (Anderson, 1981). Further improvements to the Sanger sequencing
technique led to the adoption of fluorescent dyes, enabling computer-based automatic
base identification (Smith et al., 1986; Prober et al., 1987). Another example of the aid of
informatics in biology at this time is the FASTA program for protein and DNA sequence similarity
analysis and database searching (Pearson and Lipman, 1988); nowadays, FASTA is known as
the default text-based format for biological sequences.
When Robert Sinsheimer, then chancellor of the University of California in Santa Cruz,
proposed the possibility of sequencing the human genome in 1985, many thought his idea was
premature or even crazy, due to the demand for resources; however, in 1986, Charles DeLisi of
the U.S. Department of Energy (DOE) decided to fund research for genome sequencing and
mapping. Two years later, a special committee of the U.S. National Research Council of the U.S.
National Academy of Sciences recommended that the Human Genome Project be initiated, with
a deadline of 15 years and funding of about $200 million a year. In 1990, with James Watson
leading the National Institutes of Health (NIH) part of the now joint NIH-DOE project (Collins
et al., 2003), the Human Genome Project (HGP) started (Watson, 1990). This was the first large-
scale biology project, one that changed biology and the biomedical sciences: an international
endeavour that counted on the Sanger Centre (funded by the Wellcome Trust) and was assisted by
the private sector. The HGP promoted the development of new sequencing technologies, with
its need for high-throughput generation of biological data at low cost, which was boosted by
the advent of capillary sequencing machines. Its research on the legal, ethical, and social impact
of the knowledge being gathered, together with the collection of ever more biological data to be
analysed, annotated, stored, and made publicly accessible in user-friendly databases, created a
clear need for interdisciplinarity in genomics research.
Although the human genome was the flagship of the project, it also assembled the genomic
sequences of E. coli, S. cerevisae, C. elegans and D. melanogaster, and whole-genome
drafts of several other organisms, including the mouse and the rat, which opened the door to
Comparative Genomics (Collins et al., 1998, 2003). Thus, by February 2001, when the International Human
Genome Sequencing Consortium (Lander et al., 2001) and Celera Genomics (a private project
started in 1998) (Venter et al., 2001) reported the first draft of the human genome, the landscape
of biological and biomedical research had already started to change. The HGP successfully
ended two years earlier than initially planned (Collins et al., 2003), just in time to celebrate the
50th anniversary of the discovery of the DNA structure; the following year was marked by almost
99% of the euchromatic genome being sequenced to high accuracy (Collins et al., 2004). Nevertheless,
the understanding of the information encoded in the human genome was very limited, which
led to the launch of the Encyclopedia of DNA Elements (ENCODE) Project (Encodeproject.org,
2015) in September 2003, in which an international consortium, organized by the National
Human Genome Research Institute (NHGRI), received the task of identifying all the functional
elements encoded in the human genome sequence. There is still much to understand; however,
the results of the ENCODE project, combined with other large genomic data sets, may elucidate
the genetic and epigenetic factors responsible for the development and progression of human
diseases (Frazer, 2012), for example.
Since the Sanger sequencing method remained expensive despite having been heavily
refined and improved, the NHGRI initiated the “Advanced Sequencing Technology Development
Projects” in 2004 to motivate the development of low-cost sequencing, which led to next-generation
sequencing (NGS) technologies starting to become available. Although these high-throughput
technologies produce shorter reads, i.e., shorter synthesised DNA fragments, when compared
to the Sanger method, their parallelised sequencing process produces thousands of bases
per second at significantly reduced cost (Pettersson et al., 2009). NGS technologies are
improving biomedical investigation with clinical implications, such as cancer treatment (Ross and
Cronin, 2011; Bohlander, 2013; Offit, 2014) and infectious disease management (Pak and Kasarskis,
2015), while being widely used in many biological fields. The HGP's promise of change for biology,
biomedical research and health care (Collins et al., 1998) is fulfilled, with more to come
(Green et al., 2011).
"The ever quickening advances of science made possible by the success of the Human Genome
Project will also soon let us see the essences of mental disease. Only after we understand them at
the genetic level can we rationally seek out appropriate therapies for such illnesses as
schizophrenia and bipolar disease."
- James D. Watson (The New York Times, 2007)
2.2 Next-Generation Sequencing (NGS) Technologies
The automated Sanger method is considered a ’first-generation’ technology, in which the DNA
to be sequenced can be prepared by being randomly fragmented — sequencing library — and
then cloned to a plasmid vector and used to transform E. coli — for shotgun de novo sequen-
cing — or for PCR (Polymerase Chain Reaction) amplification carried out with primers that flank
Figure 2.2: Work flow of the Sanger sequencing method (a) versus second-generation sequencing (b). Adapted from the paper by Shendure and Ji (2008).
the target — for targeted resequencing. Both approaches output an amplified template: clonal
copies of the single plasmid insert within the bacterial colony (as depicted in Figure 2.2 (a)) or
PCR amplicons within a single reaction volume. The sequencing biochemistry takes place in
a ‘cycle sequencing’ reaction, within a microliter-scale volume, generating a ladder of ddNTP-
terminated, dye-labelled products, which are subjected to high-resolution electrophoretic sep-
aration of the single-stranded, end-labeled extension products in a capillary-based polymer gel;
finally, as fluorescently labelled fragments of discrete sizes pass a detector, the four-channel
emission spectrum is used to generate a sequencing trace, and software translates these traces
into DNA sequence while generating error probabilities for each called base (Shendure and Ji,
2008).
’Second-generation’ technologies is a term used to refer to multiple implementations of ’cyclic-array
sequencing’ and, although these approaches differ in biochemistry and array generation,
their work flows are conceptually similar (Figure 2.2 (b)). In comparison to Sanger sequencing,
these new technologies have the advantage of in vitro construction of a sequencing library,
followed by in vitro clonal amplification to generate sequencing features. Also, array-based
sequencing enables a much higher degree of parallelism than conventional capillary-based
sequencing; and, since its features are immobilized on a planar surface, they can be enzymatically
manipulated by a single reagent volume, leading to a drop in the effective reagent
volume (Shendure and Ji, 2008). Combined, these differences result in the cheap production of
an enormous volume of data, albeit with shorter reads.
2.2.1 Comparison Between NGS Platforms
Although there are a few commercially available platforms, Illumina, Roche 454 Sequencing,
and Applied Biosystems SOLiD have dominated the market (Zhang et al., 2011), being responsible for
a vast amount of the data produced by NGS technologies. Nowadays, Illumina stands out in the NGS
industry, and Roche announced the shutdown of its 454 operations by mid-2016 (McPherson,
2014). The reviews by Shendure and Ji (2008), Metzker (2010), and Liu et al. (2012b) explain the
details inherent to each sequencing method. The following discusses the fundamental aspects
of these methods to support a comparison between the three platforms 1:
• Illumina (Illumina, Inc., 2015) platforms rely on bridge PCR amplification to form clusters
with clonal DNA fragments; these fragments have free ends to which a universal sequen-
cing primer can be hybridised to initiate the sequencing reaction. Sequencing by syn-
thesis is the method adopted, wherein DNA synthesis is terminated by reversible termin-
ators following the incorporation of one of four modified nucleotides — each bearing one
of four fluorescent labels — by DNA polymerase. With sequencer options adapted to key
applications, Illumina systems have an output range from 20-39 Gb to 1.6-1.8 Tb with a
1 To compare the sequencers’ output and read lengths, the following metric is used: 1 base pair (bp); 1 000 000 bases = 1 megabase (Mb); 1 000 000 000 bases = 1 gigabase (Gb).
run time that ranges from 15 to 40 hours or 1 to 6 days. Currently, the maximum read
length ranges between 2 x 125 and 2 x 150 bp, depending on the Illumina model
employed.
• Roche 454 Sequencing (Roche Diagnostics Corporation, 2015) platforms use single stran-
ded DNA fragments that are captured by beads and emulsion PCR for clonal amplification.
The beads are deposited into individual wells where the sequencing is performed by the
pyrosequencing method; here, the amount of pyrophosphate released matches the amount of
incorporated nucleotide and promotes a chemical reaction that generates visible light. Currently, the GS
FLX+ System can be used with two sequencing kits: one produces reads with lengths up to
1000 bp, with a typical throughput of 700 Mb within 23 hours, and the other has a typical
throughput of 450 Mb within 10 hours of run time and a read length of up to 600
bp.
• Applied Biosystems SOLiD (Sequencing by Oligo Ligation Detection) (Thermo Fisher Sci-
entific Inc., 2015) sequencers also rely on emulsion PCR and adopted the technology of
two-base sequencing based on sequencing by ligation, an approach in which DNA poly-
merase is replaced by DNA ligase, as each sequencing cycle introduces a partially degener-
ate population of fluorescently labeled octamers. However, the 5500 W Genetic Analyzer
sequencer replaced the beads with direct amplification on FlowChip; depending on the
library used, read length can be 75 bp (fragment), 2 x 50 bp (mate-paired) and 50 bp x 50
bp (paired-end) with a of the throughput approximately 80 Gb to 160 Gb.
Targeted at clinical applications and small labs, Ion Torrent Systems (later acquired by
Life Technologies) launched the Personal Genome Machine (PGM), wherein DNA fragments
with specific adapter sequences are linked to surface beads (known as Ion Sphere Particles)
and then clonally amplified by emulsion PCR; proton release signals the incorporation of
nucleotides during synthesis. For the same market, Illumina developed the MiSeq. These two
platforms are similar in terms of utility and ease of work flow; however, the PGM has a higher
sequencing error rate (Quail et al., 2012). Roche also has a benchtop version of the 454 Sequencing
System: the GS Junior System.
2.2.2 Errors and Biases
Although all the different approaches introduced rely on a complex interplay of chemistry, hardware,
and optical sensors, they differ in other mechanical details, which affect the types of
sequencing errors and biases produced by each type of platform. At the end of each sequencing
pipeline is a piece of software that analyses the sensor data to predict the individual bases; this is
referred to as base-calling.
Solexa/Illumina platforms have been reported to have error rates that increase along the
read, in which G to T and A to C conversions are among the most frequent base substitution
errors (Dohm et al., 2008), and wrong base-calls are frequently preceded by the base G, showing a
GC bias in these platforms (Bravo and Irizarry, 2010; Minoche et al., 2011). Incorrect prediction
of the length of homopolymers (consecutive runs of the same base) leads to the insertion and
deletion errors associated with the Roche 454 platform (Ledergerber and Dessimoz, 2011). Since
all bases of a homopolymer are included in a single cycle, its length has to be inferred from the
signal intensity, thus, quality scores do not provide a measure that a base at a given position is
correct, but merely indicate that homopolymer length has been called correctly (Dohm et al.,
2008). The Ion Torrent PGM sequencer also presents limitations in sequencing homopolymers,
leading to a large amount of indel errors, and an AT bias (Quail et al., 2012). Finally, SOLiD
machines, which implement the sequencing-by-ligation method, are incapable of sequencing through
palindromic regions (Huang et al., 2012).
Software tools that aim to correct errors, such as Fiona (Schulz et al., 2014), have emerged
as solutions to improve downstream analysis (Yang et al., 2013).
2.2.3 Third Generation Sequencing Technologies
Second-generation sequencing technologies are commonly known as the next generation, but
a third generation has arisen with two main characteristics: PCR is not required before sequencing,
meaning a shorter DNA preparation time, and the signal is captured in real
time, i.e., the signal is monitored during the enzymatic reaction of adding nucleotides to the
complementary strand. The single-molecule real-time (SMRT) method, developed by Pacific
Biosciences, and Nanopore sequencing are approaches that belong to this new generation of
sequencing technologies (Liu et al., 2012b).
2.2.4 The FASTQ File Format
The sequencing technologies, such as Illumina and 454, produce a text-based output in which
the DNA fragments — i.e., reads — are represented by sequences with the letters A, C, G, T and
N; the first four letters represent nucleotide bases that can be present in a genome (Adenine,
Cytosine, Guanine, and Thymine respectively), and, since the sequencing reading process is not
perfect, in some cases the sequencer prefers to return a “not known” signal — hence the letter
N — instead of returning an incorrect value. These reads are known as base (or letter) space
reads, to distinguish them from the colour space reads produced by SOLiD platforms.
@HWI-ST745_0097:7:1101:1005:1000#0/1
TTCTTCATACAGGGAAGCCTGTGCTTTGTACTAATTTCTTCATTACAAGATAAAAGTCT
+HWI-ST745_0097:7:1101:1005:1000#0/1
<D=<D===<C<=<<=<EA.=<C<=B:<=<===<<C<=C==B;<=<=;=C=FC5';FB5!
@HWI-ST745_0097:7:1101:1006:1000#0/1
CGCGCCAGAATGAAAAACAGAGTTCAAATTTTAAATGGACTACATCCAATGTTAAATAT
+HWI-ST745_0097:7:1101:1006:1000#0/1
=>5C?+=862>6;=@7=C=;;8<=82=87:5C=<1FB4&=98C<<C<C=:<=::;EA3<
@HWI-ST745_0097:7:1101:1007:1000#0/1
AAATGGACTACATCCAATGTTAAATATAAAAAACAAAAAGATGTAAATTTTACTGTCAC
+HWI-ST745_0097:7:1101:1007:1000#0/1
<=<<=<<=<B:<=EA.<B:=C<<==<=<=<<=<<;B;===B;B:=B:B;<<==B:=<=D
Figure 2.3: Extract from a file in FASTQ format. File produced by the ArtificialFastqGenerator (Frampton and Houlston, 2012).
The FASTQ file (Figure 2.3) format is the de facto common format for sequencing data. It
provides a simple extension of the FASTA format, which is the ability to store a numeric score
associated with each nucleotide base in a sequence. Thus, a FASTQ file consists of three different
sub-sources: the headers (identifiers), sequence bases, and quality scores. The quality score for
a called base is defined in terms of the estimated probability of error (Pe):

QPhred = −10 × log10(Pe)
Phred scores are the de facto standard representation for sequence base qualities. In the
FASTQ format, Phred qualities, whose values range from 0 to 93, are encoded as ASCII characters
with codes between 33 and 126 (corresponding to printable characters), which gives a very
broad range of error probabilities, from 1.0 (a wrong base) to 10−9.3 (an extremely accurate base)
(Cock et al., 2010).
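As a minimal illustration of this encoding (a sketch covering only the Sanger/standard variant, with its ASCII offset of 33; the helper names are ours, not part of any tool discussed here), the conversion between quality characters, Phred scores, and error probabilities can be written as:

```python
# Sketch of Sanger-standard FASTQ quality decoding (ASCII offset 33).

def phred_from_char(c: str) -> int:
    """Phred quality score encoded by one printable ASCII character."""
    return ord(c) - 33

def error_probability(q: int) -> float:
    """Invert Q = -10 * log10(Pe) to recover the estimated error probability."""
    return 10 ** (-q / 10)

# '!' (ASCII 33) encodes Q = 0, i.e. Pe = 1.0 (a wrong base);
# '~' (ASCII 126) encodes Q = 93, i.e. Pe = 10 ** -9.3.
scores = [phred_from_char(c) for c in "<D=<D"]
```

The Solexa/Illumina variants mentioned below differ only in the offset and score definition, so the same sketch adapts to them by changing these two functions.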
As illustrated in Figure 2.3, the FASTQ file format represents each read with four lines, where:
1. the first line starts with the ’@’ character, followed by the record identifier and additional
information (such as length or paired-end read information); similar to the header of the
FASTA file format, it is a free-format field with no length limit or format restriction;
2. the second line holds the nucleotide base sequence, without white spaces; the use of
upper case is conventional (although not mandatory);
3. the third line begins with the character ’+’ and is optionally followed by the header from line
1; it only serves to signal the end of the sequence and the start of the next line;
4. the fourth and last line contains the ASCII-encoded quality scores and must contain as
many symbols as there are letters in line 2.
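A minimal sketch of a parser for the four-line records just described (our own illustrative code, assuming well-formed input with no line wrapping inside records):

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from FASTQ text lines."""
    it = iter(lines)
    for header in it:
        sequence = next(it).strip()
        separator = next(it)              # line 3: starts with '+'
        quality = next(it).strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):  # rule 4: one score per base
            raise ValueError("quality string length must match sequence")
        yield header[1:].strip(), sequence, quality
```

Feeding it the lines of Figure 2.3, for example, would yield the identifier HWI-ST745_0097:7:1101:1005:1000#0/1 together with its sequence and quality strings.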
Because of its simplicity, FASTQ has become widely used as a simple interchange file
format between tools. Solexa/Illumina has created its own versions of the FASTQ format, in which
a different range of Phred scores is used (Cock et al., 2010); however, the different formats
can easily be converted between one another using Open Bioinformatics Foundation (O|B|F, 2015) tools
(BioJava, 2015; Biopython, 2015; BioRuby, 2015; BioPerl, 2015; EMBOSS, 2015). On another note,
next-generation sequence reads are typically available online at the Sequence Read Archive
(SRA), which already provides tools to convert the available data to the FASTQ format (NCBI, 2015).
2.3 Survey of Read Mapping Algorithms
The emergence of NGS platforms enabled the production of billions of short reads with their
massively parallelised sequencing methods. Meanwhile, the Human Genome Project
established reference sequences for the human genome and some model organisms, such as
E. coli, S. cerevisae, mouse and rat (Collins et al., 1998, 2003), enabling resequencing using
short reads. Hence, NGS technologies have broadened the applicability spectrum of
genomic sequencing, with finding the true location of a read within a genome being a crucial step
in many projects, one whose result affects the downstream analysis. Today, an investigator has a
fair number of mappers — i.e., software to map the reads against a reference genome — available,
ranging from popular ones, like Bowtie (Langmead et al., 2009b), with the advantage
of being widely used and constantly updated 2, to recent ones that aim to outperform the
existing tools with a new approach, such as Arioc (Wilton et al., 2015). For instance, the works of
Holtgrewe et al. (2011) and Hatem et al. (2013) aim to help the user choose the best tool for
his needs.
The mapping process, i.e., aligning a read to a reference genome to find its true location,
is, from the informatics point of view, a string matching problem. Algorithms to match strings
were proposed long before the advent of NGS technologies (Baeza-Yates and Perleberg,
1992); however, although reads and genomes are simple strings constructed from the letters A, C,
G, T and N, the challenge lies in distinguishing between technical sequencing errors and genetic
variation within the sample. Thus, read mapping becomes an approximate string matching problem,
where the search for the read within the reference genome must allow some mismatches and
gaps between the two (Reinert et al., 2015), while at the same time efficiently managing large
amounts of data as well as a large search space in the form of a wide reference sequence. The
advances in sequencing technology have stimulated software development, with many approaches
arising from the beginning (Li and Homer, 2010; Fonseca et al., 2012). However, most
of the fast alignment algorithms build auxiliary data structures — the indices — for the reads
or the reference sequence to find the genomic positions for each read, and we can group the
mapping tools based on the method used to build the index: hash tables or Burrows-Wheeler
2Bowtie: http://bowtie-bio.sourceforge.net/index.shtml
Transform (BWT) (Burrows and Wheeler, 1994).
2.3.1 Algorithms based on Hash Tables
All hash table based algorithms essentially follow the same ’seed and extend’ paradigm established
by BLAST (Altschul et al., 1990). This method allows BLAST to find similar sequences not by
comparing the sequences in their entirety, but rather by locating short matches between the
two sequences — the seeds. After this first match, it extends and joins the seeds, first without
gaps, and then refines them with an improved Smith–Waterman alignment (Smith and Waterman,
1981; Gotoh, 1982). Finally, it outputs the statistically significant local alignments as the final
results. However, the algorithms that are relevant to our work focus on mapping a set of short
query sequences — the reads — against a long reference genome of the same species, and for
these the spaced seed — a seed that allows internal mismatches, whose number of matching
positions is its weight — is a popular approach (Li and Homer, 2010). The detection of seeds usually
follows one of two methods: index the reads and scan through the reference genome, or index the
reference genome and align each read.
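The second strategy can be sketched in a few lines (a didactic simplification of our own, not the implementation of any particular tool): build a hash table from the k-mers of the reference, then derive candidate read positions from exact seed hits.

```python
from collections import defaultdict

def build_kmer_index(reference: str, k: int):
    """Hash table mapping every length-k substring of the reference
    to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_starts(read: str, index, k: int):
    """Candidate alignment start positions for a read: for each exact
    seed hit, subtract the seed's offset inside the read."""
    starts = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            starts.add(pos - offset)
    return starts
```

Real mappers replace the exact lookup with spaced or adaptive seeds and verify each candidate position with an alignment algorithm such as Smith-Waterman.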
Index the reads and scan through the reference genome:
• MAQ (Li et al., 2008a), which uses the sequencing quality scores during mapping, splits
the reads to create adaptive seeds; to speed up the alignment, it only considers positions
that have two or fewer mismatches in the first 28 bp (default parameters). MAQ relies on
an ungapped alignment, but for the small fraction of unmapped reads it applies the
Smith-Waterman gapped alignment (Smith and Waterman, 1981).
• RMAP (Smith et al., 2008, 2009) also introduced quality scores into the mapping, but it
creates its spaced seeds using the pigeonhole principle (Baeza-Yates and Perleberg, 1992):
the reads are cut into k+1 pieces, allowing for at most k mismatches in a mapping, which
means any mapping must have at least one seed with no mismatches. RMAP does not
consider insertions or deletions (indels), so its strategy for handling indels is to extend
initial seed matches using a Smith-Waterman-style alignment. SeqMap (Jiang and Wong,
2008) follows the same pigeonhole principle to hash the reads, and since it splits the reads
and/or the genome into several parts, it can be used in parallel on large-scale data sets to
speed up the mapping process.
• RazerS (Weese et al., 2009) arose as a solution based on a q-gram counting strategy, allowing
for gaps within read subsequences of size q — the index keys — and searching for multiple
matches before the extension step. RazerS 3 (Weese et al., 2012) is an improved RazerS
able to map longer reads; it supports shared-memory parallelism and adds a second read
index based on the pigeonhole principle. To extend the matches, they rely on the Hamming
distance (Hamming, 1950) and on the edit distance algorithm from Hyyrö (2003).
• SHRiMP (Rumble et al., 2009) introduces a specialized algorithm for mapping colour space
reads from SOLiD sequencers, but it also maps base space reads from Illumina/Solexa. It
likewise relies on the q-gram counting strategy to find matches between the reads and the genome,
which are extended using the Smith-Waterman local alignment algorithm, implemented
using specialized “vector” instructions that are part of modern CPU instruction sets and,
hence, are efficient.
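The pigeonhole idea behind RMAP's and SeqMap's seeds can be illustrated with a short sketch (a hypothetical helper of our own, not RMAP's actual code): splitting a read into k + 1 pieces guarantees that, under at most k mismatches, at least one piece matches the reference exactly.

```python
def pigeonhole_pieces(read: str, k: int):
    """Split a read into k + 1 roughly equal pieces. If a mapping has at
    most k mismatches, at least one piece is an exact match (pigeonhole)."""
    n, pieces = len(read), k + 1
    bounds = [i * n // pieces for i in range(pieces + 1)]
    return [read[bounds[i]:bounds[i + 1]] for i in range(pieces)]
```

Each piece can then be looked up exactly in the reference index, and only the surviving candidate positions need full verification.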
Index the reference genome and align each read:
• SOAP (Li et al., 2008b), specifically designed for detecting and genotyping single nucle-
otide polymorphisms (SNPs), manages great amounts of NGS data by supporting multi-
threaded parallel computing and records the reference sequence and hash index tables
in memory. The GNUMAP (Clement et al., 2010) algorithm incorporates the base quality
scores into mapping analysis using a probabilistic variant of the Needleman-Wunsch al-
gorithm (Needleman and Wunsch, 1970) to accurately map reads with lower confidence
values; this tool creates overlapping contiguous k-mers — k-sized sequences — from the
genome sequence to build the index and splits the reads into a set of overlapping k-mers
to look up the index. Both tools were first designed for Illumina/Solexa data, but receive a
FASTQ file as input.
• SHRiMP2 (David et al., 2011) is an updated version of SHRiMP that switched to a genome
index, resulting in a dramatic speed increase, and allowed the use of multithreaded
computation. Also, to speed up the alignment, before starting the Smith-Waterman algorithm,
SHRiMP2 checks whether an identical region has already been aligned, so it can reuse the score.
This version supports Illumina/Solexa, Roche/454 and AB/SOLiD reads.
• mrFAST (Alkan et al., 2009) and mrsFAST (Hach et al., 2010) were both developed by
leveraging the same method, which creates a collision-free hash table to index k-mers from the
genome, interrogates the first, middle and last k-mers of each read in the hash table to
place initial ungapped seeds, and extends the seeds with a rapid version of the edit distance
(Levenshtein, 1966); however, the former supports gaps and mismatches while the latter
supports only mismatches, so as to lower its execution time. mrsFAST-Ultra (Hach et al.,
2014) improves the method of mrsFAST by compacting the index and adding parallelisation
and SNP-awareness features.
• Hobbes (Ahmadi et al., 2012) is based on generating overlapping substrings of length
q — q-grams — of the reference sequence, and constructs an inverted index of those q-gram
positions. The extension of the seeds uses the Hamming distance (Hamming, 1950)
and an implementation of the edit distance by Myers (1999). Hobbes2 (Kim
et al., 2014) is built on top of Hobbes, improving its performance in all aspects and scaling
well in a multithreaded environment. The update included an additional prefix q-gram
instead of bit vectors, reducing the memory consumption.
• MOSAIK (Lee et al., 2014) is a tool with the ability to map data from all major ’second’ and
’third’ generation sequencing technologies; it relies on an improved Smith-Waterman algorithm
(Gotoh, 1982) to align a read to a local region of the genome. MOSAIK creates overlapping
contiguous k-mers from the genome sequence to build a hash table. The reads are split
into a set of overlapping k-mers to query the stored reference hash table and retrieve the
genomic positions of each k-mer; a modified AVL tree (Adelson-Velskii and Landis, 1963)
is employed to handle and cluster the nearby positions to form a k-mer region.
• Adaptive seeds are an alternative to fixed-length seeds, such as the spaced seeds, as they
have their length extended until the number of matches in the target sequence is less than
or equal to a frequency threshold. First proposed by Kiełbasa et al. (2011), in a BLAST
variation, the adaptive seeds are used by AMAS (Hieu Tran and Chen, 2015) to speed up
the mapping process while preserving sensitivity and identifying all possible locations for
each read being mapped.
Recent approaches adapted the ’seed and extend’ method to parallel implementations based on specific hardware, like field-programmable gate arrays (FPGAs) (Chen et al., 2013) or graphics processing units (GPUs) (e.g., Masher (Abu-Doleh et al., 2013) and Arioc (Wilton et al., 2015)).
2.3.2 Algorithms based on Burrows-Wheeler Transform (BWT)
The Burrows-Wheeler Transform (BWT) is a data compression algorithm (Burrows and Wheeler,
1994) that was combined with a suffix array (Manber and Myers, 1993) — a sorted array of all suf-
fixes of a string — to create the FM-index (Ferragina and Manzini, 2000). Algorithms that transform the genome into an FM-index reduce the inexact matching problem to an exact matching one: they find exact matches with the index and then build inexact alignments supported by those exact matches. An advantage of this approach is that alignment to multiple identical copies of a subsequence in the reference needs to be done only once, whereas with a typical hash table index an alignment must be performed for each copy. Moreover, finding exact matches using backward search on an FM-index can be done in constant time per character (Li and Homer, 2010). However, despite the improvements in performance and its small memory footprint, building an FM-index takes significantly longer than building a hash table index (which in turn requires a large amount of memory to index wide genomes, like the human genome) (Fonseca et al., 2012; Hatem et al., 2013; Lee et al., 2014).
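The backward search just described can be illustrated with a toy sketch over a '$'-terminated text; the class and method names are illustrative, and this is in no way the implementation used by the aligners cited below (a real FM-index precomputes the rank values that this sketch scans for):

```java
import java.util.*;

// Toy FM-index: BWT from sorted rotations, plus backward search.
public class FmIndexSketch {
    final String bwt;
    final Map<Character, Integer> lessThan = new HashMap<>(); // C[c]

    FmIndexSketch(String text) {
        int n = text.length();
        Integer[] rot = new Integer[n];
        for (int i = 0; i < n; i++) rot[i] = i;
        // Sort all rotations (a stand-in for suffix array construction).
        Arrays.sort(rot, Comparator.comparing(
                i -> text.substring(i) + text.substring(0, i)));
        StringBuilder last = new StringBuilder();
        for (int r : rot) last.append(text.charAt((r + n - 1) % n));
        bwt = last.toString();
        // C[c] = number of characters in the text smaller than c.
        char[] sorted = text.toCharArray();
        Arrays.sort(sorted);
        for (int i = 0; i < n; i++) lessThan.putIfAbsent(sorted[i], i);
    }

    // Count of c in bwt[0..end); a real index precomputes these ranks.
    int occ(char c, int end) {
        int count = 0;
        for (int i = 0; i < end; i++) if (bwt.charAt(i) == c) count++;
        return count;
    }

    // Backward search: one C/occ update per pattern character.
    int count(String pattern) {
        int lo = 0, hi = bwt.length();
        for (int i = pattern.length() - 1; i >= 0; i--) {
            char c = pattern.charAt(i);
            if (!lessThan.containsKey(c)) return 0;
            lo = lessThan.get(c) + occ(c, lo);
            hi = lessThan.get(c) + occ(c, hi);
            if (lo >= hi) return 0;
        }
        return hi - lo;
    }

    public static void main(String[] args) {
        FmIndexSketch fm = new FmIndexSketch("ACGTACGT$");
        System.out.println(fm.count("ACG")); // 2
        System.out.println(fm.count("GTA")); // 1
    }
}
```

Note that the pattern is consumed from its last character to its first, which is why the technique is called backward search.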
Popular BWT-based aligners are:
• Bowtie (Langmead et al., 2009b), which creates indices small enough to be distributed
over the internet and easily accessible. Bowtie does not simply adopt the exact matching
algorithm to search the FM-index, because exact matching does not allow for sequencing
errors or genetic variations. So, it introduces a quality-aware backtracking algorithm that
allows mismatches and favours high-quality alignments. It employs a ’double indexing’,
a strategy to avoid excessive backtracking. Bowtie 2 (Langmead and Salzberg, 2012) ex-
tends the method applied in Bowtie to allow gapped alignment by dividing the algorithm
between an ungapped seed-finding stage and a gapped extension stage, that uses dynamic
programming. Bowtie 2 relies on the efficiency of single-instruction multiple-data (SIMD)
parallel processing to accelerate the dynamic programming.
• Burrows-Wheeler Alignment tool (BWA) (Li and Durbin, 2009) emerged with an algorithm similar to Bowtie's, but with a smaller search space, adapted to map both base space reads, e.g., from Illumina sequencers, and colour space reads from SOLiD machines. BWA-SW
(Li and Durbin, 2010) adds a Smith-Waterman-like dynamic programming mechanism to
BWA, so it can align long sequences up to 1000 base-pairs against a large sequence data-
base with a few gigabytes of memory. In a way, BWA-SW follows the ’seed and extend’
paradigm by finding seeds between two FM-indices, relying on dynamic programming,
and it extends a seed when it has few occurrences in the reference sequence; mismatches and gaps are allowed in the seeds. BWA-MEM (Li, 2013), implemented as a component of BWA, also follows the ’seed and extend’ paradigm; however, it initially
seeds an alignment with supermaximal exact matches using an algorithm from Li (2012),
which essentially finds at each query position the longest exact match covering the pos-
ition. While extending a seed, BWA-MEM tries to keep track of the best extension score
reaching the end of the query sequence, as a strategy to automatically choose between
local and end-to-end alignment.
• SOAP2 (Li et al., 2009b) is an improvement of SOAP (Li et al., 2008b) where the BWT com-
pressed index is used instead of the seed algorithm for indexing the reference sequence
in the main memory; a hash table is built to accelerate searching the location of a read in
the BWT reference index and determine an exact match. SOAP3 (Liu et al., 2012a) is an
optimised version of SOAP2, that achieves a significant improvement in speed by adapt-
ing the BWT index to the graphic processing unit (GPU). SOAP3-dp (Luo et al., 2013) is the
enhanced version of SOAP3 that takes advantage of the GPU-based approach to perform
dynamic programming for aligning a read with a candidate region in the genome; a modified Smith-Waterman algorithm is implemented, reporting alignments with indels and gaps.
• CUSHAW (Liu et al., 2012c) exploits the compute unified device architecture (CUDA) to
parallelise and accelerate an algorithm based on BWT that resorts to a FM-index. At the
time of the article publication, CUSHAW did not allow insertions and deletions; thus, the
search for inexact matches was transformed into a search for exact matches of all permutations of the possible bases at every position of a short read. By default, CUSHAW
supports a maximal read length of 128 (can be configured up to 256). CUSHAW2 (Liu and
Schmidt, 2012) follows the ’seed and extend’ approach, using memory efficient versions of
BWT and FM-index to generate seeds for each read; these seeds are based on maximal ex-
act matches (MEM) — exact matches that cannot be extended in either direction without
allowing a mismatch. CUSHAW2 aims to map longer reads, using the seeds to find gapped
alignments and by employing vectorization and multithreading to achieve fast execution
speed on standard multi-core CPUs. The Smith-Waterman algorithm is implemented to
compute the optimal local alignment scores. CUSHAW3 (Liu et al., 2014) supports both
base space and colour space reads, and it was developed to improve alignment sensitivity
and accuracy of CUSHAW2. It relies on a hybrid seeding approach to improve alignment
quality that creates MEM seeds based on BWT and FM-index, exact match k-mer seeds,
and variable-length seeds at different phases of the alignment pipeline. However, the hy-
brid seeding approach improves the alignment sensitivity and accuracy at the cost of a
significant loss of processing speed.
• Masai (Siragusa et al., 2013) first constructs a conceptual suffix tree of the reference genome, stores it on disk and reuses it for each read mapping job; then, at mapping time, the strategy to create the seeds is chosen according to the reference genome and the specified error rate. Each seed reported by a multiple backtracking algorithm is extended at
both ends by a banded version of the Myers bit-vector algorithm (Myers, 1999) presented
in RazerS 3 (Weese et al., 2012).
2.3.3 Best-mapper vs All-mapper
A best-mapper prioritizes candidate locations, and returns one or a few best mapping locations
for each read, mainly to achieve an optimal combination of speed, accuracy, and memory efficiency; BWT-based algorithms, such as Bowtie (Langmead et al., 2009b), Bowtie 2 (Langmead and Salzberg, 2012) and the versions of BWA (Li and Durbin, 2009, 2010; Li, 2013), apply an exact match search to achieve that optimal combination. The hash-table based MAQ (Li et al.,
2008a) and SOAP (Li et al., 2008b) are also best-mappers. MAQ always reports a single alignment, choosing a best position randomly if a read can be aligned equally well to multiple positions; and SOAP reports the best hit of each read, the one with the minimal number of mismatches or the smaller gap. In case of equal best hits, the user can instruct the program to report all of them, randomly report one, or disregard them all.
However, for some NGS applications an all-mapping task is essential, e.g. prediction of genomic variants, identification of protein binding motifs located in repeat regions, or isoform expression quantification (Alkan et al., 2009; Hach et al., 2010; Newkirk et al., 2011). Although best-mappers
may have an option to report all mappings, since their algorithms are designed around finding a single best match, they might not perform as well as mappers specialised in identifying as many matches as possible, if not all, within a reasonable time — the all-mappers. Most all-mappers follow the
’seed and extend’ paradigm, in which locations reported by the seeds of a read are used as candidates for extending the alignment to the rest of the read. Some well-regarded all-mapping tools
are mrFAST (Alkan et al., 2009) and mrsFAST (Hach et al., 2010), RazerS 3 (Weese et al., 2012),
Hobbes (Ahmadi et al., 2012), Hobbes2 (Kim et al., 2014), Masai (Siragusa et al., 2013) and AMAS
(Hieu Tran and Chen, 2015). On the other hand, when requested, MOSAIK (Lee et al., 2014) also outputs all possible mapping locations for every read in a separate output file, behaving simultaneously as a best-mapper and an all-mapper.
2.4 Genomics meets Cloud Computing
Handling large amounts of data is a challenge well known in informatics, brought about by the Internet and the natural evolution and massification of technology. To deal with the massive growth in the number of websites and in the information available on the Internet, Google developed the MapReduce system to process huge quantities of data efficiently and in a timely manner. This programming model and system allows work to be distributed among large numbers of servers and carried out in parallel; soon after, an open source project implementing the Google MapReduce system emerged: the Apache™ Hadoop® framework (The Apache Software Foundation, 2015b). The
parallel data processing system of MapReduce excels at exhaustive processing — e.g., executing
algorithms that must examine every single record in a file in order to compute a result (Olson,
2010).
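The map-and-reduce model itself can be sketched without Hadoop, e.g. with Java parallel streams; the record type and the counting task here are illustrative only, chosen to keep the example self-contained:

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce model: the 'map' phase emits a key per
// record and the 'reduce' phase aggregates per key, both running in parallel.
public class MapReduceSketch {
    static Map<String, Long> countReads(List<String> reads) {
        return reads.parallelStream()
                .collect(Collectors.groupingByConcurrent(r -> r, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> reads = Arrays.asList("ACGT", "ACGA", "TTGA", "ACGT");
        System.out.println(countReads(reads).get("ACGT")); // 2
    }
}
```

Hadoop applies the same split-then-aggregate idea across machines rather than across the cores of a single one.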
Cloud computing provides a scalable and cost efficient solution to manage large amounts
of data. It relies on a pay-per-use model that provides on-demand network access to a shared
platform of configurable computing resources, e.g. servers, storage, and services, which can
be rapidly provisioned and released with minimal management effort or service provider interaction. So, when it became more expensive to store, process, and analyse genomic data than to generate it, genomic algorithms started to leverage Hadoop (O’Driscoll et al., 2013), leading to the development of solutions such as Crossbow (Langmead et al., 2009a), a pipeline for single
nucleotide polymorphisms (SNPs) calling, CloudAligner (Nguyen et al., 2011) for sequence map-
ping, BioPig (Nordberg et al., 2013), a toolkit for sequence analysis, and CloudDOE (Chung et al.,
2014), a software to deploy a Hadoop cloud specifically thought for bioinformatics applications.
Apache Spark™ (The Apache Software Foundation, 2015a) is another MapReduce-based cluster computing framework, which supports applications with working sets. Spark can outperform Hadoop by 10x in iterative machine learning jobs and can be used interactively to scan
a 39 GB dataset with sub-second latency (Zaharia et al., 2010). SparkSeq (Wiewiórka et al., 2014),
SparkSW (Zhao et al., 2015), which relies on the Smith-Waterman (SW) algorithm (Smith and Waterman, 1981) to align the sequences, and eXpress-D (Roberts et al., 2013), which targets the
problem of reads mapped to multiple locations, are Spark-based tools for NGS data analysis.
Therefore, cloud computing and big data technologies have a future within biological sci-
ences and biomedical research, enabling users to rapidly interrogate the characteristically vast
datasets produced by NGS platforms. For instance, the work by Onsongo et al. (2014) demon-
strates how NGS data analysis paired with cloud computing can be safely and reasonably used
in a clinical molecular diagnostics laboratory.
Chapter 3
Read Mapping Pipeline
As discussed in the previous chapter, several genomic read mapping tools have been proposed
in the past that resort to different strategies. In this chapter, we introduce the architecture of
our proposed solution, which is based on a pipeline for read mapping. We start by providing
the high level view of the architecture and then we explain the most interesting implementation
details of our prototype, as well as our proposed algorithms for generating search keys and for
read alignment in relation to a reference genome. The source code for the implementation is available in a public repository [1].
Following the ’seed and extend’ strategy, our pipeline, first, creates an index from the ref-
erence sequence: a hash table in which a k-sized subsequence is the search key of each entry,
and the value is a list of the genomic positions where the subsequence can be found. Then, for
each read a subsequence of size k is retrieved serving as a key to search within the reference
sequence. When a hit is found — a seed — the whole read is aligned with the genomic sequence
— it is extended. What we propose here is to expand the search space by assigning more than
one search key for each read, i.e., to increase the number of seeds, and we developed a few al-
gorithms to do it. The extend part has an algorithm of its own to perform the alignment between
a read and a genomic subsequence, which can also be changed.
[1] https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
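The ’seed and extend’ flow just described can be sketched as follows; the class, method and variable names are illustrative and do not correspond to the actual modules of the pipeline:

```java
import java.util.*;

// Minimal 'seed and extend' sketch: hash the genome's k-mers, seed with the
// read's first k bases, extend by counting mismatches at each seed position.
public class SeedExtendSketch {
    // Build the index: each k-sized subsequence -> list of genomic positions.
    static Map<String, List<Integer>> index(String genome, int k) {
        Map<String, List<Integer>> idx = new HashMap<>();
        for (int i = 0; i + k <= genome.length(); i++)
            idx.computeIfAbsent(genome.substring(i, i + k), s -> new ArrayList<>()).add(i);
        return idx;
    }

    // Seed with the read's first k bases, then 'extend' the whole read.
    static List<Integer> map(String genome, Map<String, List<Integer>> idx,
                             String read, int k, int maxMismatches) {
        List<Integer> hits = new ArrayList<>();
        for (int pos : idx.getOrDefault(read.substring(0, k), Collections.emptyList())) {
            if (pos + read.length() > genome.length()) continue;
            int mism = 0;
            for (int j = 0; j < read.length(); j++)
                if (read.charAt(j) != genome.charAt(pos + j)) mism++;
            if (mism <= maxMismatches) hits.add(pos);
        }
        return hits;
    }

    public static void main(String[] args) {
        String genome = "GCCTAAGCCTAAGCCTAAGCCT";
        Map<String, List<Integer>> idx = index(genome, 4);
        System.out.println(map(genome, idx, "AAGCCTAA", 4, 2)); // [4, 10]
    }
}
```

The pipeline replaces the single key of this sketch with several per read, and the mismatch count with a proper alignment algorithm, as explained in the following sections.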
3.1 Architecture
Our approach follows a modular programming paradigm, where each module is responsible for
one part of the pipeline, and allows exploring different combinations of the key generation and read alignment algorithms:
• Genome: reads a FASTA file with the reference sequence in text format to be hashed. The
hashed reference sequence is stored here, as well as the information required by some of
the following modules of the pipeline;
• Read: gives a functional sense to the information retrieved from a FASTQ file (Cock et al.,
2010) that corresponds to a single read, which is characterised by its identification header,
base sequence, i.e. the read itself, and the probabilistic values associated with the quality
of each base in the read;
• Probabilities: the quality scores for each base in the read are encoded in ASCII; here they are converted to probabilistic values. It also calculates the probability values for the remaining bases, meaning their probability of being the correct base at that position of the read, assuming that the one called is wrong.
Since one of our goals is to try different algorithms to create search keys from the reads and to
align them with the reference sequence (this step effectively controls the regions of the reference genome that are inspected for each read), we also have the modules:
• Exploder — which will receive the algorithms to generate multiple search keys for a read;
• Aligner — to process the alignment between the Read base sequence and a substring of
the Genome using different implementations of the Needleman-Wunsch method (Needleman and Wunsch, 1970).
Finally, we also consider two different ways to present the results — one based on a top rank and another which is GNUMAP-based (Clement et al., 2010). The Combiner module handles the interconnection between the components presented, being responsible for managing the flow of the pipeline (Figure 3.1).
Figure 3.1: Read Mapping Pipeline scheme. The arrows indicate how the different modules are connected. In white are represented the modules that can use different algorithms to execute their assigned tasks.
3.2 Implementation
Java SE-1.7 was used to implement the prototype of our solution, for which the user provides a FASTA file — the reference sequence in text format —, a FASTQ file — containing a set of reads — and a .properties file (Figure 3.2) through the command line. The latter file contains all the parameters our pipeline needs, and gives the advantage of not having to retype all the required parameter values at the console each time we run the tool, i.e., it keeps the command line less verbose.
A Class (Oracle Corporation, 2015n) named Start has the main method (Oracle Corpora-
tion, 2015p), wherein the Genome and the Read objects (Oracle Corporation, 2015q) are created
from the files provided. Using an instance of the class Properties (Oracle Corporation, 2015h) we
can load the following relevant parameters from the .properties file:
• the k value to be used by classes Genome, to hash the reference sequence, and Read, to
return a k-sized key from the read that will serve as a basis for an instance of Exploder in the generation of new keys;
• the names of the classes used to instantiate the Exploder and Aligner modules, as we allow the user to write their own mechanisms for these modules and run them in our pipeline
with ease (an interface has to be respected when implementing new approaches for these
components);
• the name of the class that will be used to present the results for the class Combiner; and,
• the location where the output files should be stored.
k=10
exploder=keys_exploder.Exploder_0
aligner=alignment.NW
comb_type = util.LLC_Comb
cuttof = 0.90
threshold = 0.90
top = 3
output_dir = /local
DEBUG=false

Figure 3.2: Example of the content of a .properties file. In the case of the algorithms, the user must pay attention to the names, since the code is divided into packages (Oracle Corporation, 2015o). Here, the subclass Exploder 0 is set to instantiate the Exploder and NW the Aligner; although both top and threshold have values assigned, the Combiner object will be created with LLC_Comb, which uses the display based on a top rank.
In addition, the .properties file has three parameters whose use depends on the class names given: the cutoff, which is only used by some instances of the class Exploder; the top, which is required to present the results based on a top rank; and the threshold, in case the user chooses the GNUMAP-based (Clement et al., 2010) way to sort the results. Hence, an object of the class LLCProperties is created with the mentioned Properties instance, allowing these parameters to be retrieved without passing them through the constructors of the classes Exploder and Combiner. This way, the values for the cutoff, the top and the threshold are only fetched when needed (Figure 3.2).
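Loading such a file boils down to the standard java.util.Properties mechanism; a minimal sketch follows, with the file content inlined so the example is self-contained (the pipeline reads the file given on the command line instead, and the class name here is illustrative):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Parse .properties content in the format shown in Figure 3.2.
public class PropertiesDemo {
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot happen for in-memory input
        }
        return props;
    }

    public static void main(String[] args) {
        String content = "k=10\nexploder=keys_exploder.Exploder_0\ncuttof = 0.90\ntop = 3\n";
        Properties props = parse(content);
        int k = Integer.parseInt(props.getProperty("k"));
        double cutoff = Double.parseDouble(props.getProperty("cuttof"));
        System.out.println(k + " " + props.getProperty("exploder") + " " + cutoff);
        // 10 keys_exploder.Exploder_0 0.9
    }
}
```

Note that Properties.load skips the whitespace around the "=" sign, so "cuttof = 0.90" and "cuttof=0.90" are equivalent.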
3.2.1 Multithreading
Mapping billions of reads to a wide reference genome is a computationally heavy process, even
more if we intend to expand the search space within the reference genome. Resorting to multithreaded parallel computing, as Li et al. (2008b) did and others followed, will lighten the process and make it globally faster by distributing the work across various processors. Hence, the multithreading feature, supported by all current operating systems, is extensively used by our prototype, so that we can distribute the work among the processor cores available in a machine.
The class Start holds the main functionality that materialises the use of multithreading.
A thread pool is established using the newFixedThreadPool method (Oracle Corporation, 2015r)
and, to execute a new thread, the class ReadProcessing, which implements the interface Runnable (Oracle Corporation, 2015l), was created.
We have as many threads as the number of processors available to the Java virtual machine
(minus one, which we decided to keep free to minimize interference with other operative system
tasks), and an equal number of lists with Read objects and of FileWriter instances — to write the
output files — are also created and associated with each of the processing threads. This allows
threads to own their own resources and operate with minimal coordination, a key aspect to
ensure a good performance. Afterwards, each thread, executed with a new instance of the class
ReadProcessing, receives a list of Read objects, the Genome object, a FileWriter object and the
information retrieved from the .properties file. For each Read instance, the implemented run()
method creates an instance of the class Combiner to call its combine method to process the
mapping pipeline. The pipeline results are retrieved with the getResults method from Combiner
and an instance of the class StringBuffer (Oracle Corporation, 2015i) constructs a String from
them, which the FileWriter object writes in an output file.
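The scheme above — a fixed pool sized to the available cores, with each task owning its slice of the reads — can be sketched as follows; the per-read work is replaced by a simple counter, and the names are illustrative rather than the prototype's actual classes:

```java
import java.util.*;
import java.util.concurrent.*;

// Fixed thread pool where each task owns a disjoint (strided) slice of the
// reads, so the threads need no coordination while working.
public class ThreadPoolSketch {
    static int processAll(List<String> reads, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int id = t;
                results.add(pool.submit(() -> {
                    int processed = 0;
                    // This thread's slice: indices id, id + threads, ...
                    for (int i = id; i < reads.size(); i += threads) processed++;
                    return processed;
                }));
            }
            int total = 0;
            for (Future<Integer> f : results) total += f.get();
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // One thread per core, minus one kept free, as in the prototype.
        int threads = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        System.out.println(processAll(Arrays.asList("ACGT", "TTGA", "CCGA", "GGAT"), threads)); // 4
    }
}
```

Every read is processed exactly once regardless of the pool size, which is the property the partitioning of Read lists in the prototype also guarantees.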
3.2.2 Genome
The Genome class was created to manage the information retrieved from the FASTA file, which contains the genomic sequence in text format. First, the genomic reference sequence is assembled and an index is created, where each entry has a subsequence — the key — and a list
of genomic positions in which the key is found — the value. Therefore, an instance of this class
is constructed with the name of the FASTA file and the value of k. Then, a StringBuffer instance
creates a String from the sequence lines read from the FASTA file — i.e., genome. The genome
is then converted to a character array, so it can be iterated and directly accessed (for performance), and a HashMap (Oracle Corporation, 2015e) structure stores each k-sized sequence of bases — the key — and an ArrayList (Oracle Corporation, 2015b) with its genomic positions — the value.
Afterwards, the index will be scanned to retrieve the genomic locations that match a subsequence — a key — from the read, i.e., a seed is searched. Once a hit is found, for the extend part a genomic subsequence needs to be aligned with the read; hence, this class also implements the method genSeq, which, given the reference sequence as an array of characters, the size of the read and the list of genomic positions obtained from the seed search, returns an ArrayList containing a collection of SimpleEntry (Oracle Corporation, 2015a) instances. Because we need to know where the read was mapped (in case of a positive alignment), each SimpleEntry instance is composed of a subsequence of the genome to be aligned — a character array — and
its respective position — an Integer.
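The indexing and genSeq steps just described can be sketched as follows; this is an approximation for illustration only, not the actual Genome class, whose state and signatures differ:

```java
import java.util.*;

// Sketch of the Genome index plus a genSeq-like step that pairs each
// read-sized genomic subsequence with its position.
public class GenomeSketch {
    static Map<String, ArrayList<Integer>> index(char[] genome, int k) {
        Map<String, ArrayList<Integer>> idx = new HashMap<>();
        for (int i = 0; i + k <= genome.length; i++)
            idx.computeIfAbsent(new String(genome, i, k), s -> new ArrayList<>()).add(i);
        return idx;
    }

    // Pair each candidate region with its position, so a positive alignment
    // can be traced back to the genome.
    static List<AbstractMap.SimpleEntry<char[], Integer>> genSeq(
            char[] genome, int readLength, List<Integer> positions) {
        List<AbstractMap.SimpleEntry<char[], Integer>> regions = new ArrayList<>();
        for (int pos : positions)
            if (pos + readLength <= genome.length)
                regions.add(new AbstractMap.SimpleEntry<>(
                        Arrays.copyOfRange(genome, pos, pos + readLength), pos));
        return regions;
    }

    public static void main(String[] args) {
        char[] genome = "GCCTAAGCCTAAGCCTAA".toCharArray();
        Map<String, ArrayList<Integer>> idx = index(genome, 4);
        for (AbstractMap.SimpleEntry<char[], Integer> e : genSeq(genome, 6, idx.get("TAAG")))
            System.out.println(new String(e.getKey()) + " @ " + e.getValue());
        // TAAGCC @ 3
        // TAAGCC @ 9
    }
}
```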
3.2.3 Read
In the method that serves as the entry point to the class Start, the FASTQ file is read; each Read
object is created with the identification header, base sequence and line of scores for a read, and
with the value of k. The ASCII coded scores are used to create an array of Probabilities instances,
from which we get the probability associated with each of the four bases for each position of the
read.
The class Read implements the following methods:
• simpleKey — that returns a k-sized subsequence from the beginning of the read;
• bestKey — wherein the subsequence of k bases with the best score, obtained by a sliding
window algorithm (Clement et al., 2010) (Figure 3.3), is returned; and
• getSeqAlign — which returns the read as a character array to be used at the alignment by
an instance of the class Aligner.
The methods simpleKey and bestKey will be used by classes of the module Exploder.
Figure 3.3: Sliding window. A k-sized window moves one position in the sequence at a time to retrieve a subsequence with k bases (Clement et al., 2010).
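The bestKey sliding window can be sketched as follows, assuming the window score is the sum of the per-base probabilities of being correct; the actual scoring used in the pipeline and in GNUMAP may differ in detail, and the names are illustrative:

```java
// Slide a k-sized window over the read and keep the window whose bases have
// the highest summed probability of being correct.
public class BestKeySketch {
    static int bestKeyStart(double[] p, int k) {
        int bestStart = 0;
        double bestScore = -1;
        for (int i = 0; i + k <= p.length; i++) {
            double score = 0;
            for (int j = i; j < i + k; j++) score += p[j]; // window score
            if (score > bestScore) { bestScore = score; bestStart = i; }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        String read = "ACGTACGTTG";
        // Per-base probabilities of being correct (decoded from FASTQ scores).
        double[] p = {0.2, 0.9, 0.9, 0.9, 0.9, 0.3, 0.4, 0.9, 0.9, 0.9};
        int start = bestKeyStart(p, 4);
        System.out.println(read.substring(start, start + 4) + " @ " + start); // CGTA @ 1
    }
}
```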
3.2.4 Probabilities
In the FASTQ file, the scores for each base called are ASCII coded and have values between 33 and 126 (Cock et al., 2010); we convert them to values between 0 and 93 (Phred scale) and then to probabilistic values using the Phred equation:

Q_Phred = −10 × log10(P_e)

which relates the quality score to the probability of error (P_e) for the base called. We want the probability of each base being correct (P), given by

P = 1 − 10^(−Q_Phred/10)

Since the FASTQ file only gives the quality for the base called, we assume the uncalled bases all have the same probability:

P_uncalled_base = (1 − P) / 3

In the case of the "unknown" character "N", each of the four bases has a 25% probability of being correct. Hence, a Probabilities object stores, for each of the four bases and "N", their probabilities of being correct.
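The conversion above can be sketched as follows, assuming Sanger-style FASTQ encoding (ASCII offset 33); the class and method names are illustrative, not those of the actual Probabilities class:

```java
// Phred conversion: ASCII quality character -> probability of a correct call.
public class PhredSketch {
    // Probability that the called base is correct, from its ASCII quality char.
    static double probCalledCorrect(char ascii) {
        int qPhred = ascii - 33;                  // Phred scale, 0..93
        return 1.0 - Math.pow(10.0, -qPhred / 10.0);
    }

    // Each of the three uncalled bases shares the remaining probability.
    static double probUncalled(char ascii) {
        return (1.0 - probCalledCorrect(ascii)) / 3.0;
    }

    public static void main(String[] args) {
        char q = '+';                             // ASCII 43 -> Q_Phred = 10
        System.out.println(probCalledCorrect(q)); // ~0.9
        System.out.println(probUncalled(q));      // ~0.0333
    }
}
```

For example, a quality character of '+' (ASCII 43) corresponds to Q_Phred = 10, i.e. a 1-in-10 chance of error, so the called base is correct with probability 0.9 and each uncalled base with probability 0.1/3.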
3.2.5 Exploder
After the management and storage of the information from the input files, the next step in our
pipeline is to generate the keys for the reads to scan the reference genome. In this work, we
propose to expand the search space by assigning more than one search key to each read (thus, ’exploding’ the number of keys from one to several), and we developed a series of algorithms to do it. An Abstract class (Oracle Corporation, 2015m) allows sharing the common code and parameters between the subclasses that implement these algorithms, and easily selecting one of them to create an Exploder object that performs the task.
Our series of exploding algorithms relies on a k-sized subsequence of the read as a basis to generate new keys. So, for a simple comparison, the algorithm from the subclass Exploder 00
calls the method simpleKey from class Read, to simply return the first k bases of the sequence
as a key retrieved from position zero. On the other hand, the subclass Exploder 0 returns the key computed by the method bestKey from class Read, along with its position in the read; the best
key is retrieved by a sliding window algorithm (Clement et al., 2010) and corresponds to the
subsequence of k bases with the best score.
The subclass Exploder 1 implements an algorithm where the best key serves as a template to generate new keys resorting to base permutation (Figure 3.4); the subclasses Exploder 2, 3 and 4 implement versions of that algorithm where biological constraints and/or quality values are used to reduce the number of keys generated. We also have the subclass GNUMAP, which implements the algorithm used by Clement et al. (2010) in their tool to create keys for the
reads.
The class Exploder needs a Read object and the value of k to be instantiated, and collects, in an ArrayList, SimpleEntry objects composed of a String — the base sequence generated — and an Integer — its position in the read. The different algorithms are coded in the abstract method explode, implemented in the referred subclasses, which takes the Read object, an index value, so the algorithm knows where to start, and two ArrayLists, one to record the temporary results and another for the final results. The executeExplosion method calls explode
with index zero and the required parameters; and explodeKeys returns the ArrayList with the final results. When needed, the cutoff value is retrieved from the .properties file using an instance of the LLCProperties class.

Figure 3.4: Scheme of the algorithm for the best key explosion (illustrated for the best key CTCACCCGTT). Each position of the best key has its base exchanged for each of the three remaining bases to generate three new keys. The keys created will expand the search space over the genome for each read. Since the key has a size of 10 bases, in this example 1 048 576 keys would be returned to be searched.
Exploder 1
Our approach followed the idea of expanding the reference sequence search space by taking
into account more than one key for each read. The algorithm implemented in this subclass follows the one depicted in Figure 3.4, where from each position of the best key three new keys are generated by a base exchange. In other words, since we have four nucleotide bases (A, C, G and T), for each position three new keys are created by exchanging the current base for each of the remaining three. The new keys generated will go through the same base permutation at the next position. Therefore, this algorithm generates new keys assuming every base called could be wrong, exploding the number of search keys from one to 4^k.
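The explosion can be sketched recursively, as below; the names are illustrative, and the real subclass works over SimpleEntry lists with read positions rather than plain strings:

```java
import java.util.*;

// Enumerate every key obtainable by substituting any subset of positions of
// a best key with the other three bases, i.e. all 4^k k-mers.
public class Exploder1Sketch {
    static final char[] BASES = {'A', 'C', 'G', 'T'};

    static void explode(char[] key, int index, List<String> out) {
        if (index == key.length) { out.add(new String(key)); return; }
        char original = key[index];
        for (char b : BASES) {           // the original base plus the three exchanges
            key[index] = b;
            explode(key, index + 1, out);
        }
        key[index] = original;           // restore before backtracking
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        explode("ACG".toCharArray(), 0, keys);
        System.out.println(keys.size()); // 64 = 4^3
    }
}
```

For a k of 10 this already yields over a million keys per read, which motivates the restricted variants below.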
Exploder 2
Scanning a wide reference genome in search of 4^k keys, even with parallel computation, requires great processing power. Therefore, to narrow the number of keys to search, the algorithm implemented follows a scheme similar to the one depicted in Figure 3.4, but the base permutation only occurs if the base probability (retrieved from the array of Probabilities objects) is lower than
the cutoff value. Thus, only positions in which the base called has a low probability of being
correct will generate three new keys using the base permutation.
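The quality gate can be sketched as follows; none of these identifiers come from the actual code, and the probability array stands in for the Probabilities objects described above:

```java
import java.util.ArrayList;
import java.util.List;

public class Exploder2Sketch {
    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Permutes a position only when the probability of the called base being
    // correct falls below the cutoff; confident positions keep their base.
    public static List<String> explode(String bestKey, double[] probs, double cutoff) {
        List<String> keys = new ArrayList<>();
        keys.add(bestKey);
        for (int pos = 0; pos < bestKey.length(); pos++) {
            if (probs[pos] >= cutoff) continue;          // confident call: no permutation
            List<String> expanded = new ArrayList<>();
            for (String key : keys) {
                for (char b : BASES) {
                    if (b == key.charAt(pos)) continue;  // three new keys per existing key
                    char[] chars = key.toCharArray();
                    chars[pos] = b;
                    expanded.add(new String(chars));
                }
            }
            keys.addAll(expanded);
        }
        return keys;
    }
}
```

With m low-confidence positions this produces 4^m keys instead of 4^k, which is the narrowing described above.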
Exploder 3
Another strategy to narrow the number of keys generated by the algorithm from Exploder 1
is to consider biological constraints when performing the base exchange. There are two types
of nucleotide base substitution (Figure 3.5): between two of the two-ring purines or two of the
one-ring pyrimidines (transition), and between one purine and one pyrimidine (transversion)
(Freese, 1959).
[Diagram: transitions link A and G (purines) and C and T (pyrimidines); transversions link a purine to a pyrimidine.]
Figure 3.5: Definition of transition and transversion. The nitrogenous bases are divided in two groups: pyrimidines, which include Cytosine (C) and Thymine (T), and purines, the double-ringed bases, which include Adenine (A) and Guanine (G).
Due to the degeneracy of the genetic code, a transition is more likely to encode the same
amino acid, while transversions have more pronounced effects. As one can see in Figure 3.5,
there are twice as many possible transversions as transitions; however, approximately two out
of three single nucleotide polymorphisms (SNPs) are transitions (Collins and Jukes, 1994).
Accordingly, the algorithm of this subclass follows the scheme of Figure 3.4 taking only
transitions into account, i.e., for each position of the best key a new key is created by exchanging
the current base for its molecularly similar base. The new keys generated go through the same
base permutation at the next position. Thus, 2^k keys are returned to be searched.
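A minimal sketch of the transition-only explosion, with names of our own choosing (A pairs with G, C pairs with T):

```java
import java.util.ArrayList;
import java.util.List;

public class Exploder3Sketch {

    // Transition partner: A <-> G (purines), C <-> T (pyrimidines).
    public static char transition(char base) {
        switch (base) {
            case 'A': return 'G';
            case 'G': return 'A';
            case 'C': return 'T';
            default:  return 'C';  // 'T'
        }
    }

    // Each position is either kept or swapped for its transition partner,
    // so a key of size k yields 2^k search keys.
    public static List<String> explode(String bestKey) {
        List<String> keys = new ArrayList<>();
        keys.add(bestKey);
        for (int pos = 0; pos < bestKey.length(); pos++) {
            List<String> expanded = new ArrayList<>();
            for (String key : keys) {
                char[] chars = key.toCharArray();
                chars[pos] = transition(chars[pos]);
                expanded.add(new String(chars));
            }
            keys.addAll(expanded);
        }
        return keys;
    }
}
```

For k = 10 this gives 2^10 = 1024 keys, a large reduction over the 4^k of Exploder 1.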
Exploder 4
The number of keys created by Exploder 3 can be decreased if we take the base quality scores
into account. This way, the algorithm implemented by the subclass Exploder 4 restricts the key
production seen in Exploder 3 by generating a new key only if the current position of the best
key has a base called with a probability lower than the cutoff value. This means that Exploder 4
creates 1/3 of the keys compared with Exploder 2, when the best key has bases with a probability
lower than the cutoff value.
GNUMAP
Finally, we use a GNUMAP-based algorithm to explode keys wherein a consensus sequence of
bases is created, meaning bases with a lower probability of being correct are switched for one
of the remaining bases that has a higher probability; this approach was designed for the files
produced in the Solexa/Illumina pipeline, where a _prb.txt file holds probabilities for each of
the four bases (Clement et al., 2010). A sliding window (Figure 3.3) goes through the consensus
string and, if the k-sized sequence does not contain a single "N", it is taken as a key together
with its position; otherwise, the k-sized window moves to the next position.
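The sliding-window step can be sketched in a few lines; the class and method names here are ours, and SimpleEntry mirrors the (key, position) pairs used elsewhere in the pipeline:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowSketch {

    // Slides a k-sized window over the consensus string; any window that
    // contains an 'N' is skipped, the others become (key, position) pairs.
    public static List<SimpleEntry<String, Integer>> keys(String consensus, int k) {
        List<SimpleEntry<String, Integer>> result = new ArrayList<>();
        for (int pos = 0; pos + k <= consensus.length(); pos++) {
            String window = consensus.substring(pos, pos + k);
            if (window.indexOf('N') < 0) {
                result.add(new SimpleEntry<>(window, pos));
            }
        }
        return result;
    }
}
```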
3.2.6 Aligner
Bioinformatics often resorts to dynamic programming to find an alignment between two
sequences; an example is the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), a
global sequence alignment algorithm. Although developed to align two full-length proteins, it
can also be applied to nucleotide sequences. This dynamic programming algorithm guarantees
to find the correct optimal alignment between two sequences of length n that are similar
across their entire lengths (as expected to occur between a read and the genomic subsequence
retrieved).
Since we aim to try three versions of the Needleman-Wunsch method that only differ in
two aspects (how an alignment score is calculated and the value attributed to a gap), we
also implemented the module Aligner as an abstract class. The class Aligner has the task of
aligning the base sequence of each Read (read), a character array obtained through the method
getSeqAlign, to a subsequence of the Genome (g_seq), a character array from the method
genSeq. Algorithms 1 to 3 present the pseudocode followed in our implementation of the
Needleman-Wunsch method, where a matrix is built for the alignment.
Algorithm 1 Initialise matrix.
    matrix ← [length(g_seq) + 1][length(read) + 1]
    for i = 0 to length(g_seq) do
        for j = 0 to length(read) do
            if i = 0 then
                matrix[i][j] ← -j
            else if j = 0 then
                matrix[i][j] ← -i
            else
                matrix[i][j] ← 0
            end if
        end for
    end for
Algorithm 2 Fill matrix.
    for i = 1 to length(g_seq) do
        for j = 1 to length(read) do
            Match ← matrix[i-1][j-1] + weight(i, j)
            Insertion ← matrix[i][j-1] + gap()
            Deletion ← matrix[i-1][j] + gap()
            matrix[i][j] ← max(Match, Deletion, Insertion)
        end for
    end for
In Algorithm 1, the first line (index 0) of the matrix represents the genomic sequence
and the first column (index 0) the read; the rest of the matrix is filled with scores according
to the equations present in Algorithm 2. At the end, Algorithm 3 deduces the best alignment
Algorithm 3 Compute Alignment.
    AlignmentGen ← ""
    AlignmentRead ← ""
    i ← length(g_seq)
    j ← length(read)
    while i > 0 and j > 0 do
        if matrix[i][j] = matrix[i-1][j-1] + weight(i, j) then
            AlignmentGen ← AlignmentGen + g_seq[i-1]
            AlignmentRead ← AlignmentRead + read[j-1]
            i ← i - 1
            j ← j - 1
        else if matrix[i][j] = matrix[i][j-1] + gap() then
            AlignmentGen ← AlignmentGen + "-"
            AlignmentRead ← AlignmentRead + read[j-1]
            j ← j - 1
        else
            AlignmentGen ← AlignmentGen + g_seq[i-1]
            AlignmentRead ← AlignmentRead + "-"
            i ← i - 1
        end if
    end while
    reverse(AlignmentGen)
    reverse(AlignmentRead)
tracing back the matrix starting from the last cell to be filled with the scores, i.e., the bottom-right
cell. From there it moves, according to the score values, in three possible directions: diagonally
(towards the top-left corner of the matrix), in which case the bases from the two sequences are
aligned; left, in which case we assume an insertion relative to the genome and a gap is introduced
in the genomic subsequence; or up, in which case we assume a deletion occurred and a gap is
introduced in the read sequence. Once the top-left cell is reached the alignment is complete,
and since the sequences are aligned backwards, the resulting strings must be reversed.
The following classes implement the abstract methods weight(i, j), which scores the aligned
characters, and gap(), the value added when a character aligns with a gap, used in Algorithms
2 and 3. Since in the matrix the values for the genomic sequence start at (1, 0) and for the
read at (0, 1), the matrix cell (i, j) corresponds to the alignment between the characters
g_seq[i-1] and read[j-1].
NW
Corresponds to a simple implementation of the Needleman-Wunsch algorithm in which gap()
returns the value -1 and weight(i, j) returns 1 when g_seq[i-1] matches read[j-1], or -1 in case
of a mismatch.
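A self-contained sketch of this scoring scheme (the class name is ours; the thesis implementation operates on Read and Genome objects rather than plain strings):

```java
public class SimpleNWSketch {

    static int weight(char a, char b) { return a == b ? 1 : -1; }  // match 1, mismatch -1
    static int gap() { return -1; }

    // Fills the Needleman-Wunsch matrix as in Algorithms 1 and 2 and
    // returns the optimal global alignment score.
    public static int score(String gSeq, String read) {
        int n = gSeq.length(), m = read.length();
        int[][] matrix = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) matrix[i][0] = -i;   // leading gaps in the read
        for (int j = 0; j <= m; j++) matrix[0][j] = -j;   // leading gaps in the genome
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int match = matrix[i - 1][j - 1]
                        + weight(gSeq.charAt(i - 1), read.charAt(j - 1));
                int insertion = matrix[i][j - 1] + gap();
                int deletion = matrix[i - 1][j] + gap();
                matrix[i][j] = Math.max(match, Math.max(insertion, deletion));
            }
        }
        return matrix[n][m];
    }
}
```

Tracing back through the same matrix, as in Algorithm 3, recovers the aligned strings.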
NW plus Similarity Matrix
In this class, the Needleman-Wunsch method is enriched with a similarity matrix (Figure 3.6
(a)), which means a perfect match has the value of 2 and a match between similar bases scores
1. Thus, if a transition (Figure 3.5) occurs in the alignment it is not dismissed as a mismatch
that lowers the alignment score, but contributes to the global score. In other words, this method
allows for mutations due to similar base exchange in the alignment.
The method gap() returns the value -1, and the method weight(i, j) returns a value following
the similarity matrix (Figure 3.6 (a)), where a perfect match has the value of 2 and a mismatch
-1. Because we take base similarity into account in the alignment, if an 'N' occurs either in the
reference sequence or in the read the similarity matrix returns zero.
GNUMAP
This version of the alignment method is based on GNUMAP (Clement et al., 2010), where
gap() returns the value of -4 and the weight is calculated taking into account the probabilities
of the four bases and a simple matrix (Figure 3.6 (b)) (mostly to ease the implementation, since
it follows the same score system of the first version, NW):

    weight(i, j) = Σ_{b ∈ {A, C, G, T}} P_b × cost(g_seq[i-1], b)

where cost(g_seq[i-1], b) is retrieved from the matrix.
(a)
         A    G    C    T
    A    2    1   -1   -1
    G    1    2   -1   -1
    C   -1   -1    2    1
    T   -1   -1    1    2

(b)
         A    G    C    T
    A    1   -1   -1   -1
    G   -1    1   -1   -1
    C   -1   -1    1   -1
    T   -1   -1   -1    1

Figure 3.6: Similarity Matrices. (a) Matrix used in the subclass "NW plus Similarity Matrix"; (b) matrix for the GNUMAP-based implementation of the Needleman-Wunsch method. In both cases, the implementation of the matrix returns zero when an "N" appears in the alignment.
From the Read object we have an array of Probabilities objects with the probabilities of each
of the four bases being the correct one. The method weight(i, j) sums all these probabilities,
each weighted by the alignment score retrieved from the matrix in Figure 3.6 (b). This means
this version of the alignment method does not simply dismiss a mismatch; it assumes all the
bases have a chance of being the right one.
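The probabilistic weight above reduces to a short loop. The sketch below uses our own names and a plain probability array in place of the Probabilities objects:

```java
public class ProbWeightSketch {

    private static final char[] BASES = {'A', 'C', 'G', 'T'};

    // Cost from the matrix in Figure 3.6 (b): 1 for a match, -1 otherwise;
    // 0 when an 'N' is involved.
    static int cost(char genomeBase, char b) {
        if (genomeBase == 'N') return 0;
        return genomeBase == b ? 1 : -1;
    }

    // weight(i, j) = sum over b in {A, C, G, T} of P_b * cost(g_seq[i-1], b)
    public static double weight(char genomeBase, double[] probs /* P_A, P_C, P_G, P_T */) {
        double w = 0.0;
        for (int b = 0; b < BASES.length; b++) {
            w += probs[b] * cost(genomeBase, BASES[b]);
        }
        return w;
    }
}
```

A confident, correct call (e.g. P_A = 1 against genome base A) gives weight 1, matching the simple NW score, while uncertain calls produce intermediate values.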
3.2.7 Combiner
At the supporting class ReadProcessing, an instance of the class Combiner is created using the
Genome and Read objects and the k parameter. The class Combiner is responsible for managing
the connections between the components of the pipeline (Figure 3.1). To accomplish this we
implemented the method combine, which is invoked with the names of the classes that
implement the algorithms to be used to materialise the modules Exploder and Aligner; these
names are taken from the .properties file. Therefore, this method drives the mapping process
by calling the following methods, also from the class Combiner, to combine the algorithms:
• getKeys: receives the name of the algorithm for exploding keys (a String object), the
Read object and k; then an instance of the class Exploder is created to invoke its methods
executeExplosion and explodeKeys. Afterwards, an ArrayList of SimpleEntry<String,
Integer> is returned with the keys created and their respective positions in the read;
• keySearch: takes the ArrayList from the previous method and the Genome object. The
reference genome is searched for each key and the resulting hit positions are returned
in an ArrayList<Integer>. The position of each key within the read is taken into account
in the search, i.e., if the search key came from read position 5, the genomic position (p)
returned would be p - 5. A HashSet (Oracle Corporation, 2015f) is used to temporarily
store the genomic positions found, to avoid repeated genomic locations for the same
read;
• computeAlignment: with the alignment algorithm name, the Genome and the Read
objects and the ArrayList returned from keySearch, this method computes the alignment.
First, for each genomic position found, a SimpleEntry composed of a subsequence of the
genome to be aligned (a character array) and its respective position (an Integer) is
retrieved with the method genSeq from the class Genome; then an Aligner instance is
created with the given name, the key of the SimpleEntry<String, Integer> and the Read.
The alignment result is recorded in a new instance of AlignerResult, which is created with
the read header from the FASTQ file, the sequence obtained from the alignment, its
genomic position (the value of the SimpleEntry) and its score.
PositionScore is another supporting class created to manage the alignment results; it has
the same parameters as AlignerResult, but allows showing the results according to the position
found in the genome or the score obtained for the alignment. PositionScore is an abstract class
with the following concrete implementations:
• PositionScore_S: implements the methods to compare the scores obtained for the
alignment; and
• PositionScore_P: compares the results by the positions found in the genome.
Both rely on the interface Comparable<T> (Oracle Corporation, 2015k) in the implementation.
Two different implementations of the Combiner were created, each having a different
implementation of the results_display() abstract method:
• GNUMAP_Comb: organises the results considering the score processing method provided
by GNUMAP (Clement et al., 2010), wherein the scores are normalised and only the ones
greater than a given threshold value are displayed. The scores are sorted by the position of
the alignment, so the PositionScore objects were instantiated as PositionScore_P and
stored in a TreeSet (Oracle Corporation, 2015j). The threshold value is retrieved from
the .properties file using an instance of LLCProperties;
• LLC_Comb: first, the results are sorted by score and then the top scores are searched
and ordered by position in the genome. Both subclasses of PositionScore were used to
instantiate the objects, which were stored in TreeSet structures. The top value is
retrieved from the .properties file using an instance of LLCProperties.
Finally, the method getResults() returns the ArrayList of PositionScore objects sorted as described
above.
3.2.8 Abstract Classes Instantiation
To ease the creation and usability of new subclasses for the abstract classes Exploder, Aligner
and Combiner, the instantiation of their objects only requires the name of the chosen subclass
(and the respective arguments). Thus, the respective constructor, obtained with the Java call
"Class.forName(name).getConstructors()[0]" (Oracle Corporation, 2015c), and the class
Constructor<T> (Oracle Corporation, 2015d), with its method newInstance, dynamically
instantiate the class, provided that it exists, passing to its constructor the required arguments
(an array of Objects (Oracle Corporation, 2015g)).
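The mechanism reduces to a few lines of reflection. A minimal sketch (the class and method names are ours):

```java
import java.lang.reflect.Constructor;

public class DynamicInstantiationSketch {

    // Instantiates a class by name using its first public constructor,
    // passing the given arguments, as described above.
    public static Object instantiate(String className, Object[] args) throws Exception {
        Constructor<?> ctor = Class.forName(className).getConstructors()[0];
        return ctor.newInstance(args);
    }
}
```

Adding a new Exploder, Aligner or Combiner subclass then only requires placing its name in the .properties file; no call sites need to change.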
Chapter 4
Results and Discussion
In the previous chapter, the read mapping pipeline created in this thesis was introduced as
standing on the paradigm of modular programming. This feature enables plugging in the
different algorithms implemented to generate search keys from a read and to align that read to
a candidate region of the reference sequence.
Therefore, to create an Exploder object we have the algorithms implemented in the classes:
Exploder 00, which returns a k-sized subsequence from the beginning of a read; Exploder 0,
where the search key created corresponds to the best key, the k-sized subsequence of a read
with the best quality values; Exploders 1, 2, 3 and 4, which use the best key as a template to
generate new search keys relying on base permutation, taking into account base similarity
and/or quality values; and GNUMAP, where the search keys are created by dividing the reads
into overlapping k-sized subsequences, as in GNUMAP (Clement et al., 2010). As for the
alignment task, three versions of the Needleman-Wunsch method were implemented: a simple
one (NW); the NW plus SM, where the method is enriched with a base similarity matrix; and
the GNUMAP-based NW, which considers the probability of each base being the correct one in
the alignment (Clement et al., 2010).
We tested our read mapping pipeline to evaluate its performance regarding scalability
(the time required to execute the mapping as the number of reads to map grows), coverage
(the percentage of reads that are effectively mapped) and precision (mapping reads to the
correct location in the reference genome). In this chapter, we present the results obtained for
three simulated datasets and draw the most relevant observations. Additionally, we executed
the prototype for a strain of Escherichia coli with real data.
First of all, the simulated datasets were created resorting to the ArtificialFastqGenerator
(Frampton and Houlston, 2012), which takes a reference sequence as input and outputs artificial
FASTQ files and .readStartIndexes files, which provide the positions of the reference sequence from
where the reads were retrieved. With ArtificialFastqGenerator we can use real Phred base quality
scores from existing FASTQ files and simulate sequencing errors. Hence, from the sequence of
Mus musculus (house mouse) chromosome 19¹, which has over 61 mega-base-pairs, a FASTQ
file with 1 494 305 reads of 100 bases was generated with a coverage mean peak of 10, i.e.,
the peak coverage mean for a region of the sequence. In addition, the run SRR000868 from the
454 sequencing of the Escherichia coli UTI89 (Chen et al., 2006) genomic fragment library²,
which corresponds to a FASTQ file, was used to retrieve the real base quality scores, and
sequencing errors were simulated. Afterwards, from the FASTQ file created we made three
datasets of different sizes (1000, 2000 and 3000) of randomly selected reads; these datasets were the
input to test our read mapping tool. Note that we use artificially generated read sets as this is
the only alternative to allow us to compute the precision of the algorithm (as we effectively know
the correct position of each mapped read).
Since we want to compare the probabilistic variant of the Needleman-Wunsch algorithm,
created for GNUMAP (Clement et al., 2010) to accurately map reads with lower confidence val-
ues, with simpler versions of the alignment method, the reads were not preprocessed to remove
the ones with lower quality. For the tests, we set the cutoff value, required by the algorithms of
the Exploder, at 0.90 and used a top 3 to display the results. Exploder 1 was excluded because
it generates 4^k keys, where k equals 10 in our tests, which implies a huge processing power.
¹ Mus musculus chromosome 19 sequence: http://www.ebi.ac.uk/ena/data/view/CM001012
² Run SRR000868: http://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR000868
4.1 Scalability
To see how each combination of algorithms for the Exploder and the Aligner components of
the pipeline perform in terms of scalability, we plotted two types of graphics: one that relates
runtime, in hours, with the number of reads in each dataset (Figures 4.1 to 4.6) and another
that compares the combinations for the three datasets (Figures 4.7 to 4.12). We executed our
read mapping tool on two machines with different features:
• Machine 1: has 63 AMD Opteron(TM) 6272 processors with 1400.000 MHz of speed and
63 Gb of memory;
• Machine 2: has 23 Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz processors with 2400.398
MHz of speed and 62 Gb of memory.
Thus, although Machine 2 has fewer processor units than Machine 1, it requires less time to
execute our tool (Figures 4.1 to 4.12), as each processor is faster, which shows that the tool is
CPU intensive (and CPU bound).
From Figures 4.1 to 4.6 we can observe a clear relation between the number of reads
and the runtime; notwithstanding, the 2000 reads dataset required more time than expected
(when compared with the linear relation of the other datasets) to be processed when using
Exploder 2 to generate the search keys (Figures 4.9 and 4.10). The algorithm responsible for
this outlier relies on the quality scores, producing three new keys for each position below the
cutoff, so we may infer that some of the reads of this set had lower quality scores, resulting in
more keys created, which leads the algorithm to inspect more locations of the reference genome
and potentially align the read to more locations. Because we did not preprocess the reads to
discard the ones with lower scores, we believe this is the case.
On the other hand, the scalability for the algorithms Exploders 00, 0, 2 and 4 is very similar
across the datasets, with the exception explained above (Figures 4.1 to 4.12); this could mean
that the best key used in Exploders 0, 2 and 4 simply corresponds, in our case, to the first 10
bases (the simple key). However, Exploder 4 only creates one new key for each position below
the cutoff value, i.e., it generates one third of the keys when compared with Exploder 2,
reducing the search space within the reference sequence.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.1: Runtime vs Number of Reads (NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the simple implementation of the Needleman-Wunsch method for the alignment. Results from Machine 1.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.2: Runtime vs Number of Reads (NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the simple implementation of the Needleman-Wunsch method for the alignment. Results from Machine 2.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.3: Runtime vs Number of Reads (NW plus SM). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the Needleman-Wunsch method enriched with a Similarity Matrix for the alignment. Results from Machine 1.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.4: Runtime vs Number of Reads (NW plus SM). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the Needleman-Wunsch method enriched with a Similarity Matrix for the alignment. Results from Machine 2.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.5: Runtime vs Number of Reads (GNUMAP-based NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the GNUMAP-based Needleman-Wunsch method for the alignment. Results from Machine 1.
[Plot: Runtime (Hour) vs Number of Reads; series: Exploder 00, Exploder 0, Exploder 2, Exploder 3, Exploder 4, GNUMAP.]
Figure 4.6: Runtime vs Number of Reads (GNUMAP-based NW). Relation between runtime, in hours, and number of reads for each generating search keys algorithm using the GNUMAP-based Needleman-Wunsch method for the alignment. Results from Machine 2.
[Bar chart: Runtime (Hour) per Exploder for 1000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.7: Runtime vs Exploder and Aligner Combination (1000 reads). Time, in hours, required to execute the pipeline for 1000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 1.
[Bar chart: Runtime (Hour) per Exploder for 1000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.8: Runtime vs Exploder and Aligner Combination (1000 reads). Time, in hours, required to execute the pipeline for 1000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
[Bar chart: Runtime (Hour) per Exploder for 2000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.9: Runtime vs Exploder and Aligner Combination (2000 reads). Time, in hours, required to execute the pipeline for 2000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 1.
[Bar chart: Runtime (Hour) per Exploder for 2000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.10: Runtime vs Exploder and Aligner Combination (2000 reads). Time, in hours, required to execute the pipeline for 2000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
[Bar chart: Runtime (Hour) per Exploder for 3000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.11: Runtime vs Exploder and Aligner Combination (3000 reads). Time, in hours, required to execute the pipeline for 3000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 1.
[Bar chart: Runtime (Hour) per Exploder for 3000 reads; series: NW, NW plus SM, GNUMAP-based NW.]
Figure 4.12: Runtime vs Exploder and Aligner Combination (3000 reads). Time, in hours, required to execute the pipeline for 3000 reads. Each combination between the generating search keys algorithms (Exploder, horizontal axis) and the versions of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
The Exploder 3 algorithm produces 2^k keys; therefore, since k equals 10, for each read
this algorithm returns 1024 search keys, resulting in greater increases between datasets
(Figures 4.1 to 4.6) and in a higher runtime than with the other key exploding algorithms
(Figures 4.7 to 4.12). The GNUMAP-based exploding algorithm creates as many keys as the
size of the read; our reads have 100 bases, meaning we get 100 keys per read, leading to a
significantly lower runtime when compared with the previous algorithm. Because Exploder 2
required less time than the GNUMAP-based algorithm, one can assume fewer keys were
generated by the former for all the reads of the datasets, even for the one with 2000 reads,
suggesting that only a few reads were responsible for the runtime spikes observed (Figures 4.9
and 4.10).
Concerning the algorithms for the Aligner module, Figures 4.7 to 4.12 show that the
combinations with the GNUMAP-based Needleman-Wunsch method take considerably less
time to execute the mappings. Moreover, the implementation of the method with the similarity
matrix (NW plus SM) slightly increases the runtime when compared to the simple
implementation, which could be due to the access time to the matrix. Some results from
Machine 1 (Figures 4.7, 4.9 and 4.11) contradict this tendency; however, since the results from
Machine 2 (Figures 4.8, 4.10 and 4.12) follow it, we may assume these outliers have a technical
reason related to the machine used. On the other hand, only a few of the implemented methods
differ between the versions of the Needleman-Wunsch algorithm; thus, we need further tests to
understand what makes the GNUMAP-based NW require a lower runtime.
Overall, our tool needs to have its scalability improved. The human genome has over 3
giga-base-pairs and billions of reads are produced by NGS platforms from a single sample; if
our read sets take such times to have their reads mapped to a comparatively small reference
sequence, then our read mapping pipeline has a narrow spectrum of utility. However, since the
prototype is CPU bound, distributing the work between more machines, as in a cloud computing
platform, may be a solution.
4.2 Coverage
As expected, since we used simulated data, the three datasets obtained 100% coverage, i.e., all
the reads were mapped to their original sequence, for each combination of search key explosion
algorithms (Exploder) and variants of the Needleman-Wunsch method (Aligner).
4.3 Precision
Though precision concerns mapping the reads to their true positions in the genome, we
analysed the performance of the various algorithmic combinations for the Exploder and
Aligner components of our tool considering:
• Multiple Locations: when a read was mapped to more than one location;
• Possible Locations: from the multiple locations found for a read, if more than one results
from an alignment with a score higher than 0.85, we consider these "extra" locations as
possible locations for the read. This aspect has greater significance when dealing with real
data, for which we do not know where each read originated; due to repetitive DNA
sequences, this is a major challenge in NGS data analysis;
• Best Location: if a read is mapped to one genomic position with a score over 0.85, then it
is the best location found; at best, within our simulated datasets this corresponds to the
original location of the read in the reference sequence;
• Incorrectly Mapped: the position(s) returned do not match the original one within the
reference sequence.
The results obtained for these parameters are summarised in Figures 4.13 to 4.15. Some
observations can be drawn from these figures: first, approximately 100% of the reads were
mapped to Multiple Locations regardless of the combination of algorithms executed. A relevant
number of reads within each dataset was mapped to a Best Location. Despite the multiple
locations result, a small proportion of reads was mapped to other Possible Locations.
                       Exp. 00  Exp. 0  Exp. 2  Exp. 3  Exp. 4  GNUMAP
NW
  Best Location            893     893     928     886     909     908
  Incorrectly Mapped        69      69      31      43      51       0
  Multiple Locations      1000    1000    1000    1000    1000    1000
  Possible Locations        36      36      38      70      37      89
NW plus SM
  Best Location            886     886     921     869     903     880
  Incorrectly Mapped        69      69      31      43      51       0
  Multiple Locations      1000    1000    1000    1000    1000    1000
  Possible Locations        43      43      46      88      44     118
GNUMAP-based NW
  Best Location            898     898     934     905     916     931
  Incorrectly Mapped        69      69      31      43      51       0
  Multiple Locations      1000    1000    1000    1000    1000    1000
  Possible Locations        25      25      26      45      25      60

Figure 4.13: Mapping Results for 1000 reads. The number of reads for which more than one location was found is under Multiple Locations; of these, Possible Locations are the ones that scored over 0.85 in the alignment; if only one location with a score higher than 0.85 was found, it is the Best Location. Incorrectly Mapped are the reads that did not map to their original position in the reference sequence. The columns correspond to the search key generation algorithms (Exploders 00, 0, 2, 3 and 4 and the GNUMAP-based algorithm); each block corresponds to one of our variants of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW). Results for the 1000 simulated reads dataset.
                       Exp. 00  Exp. 0  Exp. 2  Exp. 3  Exp. 4  GNUMAP
NW
  Best Location           1785    1785    1848    1786    1803    1837
  Incorrectly Mapped       138     138      73      97     120       0
  Multiple Locations      1997    1997    1997    2000    1997    2000
  Possible Locations        76      76      76     117      76     160
NW plus SM
  Best Location           1766    1766    1830    1751    1784    1796
  Incorrectly Mapped       138     138      73      97     120       0
  Multiple Locations      1997    1997    1997    2000    1997    2000
  Possible Locations        96      96      96     154      96     203
GNUMAP-based NW
  Best Location           1804    1804    1865    1810    1820    1881
  Incorrectly Mapped       138     138      73      97     120       0
  Multiple Locations      1997    1997    1997    2000    1997    2000
  Possible Locations        49      49      49      81      49     104

Figure 4.14: Mapping Results for 2000 reads. The number of reads for which more than one location was found is under Multiple Locations; of these, Possible Locations are the ones that scored over 0.85 in the alignment; if only one location with a score higher than 0.85 was found, it is the Best Location. Incorrectly Mapped are the reads that did not map to their original position in the reference sequence. The columns correspond to the search key generation algorithms (Exploders 00, 0, 2, 3 and 4 and the GNUMAP-based algorithm); each block corresponds to one of our variants of the Needleman-Wunsch method (NW, simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW). Results for the 2000 simulated reads dataset.
4. Results and Discussion
[Bar charts: Number of Reads per Exploder and Aligner combination, in three panels (NW, NW plus SM, GNUMAP-based NW); series: Best Location, Incorrectly Mapped, Multiple Locations, Possible Locations.]
Figure 4.15: Mapping Results for 3000 reads. Reads for which more than one location was found are counted under Multiple Locations; of these, Possible Locations are those that scored over 0.85 at the alignment. If only one location with a score higher than 0.85 was found, it is the Best Location. The Incorrectly Mapped reads are those that did not map to their original position in the reference sequence. The horizontal axis represents each combination of the search key generation algorithms (Exploders 00, 0, 2, 3, and 4, and the GNUMAP-based algorithm) with our variants of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW). Results for the 3000 simulated reads dataset.
To clearly see the influence of each combination on the mapping aspects, the results reported in Figures 4.16 and 4.17 were plotted as the ratio of reads within each dataset for which possible locations and a best location, respectively, were found. Finally, the number of Incorrectly Mapped reads appears to be related only to the algorithm chosen for the Exploder component of the read mapping pipeline, as it remains equal across the different implementations of the Needleman-Wunsch method. Therefore, to see the rate of incorrectly mapped reads for each dataset, we present the results for the exploding keys algorithms in Figure 4.18.
[Charts: % Possible Locations (0% to 15%, vertical axis) vs Number of Reads (1000, 2000, 3000), in three panels (NW, NW plus SM, GNUMAP-based NW); series: Exploders 00, 0, 2, 3, and 4, and GNUMAP.]
Figure 4.16: Rate of Reads with other Possible Locations (%). Reads mapped to more than one location with an alignment score over 0.85 have other Possible Locations, despite having just one original position in the reference sequence. These results represent the rate of possible locations found with each combination of the search key generation algorithms with our implementations of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW).
[Charts: % Best Location (85% to 95%, vertical axis) vs Number of Reads (1000, 2000, 3000), in three panels (NW, NW plus SM, GNUMAP-based NW); series: Exploders 00, 0, 2, 3, and 4, and GNUMAP.]
Figure 4.17: Rate of Reads with a Best Location found (%). A Best Location was found for reads mapped to one location with an alignment score over 0.85. These results represent the rate of best locations found with each combination of the search key generation algorithms with the implementations of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, enriched with a Similarity Matrix; and GNUMAP-based NW).
Mapping results depend on the reads in a dataset; however, from Figure 4.16 we can observe that the best key used by Exploder 0 leads to the same number of possible locations as the simple key returned by Exploder 00. Moreover, even when Exploder 4 assigned more than one key to a read, the extra keys did not improve the search for other locations within the reference sequence. With Exploder 2 we had little contribution from the generated keys to finding other locations for the (simulated) reads. Yet, producing keys with Exploder 3 and the GNUMAP-based algorithm seems to increase the proportion of mapped reads that may belong to more than one place, with the latter algorithm being the greatest contributor to this result, with an increase of up to 7%.
Furthermore, the number of reads mapped to more than one position with a relevant score depends on the version of the Needleman-Wunsch method chosen for the Aligner (Figure 4.16). For each dataset, the number of possible locations found with the method enriched with the similarity matrix (NW plus SM) almost doubles relative to the GNUMAP-based NW, reaching 12% of reads with other possible locations (the simple implementation, NW, falls in between). Since the alignment with a base similarity matrix is more tolerant to base mutations, it makes sense that this strategy found more possible locations for the reads. This is a particularly interesting result when dealing with real data that may contain nucleotide variations.
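To illustrate why a similarity-based scorer tolerates mutations, consider that transitions (A↔G, C↔T) are the most frequent point mutations, so they can be penalised less than transversions. The sketch below is purely illustrative; the class name and score values are not the actual matrix used in our NW plus SM implementation.

```java
// Hypothetical base-similarity scorer (illustrative values, not the thesis's
// actual matrix): transitions keep half credit, transversions score zero,
// so reads carrying common point mutations retain a high alignment score.
public class BaseSimilarity {
    private static boolean isPurine(char b) { return b == 'A' || b == 'G'; }

    /** Returns 1.0 for a match, 0.5 for a transition, 0.0 for a transversion. */
    public static double score(char a, char b) {
        if (a == b) return 1.0;
        // Same class (purine/purine or pyrimidine/pyrimidine) means a transition.
        return (isPurine(a) == isPurine(b)) ? 0.5 : 0.0;
    }
}
```

With such a scorer, a read differing from the reference only by transitions still reaches a score above the 0.85 threshold more easily than under a plain match/mismatch scheme.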
Meanwhile, more reads found their best location (Figure 4.17) with the GNUMAP-based NW, especially when combined with the GNUMAP-based exploding keys algorithm; for instance, this combination resulted in 94% of the 3000 reads set finding their best location. NW and NW plus SM led to higher results when combined with Exploder 2, with increases of 4%, but when discovering the best position for the reads within a dataset they performed a bit worse than the GNUMAP-based NW, the NW plus SM implementation being the least recommended for this task.
Figure 4.17 allows us to reflect further on the effect of expanding the search space within the reference sequence by assigning multiple keys to the reads. For instance, with Exploders 00 and 0 we get the same results, confirming that our best key lies within the first k bases, i.e., our simple key. Exploder 2 increases the chance of discovering a best genomic position for a read, and we can observe in Figure 4.17 a decrease for the dataset with 2000 reads for this algorithm and for Exploder 4. These two algorithms rely on quality scores to create new keys; thus, as mentioned previously, we can infer that some reads of the 2000 reads dataset have lower quality scores within the first 10 (our value for k) bases. Hence the increase in runtime when resorting to Exploder 2 to explode the keys for this dataset, compared to Exploders 00, 0, and 4. The GNUMAP-based exploding keys algorithm only outperforms Exploder 2 when combined with the GNUMAP-based NW, and only for the 2000 and 3000 reads datasets. Further tests with different datasets may allow conclusions about the ability of Exploder 2 to find the best location for a read when resorting to the GNUMAP-based NW for the alignment.
Generating new keys by switching each base of the best key according to base similarity, as we did with Exploder 3, did not do much to find the best location for a read, unless it was combined with the GNUMAP-based NW, in which case it performed a little better than relying on the first 10 bases alone; this effect is most noticeable in the results from the 3000 reads dataset.
[Chart: % Incorrectly Mapped (0% to 8%, vertical axis) vs Number of Reads (1000, 2000, 3000); series: Exploders 00, 0, 2, 3, and 4, and GNUMAP.]
Figure 4.18: Rate of Incorrectly Mapped Reads (%). The Incorrectly Mapped reads are those that did not map to their original position in the reference sequence. Since the number of incorrectly mapped reads seems to be related to the search key generation algorithm used, these results represent the rate for each Exploder implementation.
Finally, from Figure 4.18 we can observe that producing keys by passing a sliding window along a read and retrieving k-sized subsequences as keys, as we did with the GNUMAP-based algorithm, results in 100% precision, i.e., all the reads were mapped to their original positions. With Exploder 2 the ratio of Incorrectly Mapped reads increases, followed by Exploder 3 and then by Exploder 4. With these three algorithms we had a 1% increase in the rate for the dataset with 2000 reads, supporting our observation that this dataset has a higher number of reads with lower quality scores within the first 10 bases. Therefore, in terms of precision, too many keys cause worse results. Using Exploders 00 and 0, which rely only on the first 10 bases to search within the reference sequence, resulted in more incorrectly mapped reads when compared with the previous algorithms.
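The sliding-window idea behind the GNUMAP-based Exploder can be sketched as follows; the class and method names are illustrative, not our actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the GNUMAP-based exploder: slide a window of size k along the
// read and keep every k-sized subsequence as a search key, yielding
// (read length - k + 1) keys per read.
public class SlidingWindowExploder {
    public static List<String> explode(String read, int k) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i + k <= read.length(); i++) {
            keys.add(read.substring(i, i + k));
        }
        return keys;
    }
}
```

Because every position of the read contributes a key, a sequencing error or low-quality stretch in the first k bases no longer prevents the read from reaching its true genomic location, which is consistent with the 100% precision observed above.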
4.4 Escherichia coli UTI89
From the real data from which we obtained the Phred quality scores, we selected 2000 random reads to map against the sample's complete genome3, a sequence longer than 5 mega-base-pairs.
[Charts: Runtime in hours (vertical axis) per Exploder (horizontal axis) for NW, NW plus SM, and GNUMAP-based NW, with a close-up in minutes for Exploders 00, 0, 2, and 4.]
Figure 4.19: Runtime vs Exploder and Aligner Combination (E. coli UTI89). Time, in hours, required to execute the pipeline for 2000 real reads of E. coli UTI89; a close-up in minutes is included to better show the time variation within some of the combinations. Each combination of the search key generation algorithms (Exploder, horizontal axis) with the versions of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, with a Similarity Matrix; and GNUMAP-based NW) is represented. Results from Machine 2.
Figure 4.19 compares the time required by each algorithm combination to map the real data; only Machine 2 was used to execute the tool for this dataset. Two aspects of the real data have to be taken into account when comparing with the simulated data: the reference genome is smaller than the M. musculus chromosome 19 sequence, and the read length ranges between 100 and 300 bases. Thus, when combined with the simple version of the Needleman-Wunsch method (NW) or the one with the similarity matrix (NW plus SM), the Exploder algorithms that just returned the simple or the best key (Exploders 00 and 0) or relied on base quality scores (Exploders 2 and 4) required very little time to execute. Yet, when creating the search keys with Exploder 3, the time required to map the 2000 real reads was almost as long as with the 2000 simulated reads (Figure 4.10), because for every read it explodes one search key into 210. And the GNUMAP-based algorithm, which generates almost as many search keys as the read has bases, almost doubled the runtime compared with the same amount of simulated data. However, performing the alignment with the GNUMAP-based NW leads to a significantly lower runtime regardless of the exploding keys algorithm. Moreover, these last combinations also required less time to map the real reads than the simulated ones, which may be due to the size of the E. coli genomic sequence.
3Escherichia coli UTI89, complete genome: http://www.ncbi.nlm.nih.gov/nuccore/91209055?report=fasta
From the close-up presented in Figure 4.19 we observe that Exploders 2 and 4 required more time than Exploders 00 and 0 to execute when combined with NW and NW plus SM, meaning that some reads have lower quality scores and, consequently, additional keys were produced (since Exploder 2 generates more search keys than Exploder 4 from the same positions, the runtime gap between the two algorithms follows). In addition, this confirms our assumption that reads with lower quality explain the spike seen in Figures 4.2 to 4.6 concerning Exploder 2.
We also analysed the coverage obtained and other parameters related to mapping; however, since with real data the goal is to find the (unknown) true location of the reads within the genome, precision could not be evaluated (Figure 4.20). In terms of coverage, increasing the number of search keys improves the chance of mapping every read, as we can see with Exploder 3 and the GNUMAP-based algorithm, which led to a coverage of 100%. Despite the number of multiple positions found, the tool only succeeded in finding the best and other possible locations for the reads when the GNUMAP-based algorithm was chosen for the Exploder component. As seen before (Figure 4.16), we obtained more reads with other possible positions with the NW plus SM strategy for the Aligner, reaching 11% of the read set; in this case, however, this strategy also improved the proportion of reads finding a best location, to 28%. Since these results were obtained considering a base similarity matrix, reads with genetic variations (such as single nucleotide polymorphisms (SNPs)) did not have their alignment scores penalised and may have been mapped to their true locations.
[Bar charts: Number of Reads per Exploder and Aligner combination, in three panels (NW, NW plus SM, GNUMAP-based NW); series: Best Location, Multiple Locations, Possible Locations, Mapped.]
Figure 4.20: Mapping Results for E. coli UTI89. Reads for which more than one location was found are counted under Multiple Locations; of these, Possible Locations are those that scored over 0.85 at the alignment. If only one location with a score higher than 0.85 was found, it is the Best Location. Since this is a real dataset, we also show the number of reads Mapped to the E. coli UTI89 genome. The horizontal axis represents each combination of the search key generation algorithms (Exploders 00, 0, 2, 3, and 4, and the GNUMAP-based algorithm) with our variants of the Needleman-Wunsch method (NW, the simple implementation; NW plus SM, with a Similarity Matrix; and GNUMAP-based NW). Results for 2000 real reads.
Concerning the results presented in Figure 4.20, we assume the key size (k) plays an important role when only a subsequence of the read is used to create new keys. A relevant proportion of reads were mapped to multiple locations, but since the alignment scores did not exceed the 0.85 threshold, virtually no read was considered effectively mapped to best and
possible locations. On the other hand, using the entire read to create new keys, as performed by the GNUMAP-based strategy, instead of just permuting the bases within a k-sized subsequence, helps to find locations with relevant scores. However, these 2000 reads were randomly selected from a bigger dataset; until we improve our tool's scalability and test the algorithms with all the data, we cannot draw conclusions regarding performance with real data.
Chapter 5
Conclusions and Future Work
To overcome the challenges brought by the data produced with NGS technologies, we developed the read mapping tool presented in Chapter 3. Our approach follows the 'seed and extend' paradigm: the reference genome is hashed, and multiple search keys for each read are used to find the candidate genomic locations. To find these locations, we implemented four algorithms that generate multiple search keys for one read, taking into account the Phred quality values and/or nitrogenous base similarity, and one that splits the read into overlapping subsequences of equal size, as in GNUMAP (Clement et al., 2010). For the extension of the seeds, we implemented a simple version of the Needleman-Wunsch method (Needleman and Wunsch, 1970) to align the read with a region of the genome, a version where a similarity matrix is used to score the matches between the sequences, and the variant used in GNUMAP (Clement et al., 2010).
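The core of the extension step, global alignment by dynamic programming, can be sketched as follows. This is a minimal illustration of the simple Needleman-Wunsch variant; the scoring values (match +1, mismatch -1, gap -1) are illustrative and not necessarily those used in our Aligner.

```java
// Minimal Needleman-Wunsch sketch: cell (i, j) of the DP table holds the best
// global alignment score of the first i read bases against the first j
// reference bases. Scoring constants are illustrative.
public class NeedlemanWunsch {
    static final int MATCH = 1, MISMATCH = -1, GAP = -1;

    public static int align(String read, String ref) {
        int n = read.length(), m = ref.length();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) dp[i][0] = i * GAP; // read prefix vs. nothing
        for (int j = 1; j <= m; j++) dp[0][j] = j * GAP; // nothing vs. ref prefix
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = dp[i - 1][j - 1]
                        + (read.charAt(i - 1) == ref.charAt(j - 1) ? MATCH : MISMATCH);
                // Best of: substitute/match, gap in reference, gap in read.
                dp[i][j] = Math.max(sub,
                        Math.max(dp[i - 1][j] + GAP, dp[i][j - 1] + GAP));
            }
        }
        return dp[n][m];
    }
}
```

The NW plus SM variant differs only in replacing the fixed MATCH/MISMATCH constants with a lookup in the base similarity matrix.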
Based on a pipeline, our tool follows the paradigm of modular programming, enabling various algorithmic combinations to be plugged in between the Exploder, the component of the pipeline responsible for creating the search keys for the reads, and the module in which the alignment between the read and the reference sequence occurs (the Aligner). The possible combinations were evaluated in terms of scalability, coverage, and precision with simulated datasets of sizes ranging between 1000 and 3000 reads from Mus musculus, allowing us to infer that our tool cannot keep up with the throughput currently obtained with NGS platforms. However, it performed as expected in terms of coverage, having mapped all the reads from the datasets. The precision of our tool is highly related to the exploding keys algorithm used, and it was flawless with the GNUMAP-based one, which explodes the keys by passing a sliding window along a read and retrieving k-sized subsequences as keys.
Moreover, despite generating more search keys, thus increasing the search space within the reference sequence, Exploder 3 does not lead to better results when finding the best position for a read, nor other possible positions, meaning it requires too high a runtime for the results produced. Despite our read mapping pipeline requiring more time to execute when it resorts to the NW plus SM implementation for the Aligner component, it allowed more possible positions to be found for the reads, especially if combined with the GNUMAP-based exploding keys algorithm, with a 6% increase. Regarding finding the best location to map a read, the combination of the GNUMAP-based implementation of the Needleman-Wunsch method, to align the nucleotide sequences, with the GNUMAP algorithm, for the Exploder module, performed best; for instance, it led 94% of the 3000 reads set to discover it. Still, further tests with other datasets of different sizes would provide more understanding of the effect of Exploder 2 on this task when using that same alignment method.
To sum up, given the various combinations, relying on the GNUMAP-based approach for the Aligner module yields the best results in terms of scalability; if we also generate the keys with the GNUMAP algorithm, we get more reads finding their best position and better precision. By combining this exploding algorithm with the NW plus SM at the Aligner, more possible locations are found for the reads within each dataset. However, mapping reads to multiple locations can lead to false detection of genetic variations, due to repetitive DNA fragments in the reference genome. On the other hand, if we map every read within a set and report their multiple locations, we will have more certainty in the consensus sequence of the sample and in finding SNPs, for example.
We also mapped 2000 real reads from Escherichia coli UTI89, which has a smaller genome than Mus musculus, confirming our observations on scalability, although the read size has a clear influence on the time required to create the search keys with the GNUMAP-based algorithm. However, the best and other genomic locations for the reads were only discovered when the GNUMAP algorithm was chosen for the Exploder. The best results were obtained when it was combined with NW plus SM for the alignment, resulting in 28% of the reads finding their best position and 11% being mapped to other possible locations. As for the coverage, increasing the number of search keys with Exploder 3 and the GNUMAP-based strategy improved the proportion of mapped reads to 100%.
Our tool was implemented in Java, so it can run on all major operating systems, e.g., Windows, Linux, and Mac OS, with a Java runtime environment installed, and the source code is available in a public repository1.
5.1 Future Work
Further studies comparing the implemented versions of the Needleman-Wunsch method may explain the lower runtime of the GNUMAP-based one. Future work must include improving the scalability of our tool and mapping all the reads of an NGS dataset to a complete mammalian genome, such as that of Mus musculus or the human, which may require exploring other alignment options. An advantage of our modular approach is the simple implementation of new algorithms to perform specific tasks, like the alignment, without compromising the rest of the pipeline. This way we can then try the alignment algorithm proposed by Chakraborty and Bandyopadhyay (2013), FOGSAA (Fast Optimal Global Sequence Alignment Algorithm), a tree-based algorithm that claims to obtain the same results as the Needleman-Wunsch method, but much faster, and/or see the effect of the adaptive seeds strategy from the work of Kiełbasa et al. (2011). Once we have the scalability issue under control, we can investigate strategies to map longer reads, such as those promised by third generation sequencing technologies (Wang et al., 2013). Cloud computing can also improve scalability by allowing executions to span an arbitrary number of machines; for this, the Apache™ frameworks Hadoop® (The Apache Software Foundation, 2015b) and Spark™ (The Apache Software Foundation, 2015a) are strong candidate solutions.
Another improvement that we foresee is to store the read alignments against reference sequences in the Sequence Alignment/Map (SAM) format, a generic alignment format that supports short and long reads (up to 128 Mbp) produced by different sequencing platforms (Li et al., 2009a). Today, various aligners2 that read FASTQ files and assign the sequences to a position in a reference sequence output this simple and flexible format. The current definition of the SAM format is at http://samtools.github.io/hts-specs/SAMv1.pdf.
1https://github.com/NatachaPL/LLC-Read-Mapping-Pipeline.git
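For illustration, a SAM alignment line carries 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below builds one such line for a single-end read; the class name and field values are illustrative, not part of our current tool.

```java
// Sketch of a minimal single-end SAM record: the 11 mandatory fields of the
// SAMv1 specification, joined with tabs. RNEXT/PNEXT/TLEN are "*"/0/0 for
// single-end reads. Values are illustrative.
public class SamRecord {
    public static String format(String qname, int flag, String rname, int pos,
                                int mapq, String cigar, String seq, String qual) {
        return String.join("\t", qname, String.valueOf(flag), rname,
                String.valueOf(pos), String.valueOf(mapq), cigar,
                "*", "0", "0", seq, qual);
    }
}
```

Emitting records in this shape would let our tool's output feed directly into downstream SAM/BAM utilities.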
Paired-end reads, i.e., reads sequenced from both ends of the same DNA fragment, can be produced by a variety of sequencing protocols, with a preparation specific to a given sequencing technology (Treangen and Salzberg, 2012). The mapping of these reads requires a maximum distance between them, adding a constraint when finding their genomic locations; consequently, a repetitive read will be reliably mapped if its pair can be mapped unambiguously. Moreover, paired-end alignments outperform single-end alignments in terms of both sensitivity and specificity (Li and Homer, 2010). Hence, adapting our tool to paired-end reads would be a valuable improvement. Likewise, analysing data generated by SOLiD sequencers, i.e., color-space reads, would be an important extension to fulfil our goal of creating a tool able to map reads from every NGS platform. In SOLiD platforms, overlapping pairs of letters are read and given digits ranging from 0 to 3 to encode the colour calls (base transitions) (Rumble et al., 2009); to record these reads with their quality information, Color Space FASTQ (CSFASTQ) files were created (Cock et al., 2010). The reads can be converted into bases, as presented in FASTQ files, but performing the mapping in color space has advantages regarding error detection.
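The two-base encoding can be sketched as follows: with the standard 2-bit base codes (A=0, C=1, G=2, T=3), each colour call is the XOR of the codes of the two overlapping bases, so identical bases always give colour 0. This is a simplified sketch that omits the known primer base prefixing real SOLiD reads.

```java
// Sketch of SOLiD two-base (color-space) encoding: each overlapping base
// pair maps to a color 0-3, computed as the XOR of the bases' 2-bit codes
// (A=0, C=1, G=2, T=3). A read of length n yields n-1 color calls.
public class ColorSpace {
    private static int code(char b) { return "ACGT".indexOf(b); }

    /** Encodes a base-space read into its (length - 1) color calls. */
    public static String encode(String read) {
        StringBuilder colors = new StringBuilder();
        for (int i = 1; i < read.length(); i++) {
            colors.append(code(read.charAt(i - 1)) ^ code(read.charAt(i)));
        }
        return colors.toString();
    }
}
```

Because each base participates in two colour calls, a single sequencing error perturbs two adjacent colours while a true SNP changes them consistently, which is what gives colour-space mapping its error-detection advantage.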
Furthermore, the scope of applications would be broadened by adding to our tool an algorithm to map reads from sequencing coupled to bisulfite conversion (Bisulfite-seq), enabling genome-wide measurement of DNA methylation (Kunde-Ramamoorthy et al., 2014). And, although we noted our concern with mapping a read to multiple locations due to repetitive genomic regions, the analysis of data from sequencing coupled to chromatin immunoprecipitation (ChIP-Seq) relies on finding regions enriched with reads; thus, mappers that require a read to be uniquely placed will not be up to this task (Newkirk et al., 2011). Therefore, for future work we could consider extending our pipeline to data from different sequencing techniques.
2http://seqanswers.com/wiki/SAM
Bibliography
Abu-Doleh, A., Saule, E., Kaya, K., and Çatalyürek, Ü. V. (2013). Masher: Mapping Long(er) Reads
with Hash-based Genome Indexing on GPUs. In Proceedings of the International Conference
on Bioinformatics, Computational Biology and Biomedical Informatics, page 341. ACM.
Adelson-Velskii, M. and Landis, E. M. (1963). An algorithm for the organization of information.
Technical report, DTIC Document.
Ahmadi, A., Behm, A., Honnalli, N., Li, C., Weng, L., and Xie, X. (2012). Hobbes: optimized
gram-based methods for efficient read alignment. Nucleic Acids Research, 40(6):e41.
Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman,
J. O., Baker, C., Malig, M., Mutlu, O., et al. (2009). Personalized copy number and segmental
duplication maps using next-generation sequencing. Nature Genetics, 41(10):1061–1067.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment
search tool. Journal of Molecular Biology, 215(3):403–410.
Anderson, S. (1981). Shotgun DNA sequencing using cloned DNase I-generated fragments. Nuc-
leic Acids Research, 9(13):3015–3027.
Avery, O. T., MacLeod, C. M., and McCarty, M. (1944). Studies on the chemical nature of the
substance inducing transformation of pneumococcal types induction of transformation by a
desoxyribonucleic acid fraction isolated from pneumococcus type III. The Journal of Experi-
mental Medicine, 79(2):137–158.
Baeza-Yates, R. A. and Perleberg, C. H. (1992). Fast and practical approximate string matching.
In Combinatorial Pattern Matching, pages 185–192. Springer.
BioJava (2015). BioJava:CookBook3:FASTQ. <http://biojava.org/wiki/BioJava:CookBook3:
FASTQ#Convert_between_FASTQ_variants> Accessed 12.August.2015.
BioPerl (2015). FASTQ sequence format. <http://www.bioperl.org/wiki/FASTQ_sequence_
format> Accessed 12.August.2015.
Biopython (2015). SeqIO. <http://biopython.org/wiki/SeqIO#File_Formats> Accessed 12.Au-
gust.2015.
BioRuby (2015). Module: Bio::Sequence::QualityScore::Converter. <http://www.rubydoc.
info/github/aunderwo/bioruby/Bio/Sequence/QualityScore/Converter> Accessed 12.Au-
gust.2015.
Bohlander, S. K. (2013). ABCs of genomics. ASH Education Program Book, 2013(1):316–323.
Bravo, H. C. and Irizarry, R. A. (2010). Model-based quality assessment and base-calling for
second-generation sequencing data. Biometrics, 66(3):665–674.
Burrows, M. and Wheeler, D. J. (1994). A block-sorting loss-less data compression algorithm.
SRC Research Report, 124.
Chakraborty, A. and Bandyopadhyay, S. (2013). FOGSAA: Fast optimal global sequence align-
ment algorithm. Scientific Reports, 3.
Chen, S. L., Hung, C.-S., Xu, J., Reigstad, C. S., Magrini, V., Sabo, A., Blasiar, D., Bieri, T., Meyer,
R. R., Ozersky, P., et al. (2006). Identification of genes subject to positive selection in uro-
pathogenic strains of Escherichia coli: a comparative genomics approach. Proceedings of the
National Academy of Sciences, 103(15):5977–5982.
Chen, Y., Schmidt, B., and Maskell, D. L. (2013). A hybrid short read mapping accelerator. BMC
Bioinformatics, 14(1):67.
Chung, W.-C., Chen, C.-C., Ho, J.-M., Lin, C.-Y., Hsu, W.-L., Wang, Y.-C., Lee, D., Lai, F., Huang,
C.-W., and Chang, Y.-J. (2014). CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds
and Analyzing High-Throughput Sequencing Data with MapReduce. PLoS ONE, 9(6).
Clement, N. L., Snell, Q., Clement, M. J., Hollenhorst, P. C., Purwar, J., Graves, B. J., Cairns, B. R.,
and Johnson, W. E. (2010). The GNUMAP algorithm: unbiased probabilistic mapping of oli-
gonucleotides from next-generation sequencing. Bioinformatics, 26(1):38–45.
Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., and Rice, P. M. (2010). The Sanger FASTQ file
format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic
Acids Research, 38(6):1767–1771.
Cohen, J. S. and Portugal, H. (1974). The Search for the chemical structure of DNA. Connecticut
Medicine, 38:551–557.
Collins, D. W. and Jukes, T. H. (1994). Rates of transition and transversion in coding sequences
since the human-rodent divergence. Genomics, 20(3):386–396.
Collins, F. S., Lander, E. S., Rogers, J., and Waterson, R. H. (2004). Finishing the euchromatic
sequence of the human genome. Nature, 431(7011):931–945.
Collins, F. S., Morgan, M., and Patrinos, A. (2003). The Human Genome Project: lessons from
large-scale biology. Science, 300(5617):286–290.
Collins, F. S., Patrinos, A., Jordan, E., Chakravarti, A., Gesteland, R., Walters, L., et al. (1998). New
goals for the US Human Genome Project: 1998-2003. Science, 282(5389):682–689.
Crick, F., Barnett, L., Brenner, S., and Watts-Tobin, R. (1961). General nature of the genetic code
for proteins. Nature, 192:1227.
Crick, F. et al. (1970). Central Dogma of Molecular Biology. Nature, 227(5258):561–563.
Dahm, R. (2010). From discovering to understanding. EMBO reports, 11(3):153–160.
David, M., Dzamba, M., Lister, D., Ilie, L., and Brudno, M. (2011). SHRiMP2: sensitive yet prac-
tical short read mapping. Bioinformatics, 27(7):1011–1012.
Dohm, J. C., Lottaz, C., Borodina, T., and Himmelbauer, H. (2008). Substantial biases in
ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Research,
36(16):e105–e105.
EMBOSS (2015). Sequence Formats. <http://emboss.sourceforge.net/docs/themes/
SequenceFormats.html> Accessed 12.August.2015.
Encodeproject.org (2015). ENCODE: Encyclopedia of DNA Elements. <https://www.
encodeproject.org/> Accessed 16.August.2015.
Ferragina, P. and Manzini, G. (2000). Opportunistic data structures with applications. In Found-
ations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 390–398.
IEEE.
Ferragina, P. and Mishra, B. B. (2014). Algorithms in stringomics (i): Pattern-matching against
"stringomes". bioRxiv, page 001669.
Fonseca, N. A., Rung, J., Brazma, A., and Marioni, J. C. (2012). Tools for mapping high-
throughput sequencing data. Bioinformatics, page bts605.
Frampton, M. and Houlston, R. (2012). Generation of Artificial FASTQ Files to Evaluate the
Performance of Next-Generation Sequencing Pipelines. PLoS ONE, 7(11).
Frazer, K. A. (2012). Decoding the human genome. Genome Research, 22(9):1599–1601.
Freese, E. (1959). The difference between spontaneous and base-analogue induced mutations
of phage T4. Proceedings of the National Academy of Sciences of the United States of America,
45(4):622.
Gotoh, O. (1982). An improved algorithm for matching biological sequences. Journal of Molecu-
lar Biology, 162(3):705–708.
Green, E. D., Guyer, M. S., Institute, N. H. G. R., et al. (2011). Charting a course for genomic
medicine from base pairs to bedside. Nature, 470(7333):204–213.
Griffith, F. (1928). The significance of pneumococcal types. Journal of Hygiene, 27(02):113–159.
Hach, F., Hormozdiari, F., Alkan, C., Hormozdiari, F., Birol, I., Eichler, E. E., and Sahinalp, S. C.
(2010). mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods,
7(8):576–577.
Hach, F., Sarrafi, I., Hormozdiari, F., Alkan, C., Eichler, E. E., and Sahinalp, S. C. (2014). mrsFAST-
Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic
Acids Research, page gku370.
Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System technical
journal, 29(2):147–160.
Hatem, A., Bozdag, D., Toland, A. E., and Çatalyürek, Ü. V. (2013). Benchmarking short sequence
mapping tools. BMC Bioinformatics, 14(1):184.
Hershey, A. D. and Chase, M. (1952). Independent functions of viral protein and nucleic acid in
growth of bacteriophage. The Journal of General Physiology, 36(1):39–56.
Hieu Tran, N. and Chen, X. (2015). AMAS: optimizing the partition and filtration of adaptive
seeds to speed up read mapping. arXiv preprint arXiv:1502.05041.
Holtgrewe, M., Emde, A.-K., Weese, D., and Reinert, K. (2011). A novel and well-defined bench-
marking method for second generation read mapping. BMC Bioinformatics, 12(1):210.
Huang, Y.-F., Chen, S.-C., Chiang, Y.-S., Chen, T.-H., and Chiu, K.-P. (2012). Palindromic se-
quence impedes sequencing-by-ligation mechanism. BMC Systems Biology, 6(Suppl 2):S10.
Hyyrö, H. (2003). A bit-vector algorithm for computing Levenshtein and Damerau edit dis-
tances. Nordic Journal of Computing, 10(1):29–39.
Illumina, Inc. (2015). Sequencing Platform Comparison Tool. <https://www.illumina.com/
systems/sequencing-platform-comparison.html> Accessed 20.August.2015.
Jiang, H. and Wong, W. H. (2008). SeqMap: mapping massive amount of oligonucleotides to the
genome. Bioinformatics, 24(20):2395–2396.
Kemp, M. (2003). The Mona Lisa of modern science. Nature, 421(6921):416–420.
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P., and Frith, M. C. (2011). Adaptive seeds tame gen-
omic sequence comparison. Genome Research, 21(3):487–493.
Kim, J., Li, C., and Xie, X. (2014). Improving read mapping using additional prefix grams. BMC
Bioinformatics, 15(1):42.
Kunde-Ramamoorthy, G., Coarfa, C., Laritsky, E., Kessler, N. J., Harris, R. A., Xu, M., Chen, R.,
Shen, L., Milosavljevic, A., and Waterland, R. A. (2014). Comparison and quantitative verific-
ation of mapping algorithms for whole-genome bisulfite sequencing. Nucleic Acids Research,
42(6):e43–e43.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar,
K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human gen-
ome. Nature, 409(6822):860–921.
Langmead, B. and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature
Methods, 9(4):357–359.
Langmead, B., Schatz, M. C., Lin, J., Pop, M., and Salzberg, S. L. (2009a). Searching for SNPs with
cloud computing. Genome Biology, 10(11):R134.
Langmead, B., Trapnell, C., Pop, M., Salzberg, S. L., et al. (2009b). Ultrafast and memory-efficient
alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25.
Ledergerber, C. and Dessimoz, C. (2011). Base-calling for next-generation sequencing plat-
forms. Briefings in Bioinformatics, page bbq077.
Lee, W.-P., Stromberg, M. P., Ward, A., Stewart, C., Garrison, E. P., and Marth, G. T. (2014). MO-
SAIK: A Hash-Based Algorithm for Accurate Next-Generation Sequencing Short-Read Map-
ping. PLoS ONE, 9:e90581.
Levene, P. and London, E. (1929). The structure of thymonucleic acid. Journal of Biological
Chemistry, 83(3):793–802.
Levenshtein, V. (1966). Binary Codes Capable of Correcting Deletions, Insertions and Reversals.
In Soviet Physics Doklady, volume 10, page 707.
Li, H. (2012). Exploring single-sample SNP and INDEL calling with whole-genome de novo as-
sembly. Bioinformatics, 28(14):1838–1844.
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.
arXiv preprint arXiv:1303.3997.
Li, H. and Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics, 25(14):1754–1760.
Li, H. and Durbin, R. (2010). Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics, 26(5):589–595.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G.,
Durbin, R., et al. (2009a). The Sequence Alignment/Map format and SAMtools. Bioinformat-
ics, 25(16):2078–2079.
Li, H. and Homer, N. (2010). A survey of sequence alignment algorithms for next-generation
sequencing. Briefings in Bioinformatics, 11(5):473–483.
Li, H., Ruan, J., and Durbin, R. (2008a). Mapping short DNA sequencing reads and calling vari-
ants using mapping quality scores. Genome Research, 18(11):1851–1858.
Li, R., Li, Y., Kristiansen, K., and Wang, J. (2008b). SOAP: short oligonucleotide alignment pro-
gram. Bioinformatics, 24(5):713–714.
Li, R., Yu, C., Li, Y., Lam, T.-W., Yiu, S.-M., Kristiansen, K., and Wang, J. (2009b). SOAP2: an
improved ultrafast tool for short read alignment. Bioinformatics, 25(15):1966–1967.
Liu, C.-M., Wong, T., Wu, E., Luo, R., Yiu, S.-M., Li, Y., Wang, B., Yu, C., Chu, X., Zhao, K., et al.
(2012a). SOAP3: ultra-fast GPU-based parallel alignment tool for short reads. Bioinformatics,
28(6):878–879.
Liu, L., Li, Y., Li, S., Hu, N., He, Y., Pong, R., Lin, D., Lu, L., and Law, M. (2012b). Comparison of
Next-Generation Sequencing Systems. Journal of Biomedicine and Biotechnology, 2012.
Liu, Y., Popp, B., and Schmidt, B. (2014). CUSHAW3: Sensitive and Accurate Base-Space and
Color-Space Short-Read Alignment with Hybrid Seeding. PLoS ONE, 9(1).
Liu, Y. and Schmidt, B. (2012). Long read alignment based on maximal exact match seeds. Bioin-
formatics, 28(18):i318–i324.
Liu, Y., Schmidt, B., and Maskell, D. L. (2012c). CUSHAW: a CUDA compatible short read aligner
to large genomes based on the Burrows–Wheeler transform. Bioinformatics, 28(14):1830–
1837.
Luo, R., Wong, T., Zhu, J., Liu, C., Zhu, X., Leung, F. C., et al. (2013). SOAP3-dp: Fast, Accurate
and Sensitive GPU-Based Short Read Aligner. PLoS ONE, 8(5):e65632.
Manber, U. and Myers, G. (1993). Suffix arrays: a new method for on-line string searches. SIAM
Journal on Computing, 22(5):935–948.
Maxam, A. M. and Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the
National Academy of Sciences, 74(2):560–564.
McPherson, J. D. (2014). A defining decade in DNA sequencing. Nature Methods, 11(10):1003–
1005.
Mendel, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen des naturforschenden
Vereines in Brünn 4: 3, 44.
Metzker, M. L. (2010). Sequencing technologies—the next generation. Nature Reviews Genetics,
11(1):31–46.
Minoche, A. E., Dohm, J. C., Himmelbauer, H., et al. (2011). Evaluation of genomic high-
throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems.
Genome Biology, 12(11):R112.
Myers, G. (1999). A fast bit-vector algorithm for approximate string matching based on dynamic
programming. Journal of the ACM (JACM), 46(3):395–415.
NCBI (2015). SRA Toolkit Documentation. <http://www.ncbi.nlm.nih.gov/Traces/sra/?view=
toolkit_doc&f=fastq-dump> Accessed 12.August.2015.
Needleman, S. B. and Wunsch, C. D. (1970). A General Method Applicable to the Search for Sim-
ilarities in the Amino Acid Sequence of Two Proteins. Journal of Molecular Biology, 48(3):443–
453.
Newkirk, D., Biesinger, J., Chon, A., Yokomori, K., and Xie, X. (2011). AREM: Aligning Short
Reads from ChIP-Sequencing by Expectation Maximization. In Research in Computational
Molecular Biology, pages 283–297. Springer.
Nguyen, T., Shi, W., and Ruden, D. (2011). CloudAligner: A fast and full-featured MapReduce
based tool for sequence mapping. BMC Research Notes, 4(1):171.
Nirenberg, M., Leder, P., Bernfield, M., Brimacombe, R., Trupin, J., Rottman, F., and O'Neal, C.
(1965). RNA codewords and protein synthesis, VII. On the general nature of the RNA code.
Proceedings of the National Academy of Sciences of the United States of America, 53(5):1161.
Nobelprize.org (2015a). The Nobel Prize in Chemistry 1980. <http://www.nobelprize.org/
nobel_prizes/chemistry/laureates/1980/> Accessed 16.August.2015.
Nobelprize.org (2015b). The Nobel Prize in Physiology or Medicine 1962. <http://www.
nobelprize.org/nobel_prizes/medicine/laureates/1962> Accessed 20.March.2015.
Nordberg, H., Bhatia, K., Wang, K., and Wang, Z. (2013). BioPig: a Hadoop-based analytic toolkit
for large-scale sequence data. Bioinformatics, page btt528.
O|B|F (2015). Open Bioinformatics Foundation. <http://www.open-bio.org/> Accessed
12.August.2015.
O’Driscoll, A., Daugelaite, J., and Sleator, R. D. (2013). ‘Big data’, Hadoop and cloud computing
in genomics. Journal of Biomedical Informatics, 46(5):774–781.
Offit, K. (2014). Decade in review – genomics: A decade of discovery in cancer genomics. Nature
Reviews Clinical Oncology, 11(11):632–634.
Olson, M. (2010). HADOOP: Scalable, Flexible Data Storage and Analysis. IQT Quarterly, 1(3):14–18.
Onsongo, G., Erdmann, J., Spears, M. D., Chilton, J., Beckman, K. B., Hauge, A., Yohe, S., Scho-
maker, M., Bower, M., Silverstein, K. A., et al. (2014). Implementation of Cloud based Next
Generation Sequencing data analysis in a clinical laboratory. BMC Research Notes, 7(1):314.
Oracle Corporation (2015a). Class AbstractMap.SimpleEntry<K,V>. <http://docs.oracle.com/
javase/7/docs/api/java/util/AbstractMap.SimpleEntry.html> Accessed 27.August.2015.
Oracle Corporation (2015b). Class ArrayList<E>. <https://docs.oracle.com/javase/7/docs/api/
java/util/ArrayList.html> Accessed 27.August.2015.
Oracle Corporation (2015c). Class Class<T>. <http://docs.oracle.com/javase/7/docs/api/java/
lang/Class.html> Accessed 5.September.2015.
Oracle Corporation (2015d). Class Constructor<T>. <https://docs.oracle.com/javase/7/docs/
api/java/lang/reflect/Constructor.html#newInstance(java.lang.Object...)> Accessed
28.August.2015.
Oracle Corporation (2015e). Class HashMap<K,V>. <http://docs.oracle.com/javase/7/docs/
api/java/util/HashMap.html> Accessed 28.August.2015.
Oracle Corporation (2015f). Class HashSet<E>. <http://docs.oracle.com/javase/7/docs/api/
java/util/HashSet.html> Accessed 2.September.2015.
Oracle Corporation (2015g). Class Object. <http://docs.oracle.com/javase/7/docs/api/java/
lang/Object.html> Accessed 5.September.2015.
Oracle Corporation (2015h). Class Properties. <http://docs.oracle.com/javase/7/docs/api/
java/util/Properties.html> Accessed 4.September.2015.
Oracle Corporation (2015i). Class StringBuffer. <http://docs.oracle.com/javase/7/docs/api/
java/lang/StringBuffer.html> Accessed 28.August.2015.
Oracle Corporation (2015j). Class TreeSet<E>. <http://docs.oracle.com/javase/7/docs/api/
java/util/TreeSet.html> Accessed 5.September.2015.
Oracle Corporation (2015k). Interface Comparable<T>. <http://docs.oracle.com/javase/7/
docs/api/java/lang/Comparable.html> Accessed 5.September.2015.
Oracle Corporation (2015l). Interface Runnable. <https://docs.oracle.com/javase/7/docs/api/
java/lang/Runnable.html> Accessed 4.September.2015.
Oracle Corporation (2015m). The JavaTM Tutorials - Abstract Methods and Classes. <https://
docs.oracle.com/javase/tutorial/java/IandI/abstract.html> Accessed 17.August.2015.
Oracle Corporation (2015n). The JavaTM Tutorials - Classes. <https://docs.oracle.com/javase/
tutorial/java/javaOO/classes.html> Accessed 17.August.2015.
Oracle Corporation (2015o). The JavaTM Tutorials - Creating and Using Packages. <https://docs.
oracle.com/javase/tutorial/java/package/packages.html> Accessed 28.August.2015.
Oracle Corporation (2015p). The JavaTM Tutorials - Lesson: A Closer Look at the "Hello World!"
Application. <https://docs.oracle.com/javase/tutorial/getStarted/application/#MAIN> Ac-
cessed 28.August.2015.
Oracle Corporation (2015q). The JavaTM Tutorials - Objects. <https://docs.oracle.com/javase/
tutorial/java/javaOO/objects.html> Accessed 17.August.2015.
Oracle Corporation (2015r). The JavaTM Tutorials - Thread Pools. <http://docs.oracle.com/
javase/tutorial/essential/concurrency/pools.html> Accessed 4.September.2015.
O’Rawe, J., Jiang, T., Sun, G., Wu, Y., Wang, W., Hu, J., Bodily, P., Tian, L., Hakonarson, H., John-
son, W. E., et al. (2013). Low concordance of multiple variant-calling pipelines: practical im-
plications for exome and genome sequencing. Genome Medicine, 5(3):28.
Pak, T. and Kasarskis, A. (2015). How next-generation sequencing and multiscale data analysis
will transform infectious disease management. Clinical Infectious Diseases, page civ670.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison.
Proceedings of the National Academy of Sciences, 85(8):2444–2448.
Pettersson, E., Lundeberg, J., and Ahmadian, A. (2009). Generations of sequencing technologies.
Genomics, 93(2):105–111.
Prober, J. M., Trainor, G. L., Dam, R. J., Hobbs, F. W., Robertson, C. W., Zagursky, R. J., Cocuzza,
A. J., Jensen, M. A., and Baumeister, K. (1987). A system for rapid DNA sequencing with fluor-
escent chain-terminating dideoxynucleotides. Science, 238(4825):336–341.
Quail, M. A., Smith, M., Coupland, P., Otto, T. D., Harris, S. R., Connor, T. R., Bertoni, A., Swerdlow,
H. P., and Gu, Y. (2012). A tale of three next generation sequencing platforms: comparison of
Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics, 13(1):341.
Reinert, K., Langmead, B., Weese, D., and Evers, D. J. (2015). Alignment of Next-Generation
Sequencing Reads. Annual Review of Genomics and Human Genetics.
Roberts, A., Feng, H., and Pachter, L. (2013). Fragment assignment in the cloud with eXpress-D.
BMC Bioinformatics, 14(1):358.
Roche Diagnostics Corporation (2015). 454 Products. <http://454.com/products/index.asp>
Accessed 20.August.2015.
Ross, J. S. and Cronin, M. (2011). Whole Cancer Genome Sequencing by Next-Generation Meth-
ods. American Journal of Clinical Pathology, 136(4):527–539.
Rumble, S., Lacroute, P., Dalca, A., Fiume, M., Sidow, A., and Brudno, M. (2009). SHRiMP: accur-
ate mapping of short color-space reads. PLoS Computational Biology, 5(5):e1000386.
Sanger, F., Air, G., Barrell, B., Brown, N., Coulson, A., Fiddes, J., Hutchison, C., Slocombe,
P., and Smith, M. (1977a). Nucleotide sequence of bacteriophage ϕX174 DNA. Nature,
265(5596):687–695.
Sanger, F., Nicklen, S., and Coulson, A. R. (1977b). DNA sequencing with chain-terminating
inhibitors. Proceedings of the National Academy of Sciences, 74(12):5463–5467.
Schulz, M. H., Weese, D., Holtgrewe, M., Dimitrova, V., Niu, S., Reinert, K., and Richard, H. (2014).
Fiona: a parallel and automatic strategy for read error correction. Bioinformatics, 30(17):i356–
i363.
Shendure, J. and Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology,
26(10):1135–1145.
Siragusa, E., Weese, D., and Reinert, K. (2013). Fast and accurate read mapping with approximate
seeds and multiple backtracking. Nucleic Acids Research, 41(7):e78.
Smith, A. D., Chung, W.-Y., Hodges, E., Kendall, J., Hannon, G., Hicks, J., Xuan, Z., and
Zhang, M. Q. (2009). Updates to the RMAP short-read mapping software. Bioinformatics,
25(21):2841–2842.
Smith, A. D., Xuan, Z., and Zhang, M. Q. (2008). Using quality scores and longer reads improves
accuracy of Solexa read mapping. BMC Bioinformatics, 9(1):128.
Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Dodd, C., Connell, C. R., Heiner, C., Kent,
S. B., and Hood, L. E. (1986). Fluorescence detection in automated DNA sequence analysis.
Nature, 321(6071):674–679.
Smith, T. F. and Waterman, M. S. (1981). Identification of Common Molecular Subsequences.
Journal of Molecular Biology, 147(1):195–197.
Staden, R. (1979). A strategy of DNA sequencing employing computer programs. Nucleic Acids
Research, 6(7):2601–2610.
The Apache Software Foundation (2015a). Apache SparkTM. <http://spark.apache.org/> Ac-
cessed 17.September.2015.
The Apache Software Foundation (2015b). ApacheTM Hadoop®. <http://hadoop.apache.org/>
Accessed 17.September.2015.
The New York Times (2007). Statement by James D. Watson. <http://www.nytimes.com/2007/
10/25/science/26wattext.html?_r=0> Accessed 20.August.2015.
Thermo Fisher Scientific Inc. (2015). SOLiD® Next-Generation Sequencing. <https://www.
thermofisher.com/pt/en/home/life-science/sequencing/next-generation-sequencing/
solid-next-generation-sequencing.html> Accessed 20.August.2015.
Treangen, T. J. and Salzberg, S. L. (2012). Repetitive DNA and next-generation sequencing: com-
putational challenges and solutions. Nature Reviews Genetics, 13(1):36–46.
Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell,
M., Evans, C. A., Holt, R. A., et al. (2001). The sequence of the human genome. Science,
291(5507):1304–1351.
Wang, Q., Xia, J., Jia, P., Pao, W., and Zhao, Z. (2013). Application of next generation sequencing
to human gene fusion detection: computational tools, features and perspectives. Briefings in
Bioinformatics, 14(4):506–519.
Watson, J. D. (1990). The Human Genome Project: Past, Present, and Future. Science,
248(4951):44–49.
Watson, J. D. and Crick, F. H. C. (1953a). Molecular Structure of Nucleic Acids. Nature,
171(4356):737–738.
Watson, J. D. and Crick, F. H. C. (1953b). The structure of DNA. In Cold Spring Harbor Symposia
on Quantitative Biology, volume 18, pages 123–131. Cold Spring Harbor Laboratory Press.
Weese, D., Emde, A.-K., Rausch, T., Döring, A., and Reinert, K. (2009). RazerS—fast read mapping
with sensitivity control. Genome Research, 19(9):1646–1654.
Weese, D., Holtgrewe, M., and Reinert, K. (2012). RazerS 3: faster, fully sensitive read mapping.
Bioinformatics, 28(20):2592–2599.
Wetterstrand, K. A. (2015). DNA Sequencing Costs: Data from the NHGRI Genome Sequencing
Program (GSP). <www.genome.gov/sequencingcosts> Accessed 15.July.2015.
Wiewiórka, M. S., Messina, A., Pacholewska, A., Maffioletti, S., Gawrysiak, P., and Okoniewski,
M. J. (2014). SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data
analysis with nucleotide precision. Bioinformatics, 30(18):2652–2653.
Wilton, R., Budavari, T., Langmead, B., Wheelan, S. J., Salzberg, S. L., and Szalay, A. S. (2015).
Arioc: high-throughput read alignment with GPU-accelerated exploration of the seed-and-
extend search space. PeerJ, 3:e808.
Yang, X., Chockalingam, S. P., and Aluru, S. (2013). A survey of error-correction methods for
next-generation sequencing. Briefings in Bioinformatics, 14(1):56–66.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark: cluster
computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in
cloud computing, volume 10, page 10.
Zhang, J., Chiodini, R., Badr, A., and Zhang, G. (2011). The impact of next-generation sequencing
on genomics. Journal of Genetics and Genomics, 38(3):95–109.
Zhao, G., Ling, C., and Sun, D. (2015). SparkSW: Scalable Distributed Computing System for
Large-Scale Biological Sequence Alignment. In Cluster, Cloud and Grid Computing (CCGrid),
2015 15th IEEE/ACM International Symposium on, pages 845–852. IEEE.