
Processing of Raw Genome Data

Kristina Kirsten, Torben Meyer

Trends in Bioinformatics

Hasso-Plattner-Institut


Session Agenda

1. Why Sequence the Human Genome?

2. Problem Statement

■ Genetic Basics

■ Sanger Sequencing vs. NGS

3. Sequencing Pipeline

■ Base Calling

■ Alignment

■ Variant Calling

■ Data Annotation

4. Analysis Results / Use Cases

5. Outlook

6. Discussion

Kristina Kirsten, Torben Meyer, 27.01.2016


Why Sequence the Human Genome?

Understand mutations

[1] http://evolution.berkeley.edu/evolibrary/images/evo/dna-mutation.gif

[2] http://informoverload.com/man-has-extra-fingers/

Identify markers for diseases

http://hmg.oxfordjournals.org/content/18/R1/R48/F1.large.jpg

Take personalized treatment decisions

http://www.alphagenomix.com/wp-content/uploads/2014/03/Figure21.png

Genetic Basics (1)


Genetic Basics (2)


• Base pairs:

Adenine (A) & Thymine (T)

Guanine (G) & Cytosine (C)

Genes = specific parts of DNA

Allele = one specific form of a gene

http://study.com/cimages/multimages/16/phenotype_v_genotype.png


1,000 Genomes Project


Reference Genome


Sanger Sequencing vs. NGS

Sanger Sequencing

■ Developed in 1977
■ Low parallelization
■ Few but long reads
□ 400 to 900 bases per read
■ Low error rate
□ 1 Kb with per-base error rate < 0.001 %
■ Low amount of data per run
□ 100 min for sequencing 1 Kb
■ Expensive: $ 0.5 per Kb

Next Generation Sequencing (NGS)

■ Introduced in 2004
■ High parallelization
■ Many short reads
□ 50 to 300 bases per read
■ High error rate
□ 0.5 - 1.0 % per-base error rate
■ High amount of data per run
□ Up to 6 billion reads
□ 0.002 min for sequencing 1 Kb
■ Affordable: $ 0.00005 per Kb

3. Sequencing Pipeline


Next-Generation Sequencing Bioinformatics Pipeline


3.1 Base Calling


■ Receives initial raw data

□ Image Filters, fluctuations in current, …

□ From Roche/454, Illumina, SOLiD, Helicos, …

■ Calls nucleotide bases (A, C, G, T) in short strings

□ A few giga base-pairs (Gbp) per machine-day

□ Usually 50 - 300 bases long

■ Base calls get quality score assigned


Excursion – Phred Quality Scores


1. Probabilities of base call errors are very small

■ They need to be mapped to values that are easier to compare

$Q = -10 \cdot \log_{10} P$

2. Q: quality score, P: probability of a base call error

3. Quality scores are calculated in the base calling phase


Quality Score | Probability of Base Call Error | Base Call Accuracy
5             | approx. 1 in 3                 | approx. 68 %
10            | 1 in 10                        | 90 %
20            | 1 in 100                       | 99 %
30            | 1 in 1,000                     | 99.9 %
40            | 1 in 10,000                    | 99.99 %
50            | 1 in 100,000                   | 99.999 %
60            | 1 in 1,000,000                 | 99.9999 %

Source: https://en.wikipedia.org/wiki/Phred_quality_score
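The mapping between error probability and quality score can be checked with a few lines of Python. This is a minimal sketch of the formula Q = -10 * log10(P) and its inverse; the function names are illustrative and not taken from any sequencing tool.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Quality score Q for a base call error probability P: Q = -10 * log10(P)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Inverse mapping: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# Recompute the rows of the quality score table.
for q in (5, 10, 20, 30, 40, 50, 60):
    p = error_prob_from_phred(q)
    print(f"Q={q:2d}  P(error)={p:.6g}  accuracy={100 * (1 - p):.4f} %")
```

Because the scale is logarithmic, every additional 10 points of Q divides the error probability by ten.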


3.2 Alignment


Alignment in the Sequencing Pipeline


Challenges

■ High error rates in NGS reads
■ High number of short reads
■ Mutations and variations can occur
■ Gaps, e.g. indels (insertions and deletions)
■ Bases could be missing / new bases could have been inserted

Alignment Algorithms

■ Using hash tables
□ BLAST
□ SeqMap
□ MAQ
■ Suffix or prefix tries (e.g. using an FM-Index)
□ Bowtie and Bowtie 2
□ BWA, BWA-SW
□ SOAP, SOAP2, SOAP3

Bowtie 2 – Overview

■ Based on Bowtie
■ Allows
□ Ungapped alignment
□ Gapped alignment (containing indels)
□ Inexact matching
■ Technologies
□ FM-index, built on the BWT
□ SIMD-accelerated dynamic programming (Smith-Waterman)
■ Input: FASTQ file of reads
■ Output: SAM file of aligned reads

FASTQ-SANGER

■ Four lines per read
■ '@' title line
□ Record identifier
□ Other commentary
■ Sequence line
□ IUPAC single-letter, uppercase codes for DNA or RNA (G, T, C, A)
■ '+' line
□ Marks the end of the sequence line
□ Sometimes contains a copy of the '@' line
■ Quality line
□ Allows PHRED quality scores from 0 to 93
□ Displayed as ASCII codes 33 – 126, one char per score
□ Same length as the sequence line

Example record:

@SRR014849.1 EIXKN4201CFU84 length=93
GGGGGGGGGGGGGGGGCTTTTTTTGTTTGGAACCGAAAGGGTTTTGAATTTCAAACCCTTTTCGGTTTCCAACCTTCCAAGCAATGCCAATA
+SRR014849.1 EIXKN4201CFU84 length=93
3+&$#"""""""""""7F@71,'";C?,B;?6B;:EA1EA1EA5'9B:?:#9EA0D@2EA5':>5?:%A;A8A;?9B;D@/=<?7=9<2A8==
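The four-line layout and the ASCII offset of 33 can be illustrated with a small parser. This is only a sketch, assuming each record occupies exactly four physical lines (no wrapped sequence or quality lines); the names `FastqRecord`, `decode_sanger_quality` and `parse_fastq`, and the toy read, are hypothetical and not part of any library.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    title: str            # text after the leading '@'
    sequence: str         # base calls
    qualities: list[int]  # one PHRED score per base


def decode_sanger_quality(quality_line: str) -> list[int]:
    """FASTQ-SANGER stores each score as the ASCII character with code score + 33."""
    return [ord(ch) - 33 for ch in quality_line]


def parse_fastq(lines: list[str]) -> Iterator[FastqRecord]:
    """Yield records from unwrapped FASTQ text (assumption: four lines per read)."""
    stripped = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four lines")
    for i in range(0, len(stripped), 4):
        title, seq, plus, qual = stripped[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lines differ in length")
        yield FastqRecord(title[1:], seq, decode_sanger_quality(qual))


# Toy record (made up for illustration, not taken from the slides).
example = ["@read1 hypothetical example\n", "GATTACA\n", "+\n", "II5+#!~\n"]
for record in parse_fastq(example):
    print(record.title, record.sequence, record.qualities)
```

The characters `!` and `~` are the two ends of the allowed range and decode to the scores 0 and 93.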


Burrows-Wheeler-Transformation (BWT)

■ Input: AATTCGA

■ Append EOF char: AATTCGA$

■ Shift right (all cyclic rotations)

A A T T C G A $
$ A A T T C G A
A $ A A T T C G
G A $ A A T T C
C G A $ A A T T
T C G A $ A A T
T T C G A $ A A
A T T C G A $ A

■ Sort lexicographically

$ A A T T C G A
A $ A A T T C G
A A T T C G A $
A T T C G A $ A
C G A $ A A T T
G A $ A A T T C
T C G A $ A A T
T T C G A $ A A

■ Return last column: AG$ATCTA
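The construction steps can be written down directly. The following is a naive sketch that materializes every rotation, which is fine for a slide-sized example but not how aligners build the index for a whole genome; the function name `bwt` is illustrative. It generates the rotations by left shifts, which yields the same set of rows as shifting right.

```python
def bwt(text: str, eof: str = "$") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, return the last column."""
    if eof in text:
        raise ValueError("text must not contain the EOF character")
    s = text + eof                                       # append EOF char
    rotations = [s[i:] + s[:i] for i in range(len(s))]   # all cyclic rotations
    rotations.sort()                                     # '$' sorts before 'A', 'C', 'G', 'T'
    return "".join(row[-1] for row in rotations)         # last column


print(bwt("AATTCGA"))  # AG$ATCTA
```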


BWT - Unpermute

1. We only know the first column (the sorted letters) and the last column

$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A

2. Last-First-Property: the relative order of identical letters is the same in the last and the first column

3. We start with the first letter in the last column and apply that property to unpermute the sentence

4. This letter equals the last letter of the original sentence: A

5. Apply the LF-Mapping property until $ is reached

A
GA
CGA
TCGA
TTCGA
ATTCGA
AATTCGA

6. The original sentence is reconstructed: AATTCGA
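The unpermute procedure can be sketched in Python as follows. The sketch assumes that the EOF character sorts before every other letter, so that row 0 of the sorted matrix starts with `$`; the helper names `lf_mapping` and `inverse_bwt` are illustrative.

```python
def lf_mapping(last: str) -> list[int]:
    """lf[i] = row whose first-column letter is the same occurrence as last[i].

    A stable sort of the row indices by letter reproduces the first column and
    keeps identical letters in their original order (Last-First property).
    """
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    return lf


def inverse_bwt(last: str, eof: str = "$") -> str:
    """Rebuild the original text from the last column, back to front."""
    if last.count(eof) != 1:
        raise ValueError("last column must contain exactly one EOF character")
    lf = lf_mapping(last)
    row = 0                      # assumption: the row starting with EOF sorts first
    letters = []
    while last[row] != eof:
        letters.append(last[row])  # letter preceding the current rotation
        row = lf[row]              # jump to the row that starts with that letter
    return "".join(reversed(letters))


print(inverse_bwt("AG$ATCTA"))  # AATTCGA
```

The loop collects A, G, C, T, T, A, A in that order, which is the original string read from its last letter to its first.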


BWT – Exact Matching

Let's search for ATT in the original string!

1. Write the output of the BWT to the last column

2. Sort the letters lexicographically and write them to the first column

$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A

3. Mark all occurrences of the last letter of the search string in the first column

4. Check for occurrences of the next letter in the last column of the marked rows

5. Count which occurrence of T was found

6. Mark that occurrence in the first column

7. Repeat until all letters of the search string are processed

FOUND!
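The marking and counting steps correspond to what is commonly called backward search. A minimal sketch, assuming only the last column is available; it counts occurrences by scanning the string, whereas a real FM-index would use precomputed rank tables. The function name `backward_search` is illustrative.

```python
def backward_search(last: str, pattern: str) -> tuple[int, int]:
    """Return the half-open range [lo, hi) of sorted rows that start with pattern.

    An empty range is reported as (0, 0).
    """
    first = sorted(last)
    # first_row[c] = index of the first row whose first-column letter is c
    first_row: dict[str, int] = {}
    for i, ch in enumerate(first):
        first_row.setdefault(ch, i)

    def occ(ch: str, i: int) -> int:
        """Number of occurrences of ch in last[:i] (naive linear scan)."""
        return last[:i].count(ch)

    lo, hi = 0, len(last)
    for ch in reversed(pattern):      # process the pattern from its last letter
        if ch not in first_row:
            return (0, 0)
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:                  # no row left: pattern does not occur
            return (0, 0)
    return (lo, hi)


print(backward_search("AG$ATCTA", "ATT"))   # (3, 4)
print(backward_search("AG$ATCTA", "TCAA"))  # (0, 0)
```

For ATT the range shrinks from the two rows starting with T, to the single row starting with TT, to the row at index 3 (counted from 0), which starts with ATT.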


BWT – Unwind

8. Now we need to unwind to find the position in the original string

AATTCGA$

$ . . . . . . A
A . . . . . . G
A . . . . . . $
A . . . . . . A
C . . . . . . T
G . . . . . . C
T . . . . . . T
T . . . . . . A

9. The end of the string is found. Position of the searched string: its first letter is the 6th last letter of the sentence ≜ 2nd letter of the original string
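One way to recover the position is to follow the LF mapping from the matched row until the EOF character appears in the last column. Note that this sketch walks toward the start of the string and therefore counts from the front, which may differ from the direction drawn on the slides; the result is the same position. The helper names are illustrative.

```python
def lf_mapping(last: str) -> list[int]:
    """lf[i] = row whose first-column letter is the same occurrence as last[i]."""
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    return lf


def text_offset(last: str, row: int, eof: str = "$") -> int:
    """0-based start position in the original text of the rotation in `row`.

    Each LF step moves one letter toward the start of the text; the walk ends
    when the preceding letter is EOF, i.e. the rotation starts at position 0.
    """
    if last.count(eof) != 1:
        raise ValueError("last column must contain exactly one EOF character")
    lf = lf_mapping(last)
    steps = 0
    while last[row] != eof:
        row = lf[row]
        steps += 1
    return steps


# Row 3 (counted from 0) is the row found for ATT in the exact-matching example.
print(text_offset("AG$ATCTA", 3))  # 1, i.e. the 2nd letter of AATTCGA
```

Production aligners avoid walking the whole way by storing sampled suffix array positions, a detail the slides do not cover.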


BWT – Inexact Matching

Kirsten, KristinaMeyer, Torben27.01.2016

Processing of Raw Genome Data

Chart 66

But what about inexact matches?

What if we want to find TCAA in the original string?

Quality scores are assigned in the base calling phase.

Base:     T   C   A   A
Quality:  65  47  10  50

1. Proceed as in exact matching (query: TCAA)

2. No match found for the next base, C

What now?

3. Look at the base calls and pick the position in the stack with the lowest quality score

■ The second A (matching runs from right to left) had a score of 10

4. Walk back in the stack and replace it with a different possible base (e.g., G)

5. Continue with exact matching and repeat until a match is found

TCGA (match found)

$ . . . . . . A

A . . . . . . G

A . . . . . . $

A . . . . . . A

C . . . . . . T

G . . . . . . C

T . . . . . . T

T . . . . . . A

We found an alignment for TCAA with one mismatch.
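The quality-guided substitution can be sketched as follows. This is a simplified illustration under stated assumptions: instead of walking back in a stack, it restarts exact matching for each candidate, it allows only one substitution, and the helper names (`build`, `exact`, `inexact`) are hypothetical.

```python
def build(text):
    """First and last column of the sorted rotation matrix of text ('$'-terminated)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[0] for r in rotations), "".join(r[-1] for r in rotations)


def exact(first, last, pattern):
    """True if pattern occurs in the indexed text (backward search)."""
    lo, hi = 0, len(last)
    for ch in reversed(pattern):
        if ch not in first:
            return False
        start = first.index(ch)
        lo, hi = start + last[:lo].count(ch), start + last[:hi].count(ch)
        if lo >= hi:
            return False
    return True


def inexact(first, last, read, quals, alphabet="ACGT"):
    """Return (matched_sequence, mismatches) or (None, None).
    Positions are tried in order of increasing base quality, so the
    least trustworthy base call is replaced first."""
    if exact(first, last, read):
        return read, 0
    for pos in sorted(range(len(read)), key=lambda i: quals[i]):
        for base in alphabet:
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if exact(first, last, candidate):
                return candidate, 1
    return None, None


first, last = build("AATTCGA$")
hit, mismatches = inexact(first, last, "TCAA", [65, 47, 10, 50])
```

With the qualities from the slide, the A with score 10 is tried first; replacing it with C fails, replacing it with G gives TCGA, which matches.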

Gapped Alignment

What about gapped alignment (e.g. indels)?

Let's look at the Bowtie 2 pipeline.

Bowtie 2 – 1. Extract Seeds

Extract seeds from each read

Follow a certain policy (e.g. a 16 nt substring every 10 nt along the read)

The following steps work with the seed strings

Read:

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA
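The seed policy from the slide (16 nt every 10 nt) can be written as a short function. The name `extract_seeds` and its parameters are illustrative; the sketch only covers the forward strand and is not Bowtie 2's actual code.

```python
def extract_seeds(read, seed_len=16, interval=10):
    """Return (offset, seed) pairs: a seed_len substring every
    `interval` positions, as long as a full seed still fits."""
    return [
        (offset, read[offset:offset + seed_len])
        for offset in range(0, len(read) - seed_len + 1, interval)
    ]


read = "CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA"
seeds = extract_seeds(read)
```

For the 37 nt read on the slide this yields three seeds, at offsets 0, 10 and 20.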

Bowtie 2 – 2. Align with FM Index

■ Find alignments using the BWT as described earlier

□ There may be more than one possible alignment per seed!

■ Returns ranges of possible alignments within the FM index

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA

BWT, (in)exact matching

{ [211, 212]; [212, 214] }

Bowtie 2 – 3. Prioritize, Resolve

■ Every alignment gets priority 1/r²

□ r = total number of alignments for the seed

□ Seeds with fewer alignments get higher priority

■ Randomly select seeds, weighted by priority

□ Run a dynamic programming approach on these

□ Modified Smith-Waterman algorithm
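The 1/r² weighting can be illustrated with a weighted random draw. The function names and the dictionary layout are assumptions made for this sketch, and sampling here is with replacement, which may differ from what Bowtie 2 does internally.

```python
import random


def priority(r):
    """Priority of an alignment whose seed has r alignments in total."""
    return 1.0 / (r * r)


def pick_seeds(hit_counts, k, rng=None):
    """hit_counts maps a seed to its number of alignments r.
    Draw k seeds at random, weighted by 1/r^2, so that seeds with
    few alignments (less repetitive sequence) are chosen more often."""
    rng = rng or random.Random()
    seeds = [s for s, r in hit_counts.items() if r > 0]
    weights = [priority(hit_counts[s]) for s in seeds]
    return rng.choices(seeds, weights=weights, k=k)


picked = pick_seeds({"unique_seed": 1, "repetitive_seed": 10}, k=5)
```

A seed with a single alignment has weight 1, one with ten alignments only 0.01, so the unique seed dominates the draw.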

Smith-Waterman

■ Reference Genome:

CCAGTAGCTCTCAGCCTTAATTTTACCCAGGCCTGTA

■ Read:

GCTCAG

■ One exact match for this seed found!

■ Uses a dynamic programming approach

□ Solve a larger problem by first solving smaller subproblems

□ Build the solution of the larger problem on the results of the smaller ones

□ Fill out a table to save all results

■ Give a reward if a match is found

■ Give a penalty for a mismatch or a gap

■ The table cell with the highest score marks the end of the best alignment

■ Let's align GCTCAG to GCTCTCAG!

      -  G  C  T  C  T  C  A  G
   -
   G
   C
   T
   C
   A
   G

\[
H(i,j) = \max
\begin{cases}
0 \\
H(i-1,\,j-1) + w(a_i, b_j) \\
H(i-1,\,j) + w(a_i, -) \\
H(i,\,j-1) + w(-, b_j)
\end{cases}
\]

\[
w(a,b) =
\begin{cases}
+2, & a = b \\
-1, & a \neq b \ \text{or}\ a = - \ \text{or}\ b = -
\end{cases}
\]

The table is filled cell by cell using H(i, j):

■ Initialize the first row and the first column with 0

■ Each remaining cell depends only on its upper-left, upper and left neighbours

■ Cells on the same anti-diagonal are independent of each other: Parallelizable!

      -  G  C  T  C  T  C  A  G
   -  0  0  0  0  0  0  0  0  0
   G  0  2  1
   C  0  1
   T  0
   C  0
   A  0
   G  0

Smith-Waterman

      -  G  C  T  C  T  C  A  G
   -  0  0  0  0  0  0  0  0  0
   G  0  2  1  0  0  0  0  0  2
   C  0  1  4  3  2  1  2  1  1
   T  0  0  3  6  5  4  3  2  1
   C  0  0  2  5  8  7  6  5  4
   A  0  0  1  4  7  7  6  8  7
   G  0  2  1  3  6  6  6  7 10

■ Traceback from the best value (10, bottom-right cell)

□ We need to remember how we calculated the values for that!

■ Alignment:

G C T C T C A G

G C - - T C A G
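The table fill and traceback can be reproduced with a short script. It is a sketch of plain Smith-Waterman with the scores from the slides (+2 match, -1 mismatch or gap), not Bowtie 2's modified version. The function name and the tie-breaking order are choices made for this example: ties are resolved in favour of gaps so that a gapped alignment shows up. Several alignments share the best score of 10, including the ungapped CTCAG against CTCAG.

```python
def smith_waterman(read, ref, match=2, penalty=-1):
    """Local alignment of read (rows) against ref (columns).
    Returns (best_score, aligned_ref, aligned_read, H)."""
    rows, cols = len(read) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]      # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            w = match if read[i - 1] == ref[j - 1] else penalty
            H[i][j] = max(0,
                          H[i - 1][j - 1] + w,        # match / mismatch
                          H[i - 1][j] + penalty,      # gap in the reference
                          H[i][j - 1] + penalty)      # gap in the read
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: recompute which neighbour produced each value.
    i, j = best_pos
    aln_ref, aln_read = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i][j - 1] + penalty:          # came from the left
            aln_ref.append(ref[j - 1]); aln_read.append("-"); j -= 1
        elif H[i][j] == H[i - 1][j] + penalty:        # came from above
            aln_ref.append("-"); aln_read.append(read[i - 1]); i -= 1
        else:                                         # came from the diagonal
            aln_ref.append(ref[j - 1]); aln_read.append(read[i - 1])
            i -= 1; j -= 1
    return best, "".join(reversed(aln_ref)), "".join(reversed(aln_read)), H


score, aln_ref, aln_read, H = smith_waterman("GCTCAG", "GCTCTCAG")
print(score)
print(aln_ref)
print(aln_read)
```

Storing a back-pointer per cell during the fill would avoid recomputing the origin of each value, at the cost of a second table.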

Results

■ Smith-Waterman ends when ...

□ all possible alignments are examined

□ enough alignments are examined

□ the dynamic programming limit is reached

■ Pick the alignment with the highest Smith-Waterman score

Adjustments to Smith-Waterman in Bowtie 2

■ Different gap penalties for opening and extending a gap

■ Restrictions on where gaps are allowed

■ Scoring function also takes quality scores into account

■ Reseeding if no proper matches were found
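The first and third adjustment can be illustrated with two small scoring functions. All constants below (gap open 5, gap extend 3, mismatch penalty between 2 and 6, quality cap 40) are assumptions chosen for illustration and are not taken from the slides; they resemble the kind of defaults described in the Bowtie 2 manual, which should be consulted for the real values.

```python
def gap_penalty(length, gap_open=5, gap_extend=3):
    """Affine gap cost: opening is charged once, extension per gap position.
    One long gap is therefore cheaper than several short ones."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


def mismatch_penalty(phred, max_pen=6, min_pen=2, cap=40):
    """Quality-aware mismatch cost: a mismatch at a low-quality base
    (likely a sequencing error) is penalised less than one at a
    high-quality base. Integer arithmetic, quality capped at `cap`."""
    q = min(max(phred, 0), cap)
    return min_pen + (max_pen - min_pen) * q // cap


one_gap_of_two = gap_penalty(2)        # 5 + 2 * 3
two_gaps_of_one = 2 * gap_penalty(1)   # 2 * (5 + 3)
```

With these numbers a single gap of length 2 costs 11 while two separate gaps of length 1 cost 16, which is why affine penalties favour contiguous indels.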

Sequence Alignment/Map format (SAM)

@SQ SN:ref LN:45
QNAME FLAG RNAME POS MAPQ CIGAR       RNEXT PNEXT TLEN SEQ                 QUAL
r001  163  ref   7   30   8M2I4M1D3M  =     37    39   TTAGATAAAAGGATACTA  *
r002  0    ref   9   30   3S6M1P1I4M  *     0     0    AAAAGATAAGGATA      *
r003  0    ref   9   30   5H6M        *     0     0    AGCTAA              *     NM:i:1
r004  0    ref   16  30   6M14N5M     *     0     0    TAGCTTCAGC          *

■ Leftmost position of the alignment (POS)

■ Quality of the alignment (MAPQ, Phred-scaled)

■ Matching as CIGAR string (CIGAR)

■ Query sequence (SEQ)
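The four highlighted fields can be pulled out of a record with a few lines of Python. The helper names are made up for this sketch. Real SAM files are tab-delimited, so splitting on any whitespace is a simplification that works for the records shown on the slide. Libraries such as pysam handle the format properly.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """'3S6M1P1I4M' -> [(3, 'S'), (6, 'M'), (1, 'P'), (1, 'I'), (4, 'M')]"""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, =, X consume the reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar):
    """Number of bases expected in SEQ (M, I, S, =, X consume the query)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


def parse_sam_line(line):
    """Extract the fields highlighted on the slide from one alignment record."""
    f = line.split()
    return {
        "qname": f[0],
        "flag": int(f[1]),
        "rname": f[2],
        "pos": int(f[3]),      # 1-based leftmost position
        "mapq": int(f[4]),     # Phred-scaled mapping quality
        "cigar": f[5],
        "seq": f[9],
    }


rec = parse_sam_line("r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *")
last_ref_pos = rec["pos"] + reference_span(rec["cigar"]) - 1
```

For r002 the CIGAR string accounts for 14 query bases, matching the length of SEQ, and covers 10 reference bases starting at position 9.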

3.3 Variant Calling

Variant Calling in Pipeline

Genetic Variation vs. Mutation

“Genetic variation is what makes us all unique, whether in terms of hair colour, skin colour or even the shape of our faces.”

vs.

“A mutation is a change that occurs in our DNA sequence, either due to mistakes when the DNA is copied or as the result of environmental factors such as UV light and cigarette smoke.”

Mutations contribute to genetic variation within species

What Does Variant Calling Mean?

Identifying where the sequenced genome differs from the reference:

■ Single-nucleotide polymorphisms (SNPs)

■ Insertions and deletions (indels)

■ Larger structural variants

□ Copy number variation (CNV)

□ Loss of one or both copies of a gene

□ Movement of DNA sections from one location to another

SNP Calling Methods

■ Early method: counting the abundance of high-quality nucleotides at a site

■ Recent approaches:

□ Integrate several sources of information

□ Use prior probabilities for a SNP at a given position (e.g. from dbSNP)

■ Further method: heuristic approach

□ Accounts for specific features of different sequencing platforms and different read alignment methods (VarScan)
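The early counting method can be sketched as below. The function name, the pileup representation and all thresholds (minimum quality 20, minimum depth 4, minimum allele fraction 0.2) are assumptions made for this illustration and do not come from the slides or from any particular tool.

```python
from collections import Counter


def call_site(ref_base, pileup, min_qual=20, min_depth=4, min_fraction=0.2):
    """pileup: list of (base, phred_quality) pairs observed at one site.
    Returns a two-letter genotype such as 'AA', 'AG' or 'GG',
    or None if too few high-quality bases cover the site."""
    good = [base for base, qual in pileup if qual >= min_qual]
    if len(good) < min_depth:
        return None
    counts = Counter(good)
    # Non-reference alleles that are frequent enough to be believed
    alts = {b: c for b, c in counts.items()
            if b != ref_base and c / len(good) >= min_fraction}
    if not alts:
        return ref_base + ref_base                 # homozygous reference
    alt = max(alts, key=alts.get)
    ref_fraction = counts[ref_base] / len(good)
    if ref_fraction >= min_fraction:
        return ref_base + alt                      # heterozygous
    return alt + alt                               # homozygous variant


pileup = [("A", 30)] * 5 + [("G", 35)] * 5 + [("T", 5)] * 3
genotype = call_site("A", pileup)
```

In this example the three low-quality T calls are ignored, and the remaining bases split evenly between A and G, so the site is reported as heterozygous A/G. Fixed cut-offs like these are what the probabilistic approaches on the slide improve upon.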

Challenges with Variant Calling

■ Presence of indels

■ Errors from library preparation

■ Variable quality scores with higher error rates

Identify “true” variants rather than alignment and/or sequencing errors

Minimize the number of false positives

Genome Analysis Toolkit (GATK)

■ A software package for the analysis of high-throughput sequencing (HTS) data

■ Offers a variety of tools, with a focus on variant discovery and genotyping

■ Reads-to-variants workflow:

GATK's HaplotypeCaller

[Figure: SAM / BAM File → HaplotypeCaller → VCF File]

■ Is capable of calling SNPs and indels simultaneously

HaplotypeCaller STEP 1: Identify Active Regions

■ Go through the reference genome with a sliding window

■ Count indels and mismatches

■ Memorize regions to operate on
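The sliding-window idea of STEP 1 can be sketched as below. This is a simplified illustration under stated assumptions, not GATK's actual implementation: `event_counts`, `window` and `threshold` are hypothetical names and values, and the per-position counts are assumed to have been extracted from the aligned reads beforehand.

```python
def find_active_regions(event_counts, window=4, threshold=3):
    """Find regions with many mismatches/indels using a sliding window.

    event_counts[i]: number of mismatches and indels that the aligned reads
    show at reference position i (assumed to be precomputed).
    Returns half-open (start, end) intervals; overlapping or adjacent
    windows that pass the threshold are merged into one region.
    """
    regions = []
    n = len(event_counts)
    for start in range(max(n - window + 1, 1)):
        end = min(start + window, n)
        if sum(event_counts[start:end]) >= threshold:
            if regions and start <= regions[-1][1]:
                # Extend the previously memorized region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


counts = [0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(find_active_regions(counts))
```

Only the memorized regions are passed on to the later, more expensive steps, so the rest of the genome is skipped.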

HaplotypeCaller STEP 2: Assemble Plausible Haplotypes

■ Local re-assembly

■ Build a De Bruijn-like graph

■ Prune according to a threshold

■ Traverse the graph to collect the most likely haplotypes

■ Align haplotypes to the reference genome using the Smith-Waterman algorithm (SWA)
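A minimal sketch of the graph-based re-assembly follows. It assumes error-free k-mer bookkeeping on toy reads, an acyclic graph, and hypothetical names and parameters (`k`, `min_count`, `max_len`). The real HaplotypeCaller graph handling is considerably more involved, and the final Smith-Waterman alignment is omitted here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build edges between (k-1)-mers, weighted by k-mer occurrence count."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


def prune(edges, min_count):
    """Drop edges seen fewer than min_count times (likely sequencing errors)."""
    return {edge: n for edge, n in edges.items() if n >= min_count}


def enumerate_haplotypes(edges, source, sink, max_len=1000):
    """Spell out every source-to-sink path as a candidate haplotype.

    Assumes an acyclic graph; max_len is a guard against runaway paths.
    """
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
    haplotypes = []
    stack = [(source, source)]
    while stack:
        node, seq = stack.pop()
        if node == sink:
            haplotypes.append(seq)
            continue
        if len(seq) >= max_len:
            continue
        for nxt in successors[node]:
            # Each step along an edge appends one base to the sequence.
            stack.append((nxt, seq + nxt[-1]))
    return sorted(haplotypes)


# Toy reads: three reference-like, two carrying a SNP, one with an error.
reads = ["ATGCCGTA"] * 3 + ["ATGCAGTA"] * 2 + ["ATGCTGTA"]
graph = prune(build_de_bruijn(reads, k=4), min_count=2)
print(enumerate_haplotypes(graph, source="ATG", sink="GTA"))
```

In this toy example the pruning threshold removes the branch that is supported by a single read, so only two candidate haplotypes remain.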

HaplotypeCaller STEP 3: Determine per-read Likelihoods

■ Determine the likelihoods of the haplotypes given the read data

■ For each active region, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm

■ Produces a matrix of likelihoods of the haplotypes given the read data

HaplotypeCaller Recap: Statistical Models

Markov Chain

$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i}$

Hidden Markov Model (HMM)

$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
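The Markov property means that the probability of a whole sequence factorizes into an initial probability and a product of transition probabilities. A small sketch, with made-up states and probabilities chosen purely for illustration:

```python
import math


def sequence_log_probability(sequence, initial, transitions):
    """Log-probability of a state sequence under a first-order Markov chain.

    initial[s]: probability of starting in state s.
    transitions[k][l]: transition probability a_kl from state k to state l.
    """
    log_p = math.log(initial[sequence[0]])
    for prev, curr in zip(sequence, sequence[1:]):
        # Each state depends only on its direct predecessor.
        log_p += math.log(transitions[prev][curr])
    return log_p


# Toy two-state chain (values are assumptions, not taken from the slides).
initial = {"A": 0.5, "B": 0.5}
transitions = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.5, "B": 0.5}}
print(math.exp(sequence_log_probability("AAB", initial, transitions)))
```

Working in log space avoids numerical underflow when sequences get long.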

HaplotypeCaller STEP 3: Determine per-read Likelihoods

■ Match (M): emits an aligned pair with probability p

■ Ix, Iy: emit symbol $x_i$ or $y_i$ against a gap with probability p

■ All transition probabilities leaving each state must sum to 1

■ Empirical gap penalties are derived from the data by Base Quality Score Recalibration (BQSR)

■ Base mismatch penalties are the base quality scores

■ Result: a matrix with the likelihoods of the haplotypes given the reads
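The per-read likelihood computation can be sketched as a forward algorithm over the three states M, Ix and Iy. This is a simplified illustration and not GATK's PairHMM code: the gap-open and gap-extension probabilities are fixed, assumed constants rather than BQSR-derived values, gap states emit with probability 1, and the read may start anywhere on the haplotype.

```python
def pair_hmm_likelihood(read, quals, haplotype, gap_open=1e-3, gap_ext=0.1):
    """Approximate P(read | haplotype) by summing over all alignments.

    quals: Phred base qualities of the read; they set the mismatch penalty.
    gap_open / gap_ext: assumed constant gap probabilities (hypothetical).
    """
    n, m = len(read), len(haplotype)
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    ins = [[0.0] * (m + 1) for _ in range(n + 1)]   # read base against a gap
    dele = [[0.0] * (m + 1) for _ in range(n + 1)]  # haplotype base against a gap
    for j in range(m + 1):
        dele[0][j] = 1.0 / m  # the read may start at any haplotype position
    for i in range(1, n + 1):
        err = 10 ** (-quals[i - 1] / 10)  # Phred score -> error probability
        for j in range(1, m + 1):
            emit = 1 - err if read[i - 1] == haplotype[j - 1] else err / 3
            # Transitions out of each state sum to 1:
            # M: (1 - 2*gap_open) + gap_open + gap_open; gap: gap_ext + (1 - gap_ext).
            match[i][j] = emit * (
                (1 - 2 * gap_open) * match[i - 1][j - 1]
                + (1 - gap_ext) * (ins[i - 1][j - 1] + dele[i - 1][j - 1])
            )
            ins[i][j] = gap_open * match[i - 1][j] + gap_ext * ins[i - 1][j]
            dele[i][j] = gap_open * match[i][j - 1] + gap_ext * dele[i][j - 1]
    # The read may end anywhere on the haplotype.
    return sum(match[n][j] + ins[n][j] for j in range(1, m + 1))


read, quals = "ACGT", [30, 30, 30, 30]
for hap in ("TTACGTTT", "TTACCTTT"):
    print(hap, pair_hmm_likelihood(read, quals, hap))
```

Running this for every read against every candidate haplotype fills the likelihood matrix that STEP 4 consumes.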

HaplotypeCaller STEP 4: Genotype Sample

■ Assign genotypes to individual samples based on the allele likelihoods

■ Apply Bayes' theorem to calculate the likelihoods of each possible genotype, and select the most likely one

HaplotypeCaller Recap: Probabilistic Model

Bayes' Rule

$P(\text{hypothesis} \mid \text{data}) = \dfrac{P(\text{data} \mid \text{hypothesis}) \, P(\text{hypothesis})}{P(\text{data})}$
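Applied to genotyping, the hypothesis is a genotype and the data are the reads. The sketch below assumes a diploid sample, a uniform prior unless one is supplied, and per-read allele likelihoods that are already known. The function and parameter names are hypothetical and the numbers are toy values.

```python
from itertools import combinations_with_replacement


def genotype_posteriors(read_likelihoods, alleles, priors=None):
    """Posterior probability of each diploid genotype via Bayes' rule.

    read_likelihoods: list of dicts mapping allele -> P(read | allele).
    priors: optional dict mapping genotype tuple -> prior probability;
            a uniform prior is assumed if omitted.
    """
    genotypes = list(combinations_with_replacement(alleles, 2))
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}
    unnormalized = {}
    for a1, a2 in genotypes:
        likelihood = 1.0
        for per_read in read_likelihoods:
            # A read stems from either chromosome copy with probability 1/2.
            likelihood *= 0.5 * per_read[a1] + 0.5 * per_read[a2]
        unnormalized[(a1, a2)] = likelihood * priors[(a1, a2)]
    evidence = sum(unnormalized.values())  # P(data)
    return {g: p / evidence for g, p in unnormalized.items()}


# Toy data: five reads favour the reference allele, five the alternate.
reads = [{"ref": 0.99, "alt": 0.01}] * 5 + [{"ref": 0.01, "alt": 0.99}] * 5
posteriors = genotype_posteriors(reads, alleles=["ref", "alt"])
print(max(posteriors, key=posteriors.get))
```

A non-uniform prior, for example one informed by known variant sites, can be passed in through `priors` without changing the rest of the calculation.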

3.4 Data Annotation

Data Annotation in the Pipeline

Get Information from SNPs

After variant calling, we know where the variants are.

Now find out what these variants are.

Several thousand variants cannot be analyzed manually.

Tools for automated variant annotation exist (e.g. ANNOVAR).

Already known SNPs can be filtered using information from dbSNP or the 1000 Genomes Project.

dbSNP = Single Nucleotide Polymorphism Database

Free public archive for genetic variation
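Filtering already known SNPs amounts to a lookup of each called variant in a set of known sites. The sketch below uses invented toy records and a hypothetical in-memory set. In practice the known sites would be loaded from a dbSNP or 1000 Genomes Project file, and annotation tools such as ANNOVAR do considerably more than this.

```python
def split_known_and_novel(variants, known_sites):
    """Separate called variants into known and novel ones.

    variants: list of (chromosome, position, ref, alt) tuples.
    known_sites: set of tuples in the same form (assumed to be preloaded
                 from a public database).
    """
    known = [v for v in variants if v in known_sites]
    novel = [v for v in variants if v not in known_sites]
    return known, novel


# Invented example records, not real database entries.
known_sites = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T")}
called = [("chr1", 1000, "A", "G"), ("chr1", 2000, "G", "GA")]
known, novel = split_known_and_novel(called, known_sites)
print(len(known), "known,", len(novel), "novel")
```

Using a set keeps each lookup at constant time, which matters when several thousand variants are checked against millions of known sites.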

4. Analysis Results / Use Cases

Analysis Results in the Pipeline

Analysis Results

Different genome browsers (Ensembl Genome Browser, UCSC Genome Browser, VEGA Genome Browser)

Ensembl Variant Effect Predictor (online tool)

Use Cases

Personalized Medicine

Prescribe medicine that works for you

Detect genetic diseases early on

Learn how humans work

Why do we age? Can we switch it off?

Why are some people smarter than others? Might it be genes?

[1] http://www.tedxvienna.at/blog/personalize-this/

5. Outlook

Think about these Questions...

If genome sequencing were as easy and cheap as a blood test, would you:

digitize your genome?

want to know about biomarkers?

want to know about markers that indicate a disease to come?

digitize the genome of your unborn baby?

want to know the appearance of your unborn baby?

want to know that you will develop Alzheimer's disease before the age of 40?

6. Discussion

Do we really want every technological advance that will be possible in the future?

Literature

■ B. Langmead, C. Trapnell, M. Pop, S. Salzberg: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 2009

■ B. Langmead, S. Salzberg: Fast gapped-read alignment with Bowtie 2. Nature Methods, Vol. 9, No. 4, 2012

■ P. J. A. Cock et al.: The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, Vol. 38, No. 6, 2010

■ M. P. Dolled-Filhart et al.: Computational and bioinformatics frameworks for next-generation whole exome and genome sequencing. The Scientific World Journal, 2013

■ R. Nielsen et al.: Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, Vol. 12, No. 6, 2011, pp. 443-451