Jen Taylor Bioinformatics Team CSIRO Plant Industry

Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?


CSIRO. Newton Meeting July 2010 - Sequence coverage

Assumptions

• Every k-mer has an equal chance of being sequenced
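As a minimal sketch of what this assumption implies: if reads start uniformly at every k-mer position, the expected number of read starts is the same everywhere. The function name and the example figures below are illustrative assumptions, not values from the talk.

```python
def expected_starts_per_position(n_reads: int, genome_length: int, k: int) -> float:
    """Expected read starts at each position if every k-mer is equally likely."""
    # Number of distinct k-mer start positions on one strand.
    n_positions = genome_length - k + 1
    if n_positions <= 0:
        raise ValueError("k must not exceed the genome length")
    return n_reads / n_positions


# Hypothetical example: 1,000 reads of length 36 from a 10,035 bp sequence.
print(expected_starts_per_position(1_000, 10_035, 36))
```

Observed read density can then be compared against this flat expectation to look for deviations.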


Read density


Deviations from Assumptions?


Impacts on read coverage - Outline

• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
  • Mismatches
  • Multiple read mappings
• Hamming edit distances and k-mer space


Assumptions : Digestion

Illumina SOLiD

http://seq.molbiol.ru/sch_lib_fr.html


ChIPSeq

MNase Linker Digest

Sequence & Align

Remove Nucleosomes


ChIPSeq - Nucleosome

Sample:

MNase digested

Size fractionated

Control:

MNase digested

Random sizes


araTha9 Aligned Reads 36-mer Monomer Composition


araTha9 Aligned Reads 5’ +/- 16 bp Monomer Composition


MNase Site Preferencing

Flick et al., J. Mol. Biology 1986


araTha9 Control MNase Site Preferencing

Sequence    Occurrences    Sequence Starts    Preference (%)
ctataggg            499                245             49.10
taataggg            864                424             49.10
gtattagg           1044                253             24.23
tctttgct           4902                425              8.67
cacattac           1807                 52              2.88
tcccagac            695                 20              2.88
aaacaaca          10083                159              1.58
acacgagc            810                  2              0.25
tttgtttt          32186                 35              0.19
tttgcata           4602                  5              0.11
ttggttta           7671                  1              0.01
gaggtttt           3926                  0              0
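The Preference (%) column appears to be the number of sequence starts expressed as a percentage of genomic occurrences. The sketch below makes that arithmetic explicit for three rows of the table; the function name is a hypothetical helper and the formula is an assumption inferred from the table values.

```python
def site_preference(occurrences: int, starts: int) -> float:
    """Percentage of genomic occurrences of a sequence at which a read starts."""
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return 100.0 * starts / occurrences


# (sequence, occurrences, sequence starts) taken from the table above.
rows = [
    ("ctataggg", 499, 245),
    ("gtattagg", 1044, 253),
    ("tctttgct", 4902, 425),
]
for sequence, occurrences, starts in rows:
    print(f"{sequence} {site_preference(occurrences, starts):.2f}")
```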


ChIPSeq

MNase Digest

Sequence & Align

Remove Nucleosomes




Nucleosome potentials – Read Density

[Figure: Normalised Read Density (0 to 1) against Base Coordinate; scale bar 1 Kb]


Nucleosome potentials

[Figure: MNase Potential and Normalised Read Density, axis 0.00 to 1.00]




Nucleosome potential


MNase biases aiding interpretation?

• Can aid identification in a local sequence?
  • Dependent upon local sequence context

• Cautionary tale about analysing sequence contexts of ChIPSeq data

• Nucleotide composition analyses must take into account digestion preferencing
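One way to take digestion preferencing into account, sketched here under assumptions, is to compute per-position base composition for windows around read starts in both the sample and the MNase-digested control, then report the sample-to-control ratio. The function names, the pseudocount, and the use of a simple ratio are illustrative choices, not the method used in the talk.

```python
from collections import Counter

BASES = "ACGT"


def position_composition(windows):
    """Per-position base fractions for a list of equal-length sequence windows."""
    if not windows:
        raise ValueError("need at least one window")
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("windows must share one length")
    composition = []
    for i in range(length):
        counts = Counter(w[i].upper() for w in windows)
        total = sum(counts[b] for b in BASES)  # ignore N and other symbols
        composition.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return composition


def composition_ratio(sample, control, pseudo=1e-9):
    """Sample-to-control ratio of base fractions at each window position."""
    return [
        {b: (s[b] + pseudo) / (c[b] + pseudo) for b in BASES}
        for s, c in zip(sample, control)
    ]


# Toy windows (hypothetical data): the sample is all A at position 0, the control half A.
sample = position_composition(["AC", "AG"])
control = position_composition(["AC", "TC"])
print(composition_ratio(sample, control)[0]["A"])
```

A ratio near 1 at a position suggests the sample composition there is explained by the digestion preference seen in the control.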


Impacts on read coverage - Outline

• Sample preparation
  • MNase Digestion


• Mismatches
• Multiple read mappings
• Hamming edit distances and k-mer space


Hamming Edit Distances

• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k

• For all possible k-mers (k = 36, 65) in the Arabidopsis genome
  • All vs. all, both strands
  • Minimum HE distance

Target Sequence          C G T A C A T G C
Probe Sequence           C G T T C A G G C
Substitution Required    N N N Y N N Y N N

Hamming distance: 2
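The definition above can be sketched in a few lines. The brute-force all-vs-all search below is only an illustration for short sequences: it is quadratic in the number of k-mers and is an assumption about how a minimum HE distance could be computed, not the implementation used for the Arabidopsis genome. All function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def min_hamming_distances(genome: str, k: int) -> list:
    """Minimum HE distance from each k-mer to any k-mer at another position, both strands."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    revs = [reverse_complement(x) for x in kmers]
    result = []
    for i, query in enumerate(kmers):
        best = k  # upper bound: every base substituted
        for j in range(len(kmers)):
            if j == i:
                continue  # skip the k-mer's own position
            best = min(best, hamming(query, kmers[j]), hamming(query, revs[j]))
        result.append(best)
    return result


# The target/probe pair from the slide.
print(hamming("CGTACATGC", "CGTTCAGGC"))
```

A minimum distance of 0 marks a k-mer that is repeated exactly elsewhere; larger values mark k-mers that tolerate more mismatches before becoming ambiguous.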


Arabidopsis Minimum Hamming Edit Distances, 36-mer


Alignment issues

[Figure: genomes hg18, dm3, araTha9, ce6 and sacCer6 compared on an axis from 0 to 14]


Alignment artefacts : aligner properties

Columns: Aligner | Mismatch | Read length | Genome pre-processing | Reads pre-processing | Uses quality score | Reports unmapped reads | Multithread

SOAP 0-5 60

SOAP2 0-5 1 ?

Maq 1-3 2 ?

Bowtie 0-3 3 1024

Ubsalign 0-20 1024 4 5


Breakdown of sequencing run

                             Reads         Percentage
Total Sequences              76,034,736    100%
Total Unique Sequences       33,188,251    44%
Mapped to unique location    22,807,050    30%
Failed mapping               10,381,201    14%
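The percentages in this breakdown are each count divided by the total number of sequences, and the mapped and failed counts sum to the number of unique sequences. A small sketch of that arithmetic follows; the helper name is hypothetical.

```python
def as_percent(count: int, total: int) -> int:
    """Share of the total, rounded to a whole percentage."""
    return round(100 * count / total)


breakdown = {
    "Total Sequences": 76_034_736,
    "Total Unique Sequences": 33_188_251,
    "Mapped to unique location": 22_807_050,
    "Failed mapping": 10_381_201,
}
total = breakdown["Total Sequences"]
for label, count in breakdown.items():
    print(f"{label}: {count:,} ({as_percent(count, total)}%)")

# Mapped plus failed reads account for every unique sequence.
unique_accounted = breakdown["Mapped to unique location"] + breakdown["Failed mapping"]
print(unique_accounted == breakdown["Total Unique Sequences"])
```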


Hamming edits and Ubsalign HE difference

Read: AGATTAGCCTGGTACTGCTA

Case 1 (2H): No Alignment
  …..AGCTTAGCCTGGTACTGGTA….    2 H
  …..AGCTTAGCCGGGTACTGGTA….    3

Case 2 (2H): No Alignment
  …..AGATTAGCCTGGTACTGCTA….    0 H
  …..AGCTTAGCCGGGTACTGCTA….

Case 3 (2H): Alignment !
  …..AGCTTAGCCTGGTACTGCTA….    1 H
  …..AGCTTAGCCGGGTTCTGGTA….
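The three cases can be checked with a short sketch. The acceptance rule used here (accept the best hit only when the next-best hit is more than 2 Hamming edits worse) is an assumption chosen because it reproduces the three outcomes on the slide; it is not confirmed as the rule Ubsalign applies. Function and parameter names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def accept_alignment(read: str, hit_a: str, hit_b: str, min_gap: int = 2) -> bool:
    """Accept the better hit only if the other hit is more than min_gap edits worse."""
    best, runner_up = sorted((hamming(read, hit_a), hamming(read, hit_b)))
    return (runner_up - best) > min_gap  # assumed rule, see lead-in


read = "AGATTAGCCTGGTACTGCTA"
cases = [
    ("AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"),
    ("AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"),
    ("AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"),
]
for first, second in cases:
    print(hamming(read, first), hamming(read, second), accept_alignment(read, first, second))
```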


Testing Aligner Accuracy

• Simulated reads
  • Known correct location
  • 25 million, 50 million
  • Perfect match, up to 5 mismatches, up to 10 mismatches
  • Error 3’ bias

• Numbers of:
  • Correctly aligned reads
  • Incorrectly aligned reads
  • Unalignable reads

• Speed
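A test of this kind can be sketched as follows: draw reads from known positions, add mismatches that favour the 3’ end, and tally the aligner's reported positions against the truth. The linear 3’ weighting, the function names and the toy reference are assumptions made for illustration; the talk does not specify the error model.

```python
import random


def simulate_read(reference: str, k: int, max_mismatches: int, rng: random.Random):
    """Return (true_start, read) with up to max_mismatches 3'-biased substitutions."""
    start = rng.randrange(len(reference) - k + 1)
    read = list(reference[start:start + k])
    n_mismatches = rng.randint(0, min(max_mismatches, k))
    # Assumed error model: substitution weight rises linearly towards the 3' end.
    weights = [i + 1 for i in range(k)]
    positions = set()
    while len(positions) < n_mismatches:
        positions.add(rng.choices(range(k), weights=weights)[0])
    for p in positions:
        read[p] = rng.choice([b for b in "ACGT" if b != read[p]])
    return start, "".join(read)


def tally(true_starts, reported_starts):
    """Count correct, incorrect and unaligned reads (None means unaligned)."""
    counts = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for truth, reported in zip(true_starts, reported_starts):
        if reported is None:
            counts["unaligned"] += 1
        elif reported == truth:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(500))  # toy reference
start, read = simulate_read(reference, 36, 5, rng)
print(start, read)
print(tally([1, 2, 3], [1, 5, None]))
```

The reported positions passed to `tally` would come from whichever aligner is under test.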


Alignment artefacts: Managing mismatch thresholds

[Figure: 50 Million Reads Accuracy - Correct. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]


Alignment artefacts: Managing mismatch thresholds

[Figure: 50 Million Reads Accuracy - Unaligned. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]


How does this affect interpretation?

• Incorporation of edit differentials
  • Leads to gains in the number of alignable reads

• Increased information
  • Determination of the alignment
  • Gains of 5 - 10% in mappable sites

• Hamming edit distributions provide useful information

Impact of MNase digestion on short read sequence coverage


Hamming distance variability


Read Deserts


Read Deserts


Sequence deserts


Impacts on read coverage - Conclusions

• Sample preparation
  • MNase Digestion
  • Local biases present


• Mismatches – thresholds generally too low relative to the uniqueness of k-mers in the genome

• Multiple read mappings – can drive ‘absence’ of mapped reads

• Hamming edit distances and k-mer space
  • K-mers have unique and genome-specific properties
  • Can be used to inform results of alignment
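To illustrate how multiple read mappings can produce an apparent absence of reads, the sketch below flags positions whose k-mer occurs more than once. Reads from such positions cannot be placed uniquely, so a unique-mapping filter leaves them empty. It considers exact matches on the forward strand only, which is a simplifying assumption, and the function name is hypothetical.

```python
from collections import Counter


def unique_kmer_mask(genome: str, k: int) -> list:
    """True where the k-mer starting at that position occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


# Toy sequence: the repeated AAA k-mer makes the first two positions non-unique.
print(unique_kmer_mask("AAAACGT", 3))
```

Runs of non-unique positions in such a mask are candidate read deserts that reflect the genome's k-mer space rather than the biology of the sample.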


Acknowledgements

CSIRO PI Bioinformatics Team

Andrew Spriggs

Stuart Stephen

Emily Ying

Jose Robles

Michael James

CSIRO Prog X

Chris Helliwell

Frank Gubler

Liz Dennis

CSIRO Transformational Biology Capability Platform

David Lovell

Mark Morrison

CMIS / TBCP

Paul Greenfield


Paired end data – sample preparation

[Figure: paired-end sample preparation; labels: CG, AT, insert]


Control and sample read density

[Figure: read density for Control and Sample]
