Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?
Jen Taylor
Bioinformatics Team
CSIRO Plant Industry
CSIRO. Newton Meeting July 2010 - Sequence coverage
Assumptions
• Every k-mer has equal chance of being sequenced
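A minimal sketch of what this assumption implies, assuming reads start uniformly at random along a single sequence. The function names and the toy sizes are illustrative and not taken from the slides. Under the assumption, the expected number of read starts is the same at every position, and observed counts scatter around it by chance alone.

```python
import random
from collections import Counter


def expected_starts_per_position(genome_length, read_length, n_reads):
    """Expected read starts per position if every k-mer is equally likely."""
    return n_reads / (genome_length - read_length + 1)


def simulate_uniform_starts(genome_length, read_length, n_reads, seed=0):
    """Draw read start positions uniformly and return a count per position."""
    rng = random.Random(seed)
    n_positions = genome_length - read_length + 1
    starts = Counter(rng.randrange(n_positions) for _ in range(n_reads))
    return [starts.get(i, 0) for i in range(n_positions)]


counts = simulate_uniform_starts(genome_length=1000, read_length=36, n_reads=20000)
expected = expected_starts_per_position(1000, 36, 20000)
print(f"expected {expected:.2f} starts per position, observed range "
      f"{min(counts)} to {max(counts)}")
```

Systematic departures from this baseline, rather than random scatter, are what the following slides examine.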
Read density
Deviations from Assumptions?
Impacts on read coverage - Outline
• Sample preparation
  • MNase digestion
• Alignment
  • Parameter choices
    • Mismatches
    • Multiple read mappings
• Hamming edit distances and k-mer space
Assumptions: Digestion
[Figure: panels labelled Illumina and SOLiD. Source: http://seq.molbiol.ru/sch_lib_fr.html]
ChIPSeq
[Diagram labels: MNase Digest; Linker; Sequence & Align; Remove Nucleosomes]
ChIPSeq - Nucleosome
• Sample: MNase digested, size fractionated
• Control: MNase digested, random sizes
araTha9 Aligned Reads: 36-mer Monomer Composition
araTha9 Aligned Reads: 5’ +/- 16 bp Monomer Composition
MNase Site Preferencing (Flick et al., J. Mol. Biology 1986)
araTha9 Control MNase Site Preferencing
Sequence   Occurrences   Sequence Starts   Preference (%)
ctataggg           499               245            49.10
taataggg           864               424            49.10
gtattagg          1044               253            24.23
tctttgct          4902               425             8.67
cacattac          1807                52             2.88
tcccagac           695                20             2.88
aaacaaca         10083               159             1.58
acacgagc           810                 2             0.25
tttgtttt         32186                35             0.19
tttgcata          4602                 5             0.11
ttggttta          7671                 1             0.01
gaggtttt          3926                 0             0
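The preference column appears to be the share of a sequence's genomic occurrences at which a read starts. That reading is an inference from rows such as ctataggg (245 / 499), since the slides do not state the formula. A minimal sketch under that assumption, with hypothetical function names:

```python
def count_occurrences(genome, kmer):
    """Count occurrences of kmer in genome, allowing overlaps (one strand)."""
    count = 0
    pos = genome.find(kmer)
    while pos != -1:
        count += 1
        pos = genome.find(kmer, pos + 1)
    return count


def site_preference(occurrences, read_starts):
    """Percentage of occurrences at which a read starts (assumed definition)."""
    if occurrences == 0:
        return 0.0
    return 100.0 * read_starts / occurrences


print(round(site_preference(499, 245), 2))   # ctataggg row
print(round(site_preference(1044, 253), 2))  # gtattagg row
```

Counting here covers one strand only. Whether the original analysis counted both strands is not stated.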
ChIPSeq
[Diagram labels: MNase Digest; Sequence & Align; Remove Nucleosomes]
Nucleosome potentials – Read Density
[Figure: Normalised Read Density (0 to 1) against Base Coordinate; scale bar 1 Kb]
Nucleosome potentials
[Figure: MNase Potential and Normalised Read Density; axis scale 0.00 to 1.00]
Nucleosome potential
MNase biases aiding interpretation?
• Can aid identification in a local sequence?
  • Dependent upon local sequence context
• Cautionary tale about analysing sequence contexts of ChIPSeq data
  • Nucleotide composition analyses must take into account digestion preferencing
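One way to check for digestion preferencing before interpreting composition is to tabulate base frequencies at each offset around read 5’ ends, as in the +/- 16 bp monomer composition slide. The sketch below is illustrative: the function name is hypothetical, only forward-strand starts are handled, and the default window is taken from that slide title.

```python
from collections import Counter


def composition_around_starts(genome, starts, flank=16):
    """Per-offset base frequencies in a +/- flank window around read 5' ends."""
    counts = {off: Counter() for off in range(-flank, flank + 1)}
    for s in starts:
        if s - flank < 0 or s + flank >= len(genome):
            continue  # skip windows that run off the end of the sequence
        for off in range(-flank, flank + 1):
            counts[off][genome[s + off].upper()] += 1
    freqs = {}
    for off, c in counts.items():
        total = sum(c.values())
        freqs[off] = {b: (c[b] / total if total else 0.0) for b in "ACGT"}
    return freqs


# Toy example: two read starts on a short sequence, window of +/- 2 bp.
freqs = composition_around_starts("ACGTACGTAC", starts=[4, 5], flank=2)
print(freqs[0])
```

A flat profile across offsets is what the equal-chance assumption predicts. Peaks near offset 0 in the control would point to cleavage-site bias rather than biology.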
Hamming Edit Distances
• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k
• For all possible k-mers (k = 36, 65) in the Arabidopsis genome
  • All vs. all, both strands
  • Minimum HE distance
Target Sequence         C G T A C A T G C
Probe Sequence          C G T T C A G G C
Substitution Required   N N N Y N N Y N N
Hamming distance: 2
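The definition and the all-vs-all minimum can be sketched as follows. This is a brute-force illustration for toy sequences only, since a genome-scale run needs an indexed approach that the slides do not describe. The function names are hypothetical, and excluding a k-mer's own locus on both strands is an assumption.

```python
def hamming(a, b):
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_hamming_to_others(genome, k):
    """For each k-mer locus, the minimum distance to any other locus, both strands."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    result = []
    for i, query in enumerate(kmers):
        best = k  # upper bound: every position differs
        for j, other in enumerate(kmers):
            if i == j:
                continue  # assumption: a locus is not compared with itself
            best = min(best, hamming(query, other),
                       hamming(query, reverse_complement(other)))
        result.append(best)
    return result


print(hamming("CGTACATGC", "CGTTCAGGC"))  # the target/probe pair from the slide
```

A minimum of 0 marks a k-mer that is repeated elsewhere in the sequence. Larger minima mean more mismatches can be tolerated before a read from that locus becomes ambiguous.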
Arabidopsis Minimum Hamming Edit Distances: 36-mers
Alignment issues
[Figure: comparison across genomes hg18, dm3, araTha9, ce6 and sacCer6; axis scale 0 to 14]
Alignment artefacts: aligner properties
Columns: Aligner | Mismatch | Read length | Genome pre-processing | Reads pre-processing | Uses quality score | Reports unmapped reads | Multithread
SOAP       0-5    60
SOAP2      0-5    1 ?
Maq        1-3    2 ?
Bowtie     0-3    3 1024
Ubsalign   0-20   1024 4 5
Breakdown of sequencing run
                            Reads        Percentage
Total Sequences             76,034,736   100%
Total Unique Sequences      33,188,251   44%
Mapped to unique location   22,807,050   30%
Failed mapping              10,381,201   14%
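The percentages can be reproduced from the read counts, taking each as a share of total sequences. The short check below uses only the figures on the slide. It also shows that the mapped and failed counts sum to the number of unique sequences, which suggests that mapping was attempted on unique sequences only (an inference, not something the slide states).

```python
total = 76_034_736
unique = 33_188_251
mapped_unique = 22_807_050
failed = 10_381_201


def percent_of_total(n, total_reads=total):
    """Share of all sequenced reads, rounded to a whole percentage."""
    return round(100 * n / total_reads)


for label, n in [("Total Unique Sequences", unique),
                 ("Mapped to unique location", mapped_unique),
                 ("Failed mapping", failed)]:
    print(f"{label}: {percent_of_total(n)}%")

print("mapped + failed == unique:", mapped_unique + failed == unique)
```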
Hamming edits and Ubsalign HE difference
2H
Read:     AGATTAGCCTGGTACTGCTA
Site 1:   …..AGCTTAGCCTGGTACTGGTA….   Hamming 2
Site 2:   …..AGCTTAGCCGGGTACTGGTA….   Hamming 3
No Alignment
Hamming edits and Ubsalign HE difference
2H
Read:     AGATTAGCCTGGTACTGCTA
Site 1:   …..AGATTAGCCTGGTACTGCTA….   Hamming 0
Site 2:   …..AGCTTAGCCGGGTACTGCTA….   Hamming 2
No Alignment
Hamming edits and Ubsalign HE difference
2H
Read:     AGATTAGCCTGGTACTGCTA
Site 1:   …..AGCTTAGCCTGGTACTGCTA….   Hamming 1
Site 2:   …..AGCTTAGCCGGGTTCTGGTA….   Hamming 4
Alignment !
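One reading consistent with the three examples above is that a read is placed only when the gap between its best and second-best Hamming distances exceeds the 2H setting. The sketch below encodes that reading. It is an inference from the slides, not documented Ubsalign behaviour, and the function and parameter names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def unambiguous_site(read, sites, he_difference=2):
    """Return the best-matching site, or None if the runner-up is too close.

    Assumed rule: accept only when (second-best distance - best distance)
    is strictly greater than he_difference.
    """
    ranked = sorted((hamming(read, s), s) for s in sites)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][1]
    (best_d, best_site), (next_d, _) = ranked[0], ranked[1]
    return best_site if next_d - best_d > he_difference else None


read = "AGATTAGCCTGGTACTGCTA"
examples = [
    ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],  # distances 2 and 3
    ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],  # distances 0 and 2
    ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],  # distances 1 and 4
]
for sites in examples:
    print(unambiguous_site(read, sites))
```

Under this rule even a perfect match is rejected when a second site lies within two substitutions. That matches the second example.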
Testing Aligner Accuracy
• Simulated reads
  • Known correct location
  • 25 million, 50 million
  • Perfect match, up to 5 mismatches, up to 10 mismatches
  • Error 3’ bias
• Numbers of:
  • Correctly aligned reads
  • Incorrectly aligned reads
  • Unalignable reads
• Speed
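A test set of this kind could be generated and scored roughly as sketched below. The linear 3’ weighting of error positions, the function names and the toy genome are all assumptions for illustration. The slides state only that errors were biased towards the 3’ end.

```python
import random


def simulate_read(genome, read_length, max_mismatches, rng):
    """Sample a read at a known start and add up to max_mismatches substitutions.

    Error positions are drawn with linearly increasing weight towards the
    3' end (an assumed error model).
    """
    start = rng.randrange(len(genome) - read_length + 1)
    bases = list(genome[start:start + read_length])
    n_errors = rng.randint(0, min(max_mismatches, read_length))
    weights = [i + 1 for i in range(read_length)]
    positions = set()
    while len(positions) < n_errors:
        positions.add(rng.choices(range(read_length), weights=weights)[0])
    for p in positions:
        bases[p] = rng.choice([b for b in "ACGT" if b != bases[p]])
    return start, "".join(bases)


def score_alignments(true_starts, reported_starts):
    """Tally correct, incorrect and unaligned reads (None means unaligned)."""
    tally = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for truth, reported in zip(true_starts, reported_starts):
        if reported is None:
            tally["unaligned"] += 1
        elif reported == truth:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(500))
start, read = simulate_read(genome, read_length=36, max_mismatches=5, rng=rng)
print(start, read)
print(score_alignments([10, 20, 30], [10, 25, None]))
```

Because each simulated read carries its true origin, every aligner result can be classed as correct, incorrect or unaligned. Timing the aligner on the same set gives the speed measure.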
Alignment artefacts: Managing mismatch thresholds
[Figure: 50 Million Reads Accuracy - Correct. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]
Alignment artefacts: Managing mismatch thresholds
[Figure: 50 Million Reads Accuracy - Unaligned. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]
How does this affect interpretation?
• Incorporation of edit differentials
  • Leads to gains in the number of alignable reads
• Increased information
  • Determination of the alignment
  • Gains of 5 - 10% in mappable sites
• Hamming edit distributions provide useful information
Impact of MNase digestion on short read sequence coverage
Hamming distance variability
Read Deserts
Sequence deserts
Impacts on read coverage - Conclusions
• Sample preparation
  • MNase digestion
    • Local biases present
• Alignment
  • Parameter choices
    • Mismatches: thresholds generally too low relative to the uniqueness of k-mers in the genome
    • Multiple read mappings: can drive ‘absence’ of mapped reads
• Hamming edit distances and k-mer space
  • K-mers have unique and genome-specific properties
  • Can be used to inform results of alignment
Acknowledgements
CSIRO PI Bioinformatics Team
Andrew Spriggs
Stuart Stephen
Emily Ying
Jose Robles
Michael James
CSIRO Prog X
Chris Helliwell
Frank Gubler
Liz Dennis
CSIRO Transformational Biology Capability Platform
David Lovell
Mark Morrison
CMIS / TBCP
Paul Greenfield
Paired end data – sample preparation
[Diagram labels: CG; AT; insert; insert]
Control and sample read density
[Figure panels: Control; Sample]