Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor Bioinformatics Team CSIRO Plant Industry

Page 1: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?

Jen Taylor

Bioinformatics Team

CSIRO Plant Industry

Page 2: Jen Taylor Bioinformatics Team CSIRO Plant Industry

CSIRO. Newton Meeting July 2010 - Sequence coverage

Assumptions

• Every k-mer has equal chance of being sequenced
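What this assumption implies can be put as simple arithmetic: if every k-mer start is equally likely, the expected read count is the same at every position. The sketch below is illustrative only; the function names and the example numbers are assumptions and do not come from the talk.

```python
def expected_starts_per_position(n_reads: int, genome_length: int, k: int) -> float:
    """Expected number of reads starting at any one position when every
    k-mer start is equally likely (single strand, linear sequence)."""
    n_positions = genome_length - k + 1
    if n_positions <= 0:
        raise ValueError("k must not exceed the genome length")
    return n_reads / n_positions


def expected_depth(n_reads: int, genome_length: int, k: int) -> float:
    """Approximate mean per-base depth, ignoring edge effects."""
    return n_reads * k / genome_length


# Hypothetical numbers, chosen only to show the arithmetic.
print(expected_starts_per_position(1_000, 1_035, 36))  # 1.0
print(expected_depth(1_000, 1_000, 36))                # 36.0
```

Observed read density that departs strongly from these flat expectations is the kind of deviation the following slides examine.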

Page 3: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Read density

Page 4: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Deviations from Assumptions?

Page 5: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Impacts on read coverage - Outline

• Sample preparation
    • MNase Digestion
• Alignment
    • Parameter choices
        • Mismatches
        • Multiple read mappings
• Hamming edit distances and k-mer space

Page 6: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Assumptions : Digestion

Illumina SOLiD

http://seq.molbiol.ru/sch_lib_fr.html

Page 7: Jen Taylor Bioinformatics Team CSIRO Plant Industry

ChIPSeq

MNase Linker Digest

Sequence & Align

Remove Nucleosomes

Page 8: Jen Taylor Bioinformatics Team CSIRO Plant Industry

ChIPSeq - Nucleosome

Sample:

MNase digested

Size fractionated

Control:

MNase digested

Random sizes

Page 9: Jen Taylor Bioinformatics Team CSIRO Plant Industry

araTha9 Aligned Reads: 36-mer Monomer Composition
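A monomer composition profile of this kind can be computed by tallying the base seen at each read position. The following is a minimal sketch, assuming equal-length reads held as plain strings; the function name and the toy reads are invented for illustration and are not the author's pipeline.

```python
from collections import Counter


def monomer_composition(reads):
    """Return, for each read position, the fraction of A, C, G and T.

    Assumes all reads have the same length (36 in the slide title).
    """
    if not reads:
        return []
    length = len(reads[0])
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read.upper()):
            counts[i][base] += 1
    n = len(reads)
    return [{b: c[b] / n for b in "ACGT"} for c in counts]


# Toy reads, invented for illustration.
profile = monomer_composition(["ACGT", "ACGA", "TCGA", "ACGA"])
print(profile[0])  # {'A': 0.75, 'C': 0.0, 'G': 0.0, 'T': 0.25}
```

Under the equal-chance assumption the profile should be flat along the read; a skew at particular positions points to a position-specific bias.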

Page 10: Jen Taylor Bioinformatics Team CSIRO Plant Industry

araTha9 Aligned Reads: 5’ +/- 16 bp Monomer Composition

Page 11: Jen Taylor Bioinformatics Team CSIRO Plant Industry

MNase Site Preferencing

Flick et al., J. Mol. Biology 1986

Page 12: Jen Taylor Bioinformatics Team CSIRO Plant Industry

araTha9 Control MNase Site Preferencing

Sequence    Occurrences    Sequence Starts    Preference (%)
ctataggg            499                245             49.10
taataggg            864                424             49.10
gtattagg           1044                253             24.23
tctttgct           4902                425              8.67
cacattac           1807                 52              2.88
tcccagac            695                 20              2.88

aaacaaca          10083                159              1.58
acacgagc            810                  2              0.25
tttgtttt          32186                 35              0.19
tttgcata           4602                  5              0.11
ttggttta           7671                  1              0.01
gaggtttt           3926                  0              0
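Most rows of the table appear consistent with Preference (%) = Sequence Starts / Occurrences × 100. The sketch below shows one way such a table could be derived. It assumes that "Sequence Starts" counts the genomic occurrences of an 8-mer at which at least one read begins, that coordinates are 0-based, and that only one strand is scanned; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(genome: str, k: int = 8) -> Counter:
    """Count every k-mer occurrence along one strand of the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def start_counts(genome: str, read_starts, k: int = 8) -> Counter:
    """Count, per k-mer, the distinct sites at which at least one read starts."""
    sites = {s for s in read_starts if 0 <= s <= len(genome) - k}
    return Counter(genome[s:s + k] for s in sites)


def preference_percent(occurrences: int, starts: int) -> float:
    """Share of a k-mer's occurrences that are used as a read start."""
    return 100.0 * starts / occurrences if occurrences else 0.0


# Two rows from the table above.
print(round(preference_percent(499, 245), 2))   # 49.1
print(round(preference_percent(1044, 253), 2))  # 24.23
```

Because the control is MNase digested without size selection, a high preference for a k-mer here reflects the enzyme's site preference rather than nucleosome position.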

Page 13: Jen Taylor Bioinformatics Team CSIRO Plant Industry

ChIPSeq

MNase Digest

Sequence & Align

Remove Nucleosomes

Page 14: Jen Taylor Bioinformatics Team CSIRO Plant Industry

araTha9 Control MNase Site Preferencing

Sequence    Occurrences    Sequence Starts    Preference (%)
ctataggg            499                245             49.10
taataggg            864                424             49.10
gtattagg           1044                253             24.23
tctttgct           4902                425              8.67
cacattac           1807                 52              2.88
tcccagac            695                 20              2.88

aaacaaca          10083                159              1.58
acacgagc            810                  2              0.25
tttgtttt          32186                 35              0.19
tttgcata           4602                  5              0.11
ttggttta           7671                  1              0.01
gaggtttt           3926                  0              0

Page 15: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Nucleosome potentials – Read Density

[Figure: Normalised Read Density (y-axis, 0 to 1) against Base Coordinate (x-axis); scale bar 1 Kb]

Page 16: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Nucleosome potentials

[Figure: MNase Potential and Normalised Read Density; axis ticks 0.00 to 1.00]

Page 17: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Nucleosome potentials

[Figure: MNase Potential and Normalised Read Density; axis ticks 0.00 to 1.00]

Page 18: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Nucleosome potential

Page 19: Jen Taylor Bioinformatics Team CSIRO Plant Industry

MNase biases aiding interpretation?

• Can aid identification in a local sequence?
    • Dependent upon local sequence context
• Cautionary tale about analysing sequence contexts of ChIPSeq data
• Nucleotide composition analyses must take into account digestion preferencing

Page 20: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Impacts on read coverage - Outline

• Sample preparation
    • MNase Digestion
• Alignment
    • Parameter choices
        • Mismatches
        • Multiple read mappings
• Hamming edit distances and k-mer space

Page 21: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Hamming Edit Distances

• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k
• For all possible k-mers (k = 36, 65) in the Arabidopsis genome
    • All vs. all, both strands
• Minimum HE distance

Target Sequence          C G T A C A T G C
Probe Sequence           C G T T C A G G C
Substitution Required    N N N Y N N Y N N

Hamming distance: 2
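The definition and the worked example above can be reproduced in a few lines of Python. This is a minimal sketch; the function name is an assumption, and the two sequences are the target and probe from the slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


target = "CGTACATGC"
probe = "CGTTCAGGC"

# Per-position flags, Y where a substitution is required.
flags = "".join("Y" if x != y else "N" for x, y in zip(target, probe))
print(flags)                   # NNNYNNYNN
print(hamming(target, probe))  # 2
```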

Page 22: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Arabidopsis Minimum Hamming Edit Distances 36mer
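A minimum Hamming edit distance profile of this kind compares every k-mer with every other k-mer on both strands and keeps the smallest distance. The brute-force sketch below illustrates the idea on a toy sequence. It is quadratic in genome length, so it is not a practical method for a whole genome, and it is not claimed to be the approach used for the slide; all names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_hamming_distances(genome: str, k: int) -> list:
    """For each k-mer position, the smallest Hamming distance to a k-mer at
    any other position, on either strand. Brute force, toy inputs only."""
    fwd = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    rev = [reverse_complement(s) for s in fwd]
    result = []
    for i, query in enumerate(fwd):
        best = k  # a k-mer can differ from another at no more than k positions
        for j in range(len(fwd)):
            if j == i:
                continue  # skip the k-mer's own locus on both strands
            best = min(best, hamming(query, fwd[j]), hamming(query, rev[j]))
        result.append(best)
    return result


# Toy sequence, invented for illustration.
distances = min_hamming_distances("ACGTTT", 3)
print(distances)
```

A value of 0 marks a k-mer that is repeated elsewhere (possibly on the opposite strand), so a read from that site cannot be placed uniquely.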

Page 23: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Alignment issues

[Figure: chart comparing the genomes hg18, dm3, araTha9, ce6 and sacCer6; axis ticks 0 to 14]

Page 24: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Alignment artefacts : aligner properties

Columns: Mismatch | Read length | Genome pre-processing | Reads pre-processing | Uses quality score | Reports unmapped reads | Multithread

SOAP 0-5 60

SOAP2 0-5 1 ?

Maq 1-3 2 ?

Bowtie 0-3 3 1024

Ubsalign 0-20 1024 4 5

Page 25: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Breakdown of sequencing run

                                  Reads    Percentage
Total Sequences              76,034,736          100%
Total Unique Sequences       33,188,251           44%
Mapped to unique location    22,807,050           30%
Failed mapping               10,381,201           14%
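The percentages in this breakdown are all relative to the total number of sequences, and the mapped and failed counts add up to the unique sequences. A short check of that arithmetic, using the counts from the table (the variable and function names are illustrative):

```python
total = 76_034_736
unique = 33_188_251
mapped_unique_location = 22_807_050
failed_mapping = 10_381_201


def percent_of_total(n: int) -> int:
    """Share of all sequences, rounded to a whole percent."""
    return round(100 * n / total)


print(percent_of_total(unique))                  # 44
print(percent_of_total(mapped_unique_location))  # 30
print(percent_of_total(failed_mapping))          # 14
print(mapped_unique_location + failed_mapping == unique)  # True
```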

Page 26: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Hamming edits and Ubsalign HE difference

Read:            AGATTAGCCTGGTACTGCTA
Genome hit 1:    ...AGCTTAGCCTGGTACTGGTA...    (Hamming distance 2)
Genome hit 2:    ...AGCTTAGCCGGGTACTGGTA...    (Hamming distance 3)
HE difference:   2H
Result:          No Alignment

Page 27: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Hamming edits and Ubsalign HE difference

Read:            AGATTAGCCTGGTACTGCTA
Genome hit 1:    ...AGATTAGCCTGGTACTGCTA...    (Hamming distance 0)
Genome hit 2:    ...AGCTTAGCCGGGTACTGCTA...    (Hamming distance 2)
HE difference:   2H
Result:          No Alignment

Page 28: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Hamming edits and Ubsalign HE difference

Read:            AGATTAGCCTGGTACTGCTA
Genome hit 1:    ...AGCTTAGCCTGGTACTGCTA...    (Hamming distance 1)
Genome hit 2:    ...AGCTTAGCCGGGTTCTGGTA...    (Hamming distance 4)
HE difference:   2H
Result:          Alignment!
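The three examples are consistent with a rule in which a read is reported only when its second-best hit is more than two substitutions worse than its best hit. That rule is inferred from the slides and is not documented Ubsalign behaviour; the sketch below, with hypothetical function and parameter names, simply replays the three cases under that assumption.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_with_differential(read, candidates, differential=2):
    """Return the best candidate if the runner-up is more than `differential`
    substitutions worse, otherwise None (assumed rule, see lead-in)."""
    ranked = sorted(candidates, key=lambda c: hamming(read, c))
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0]
    gap = hamming(read, ranked[1]) - hamming(read, ranked[0])
    return ranked[0] if gap > differential else None


read = "AGATTAGCCTGGTACTGCTA"
cases = [
    ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],  # distances 2 and 3
    ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],  # distances 0 and 2
    ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],  # distances 1 and 4
]
for hits in cases:
    print(align_with_differential(read, hits))
# None
# None
# AGCTTAGCCTGGTACTGCTA
```

Note that under this rule even a perfect match is rejected when a second site lies only two substitutions away, which is the second case.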

Page 29: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Testing Aligner Accuracy

• Simulated reads
    • Known correct location
    • 25 million, 50 million
    • Perfect match, up to 5 mismatches, up to 10 mismatches
    • Error 3’ bias
• Numbers of:
    • Correctly aligned reads
    • Incorrectly aligned reads
    • Unalignable reads
• Speed
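A test of this kind can be sketched as follows: draw reads from known positions, add substitutions that favour the 3’ end, and classify whatever position an aligner reports. The linear position weighting, the function names and the toy genome are all assumptions made for illustration; the talk does not specify the error model.

```python
import random

BASES = "ACGT"


def simulate_read(genome: str, length: int, n_mismatches: int, rng: random.Random):
    """Sample one read with a known origin and 3'-biased substitutions."""
    if n_mismatches > length:
        raise ValueError("cannot place more mismatches than read positions")
    start = rng.randrange(len(genome) - length + 1)
    read = list(genome[start:start + length])
    # Weight position i by (i + 1) so that errors favour the 3' end.
    weights = [i + 1 for i in range(length)]
    positions = set()
    while len(positions) < n_mismatches:
        positions.add(rng.choices(range(length), weights=weights)[0])
    for pos in positions:
        read[pos] = rng.choice([b for b in BASES if b != read[pos]])
    return start, "".join(read)


def classify(true_start: int, reported_start):
    """Score one aligner result against the known origin of the read."""
    if reported_start is None:
        return "unaligned"
    return "correct" if reported_start == true_start else "incorrect"


rng = random.Random(0)            # fixed seed so the run is repeatable
genome = "ACGT" * 25              # toy genome, invented for illustration
start, read = simulate_read(genome, 20, 3, rng)
print(start, read)
```

Counting the three `classify` outcomes over all simulated reads gives the correct, incorrect and unalignable totals listed above.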

Page 30: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Alignment artefacts: Managing mismatch thresholds

[Figure: 50 Million Reads Accuracy - Correct. Percentage of total reads (y-axis, 0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]

Page 31: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Alignment artefacts: Managing mismatch thresholds

[Figure: 50 Million Reads Accuracy - Unaligned. Percentage of total reads (y-axis, 0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]

Page 32: Jen Taylor Bioinformatics Team CSIRO Plant Industry

How does this affect interpretation?

• Incorporation of edit differentials
    • Leads to gains in the number of alignable reads
• Increased information
    • Determination of the alignment
    • Gains of 5 - 10% in mappable sites
• Hamming edit distributions provide useful information

Impact of MNase digestion on short read sequence coverage

Page 33: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Hamming distance variability

Page 34: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Read Deserts

Page 35: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Read Deserts

Page 36: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Sequence deserts

Page 37: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Impacts on read coverage - Conclusions

• Sample preparation
    • MNase Digestion
        • Local biases present
• Alignment
    • Parameter choices
        • Mismatches – generally too low relative to uniqueness of k-mers in the genome
        • Multiple read mappings – can drive ‘absence’ of mapped reads
• Hamming edit distances and k-mer space
    • K-mers have unique and genome-specific properties
    • Can be used to inform results of alignment
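The point about multiple read mappings can be illustrated by marking where a genome's k-mers are not unique: if only uniquely mapped reads are kept, runs of such positions will show no reads regardless of the biology. The sketch below is a simplification that scans the forward strand only and requires exact repeats; the function names and the toy sequence are hypothetical.

```python
from collections import Counter


def unique_kmer_mask(genome: str, k: int) -> list:
    """True where the k-mer starting at that position occurs exactly once
    (forward strand only, exact matches only)."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[s] == 1 for s in kmers]


def expected_deserts(mask: list, min_length: int) -> list:
    """Half-open (start, end) runs of non-unique start positions that are at
    least min_length long."""
    runs = []
    run_start = None
    for i, is_unique in enumerate(mask + [True]):  # sentinel closes a final run
        if not is_unique and run_start is None:
            run_start = i
        elif is_unique and run_start is not None:
            if i - run_start >= min_length:
                runs.append((run_start, i))
            run_start = None
    return runs


# Toy sequence, invented for illustration.
mask = unique_kmer_mask("AAAAAACGT", 3)
print(expected_deserts(mask, 2))  # [(0, 4)]
```

Comparing observed read deserts against such a mask separates regions that cannot receive uniquely mapped reads from regions where the absence of reads needs another explanation.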

Page 38: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Acknowledgements

CSIRO PI Bioinformatics Team

Andrew Spriggs

Stuart Stephen

Emily Ying

Jose Robles

Michael James

CSIRO Prog X

Chris Helliwell

Frank Gubler

Liz Dennis

CSIRO Transformational Biology Capability Platform

David Lovell

Mark Morrison

CMIS / TBCP

Paul Greenfield

Page 39: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Paired end data – sample preparation

[Figure: paired-end library fragments; labels CG, AT, insert, insert]

Page 40: Jen Taylor Bioinformatics Team CSIRO Plant Industry

Control and sample read density

Control

Sample