Quantitative HTS - Outline
• Technology
• Preprocessing
• Quantitative analysis
• Applications
– ChIP-Seq
– RNA-Seq
– Methyl-Seq
The Technology
• Most sequencing proceeds by addition of fluor-labeled bases
• Do this in parallel on a flat surface
• Capture each stage with good camera
• Align images
Issues
• Pre-processing
– Base calling
– Mapping reads
– QA
• Quantitative analysis
– Variation and noise
– Biases
– Models
– Accuracy and validation
Pre-processing – Base Calling
• Not all steps completed properly
• Sequence can lag behind or skip ahead
• Hence most light spots a mixture of different colors
• Simple rule: use brightest signal
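The simple rule can be sketched in a few lines of Python. This is only an illustration: the four-channel intensity tuples, the channel order A, C, G, T and the "purity" score are assumptions made for the example, not part of any vendor's base caller.

```python
# Minimal sketch of "use brightest signal" base calling.
# Assumption: each cycle gives four intensities in the order A, C, G, T.
BASES = "ACGT"

def call_base(intensities):
    """Return the base with the brightest channel and a crude purity score."""
    total = sum(intensities)
    best = max(range(4), key=lambda i: intensities[i])
    # Purity = share of the total light in the winning channel (hypothetical QC score).
    purity = intensities[best] / total if total > 0 else 0.0
    return BASES[best], purity

def call_read(cycles):
    """Call one base per cycle for a single spot on the surface."""
    return "".join(call_base(c)[0] for c in cycles)

# Three cycles for one spot; later cycles are more mixed, as lag and skip accumulate.
cycles = [
    (900, 50, 30, 20),
    (100, 80, 700, 120),
    (200, 250, 300, 650),
]
read = call_read(cycles)
print(read, [round(call_base(c)[1], 2) for c in cycles])
```

A falling purity score along the read is one way to see why errors concentrate in later positions.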
Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased
[Figure: bar chart of counts (0 to 800,000) by mismatch type: single-base insertions and deletions (Insert/Delete A, C, G, T), the twelve substitutions (C>G, T>G, A>T, A>G, C>T, T>C, T>A, C>A, G>C, A>C, G>A, G>T), and any single mismatch]
Courtesy Thierry-Mieg
Position of single mismatch in uniquely mapped tags
[Figure: counts (0 to 60,000) against position of single mismatch (0 to 36), for sample 1 and sample 2]
Courtesy Thierry-Mieg
Pre-processing – Mapping Reads
• Huge numbers of reads (10M – 70M)
• BLAT (2002 high-speed method)
• Eland (proprietary Illumina)
• Other new methods: MAQ, SOAP
Quality Assessment
• Fraction of reads mapping to targets
• Typically 5-10M reads per lane and 60-80% map to targets
• Some repetitive sequence
Comparing Samples - A Simple Normalization
• Different numbers of counts per lane
• Divide counts in a region of interest (a genomic region, a gene or an exon) by all counts, in millions (total per million reads, TPM)
• For comparing genomic regions of different lengths, divide also by length of region in kilobases: TPKM (total per kilobase per million)
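As a worked example of the two scalings above, here is a minimal Python sketch. The function names and the example numbers are made up for illustration.

```python
def tpm(count, total_reads):
    """Counts in a region per million reads in the lane."""
    return count / (total_reads / 1e6)

def tpkm(count, total_reads, region_length_bp):
    """TPM further divided by region length in kilobases."""
    return tpm(count, total_reads) / (region_length_bp / 1e3)

# Hypothetical lane: 10 million reads, 500 of them in a 2,000 bp region.
region_tpm = tpm(500, 10_000_000)
region_tpkm = tpkm(500, 10_000_000, 2_000)
print(region_tpm, region_tpkm)
```

With these numbers the region has 50 counts per million reads, and 25 per kilobase per million.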
Quant. Analysis - Variation
• Poisson model often used for random variation
• Most HTS data ‘over-dispersed’ relative to Poisson
• Negative Binomial often used – Parameter fitted
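One quick way to see over-dispersion is to compare the sample variance of replicate counts with their mean, since a Poisson model implies the two are equal. The sketch below assumes the common negative binomial parameterisation variance = mean + phi * mean^2 and fits phi by the method of moments; that estimator is chosen for brevity and is not necessarily how the parameter is fitted in practice.

```python
from statistics import mean, variance

def nb_dispersion(counts):
    """Method-of-moments estimate of phi in var = mean + phi * mean**2.

    Returns 0.0 when the data are no more variable than Poisson.
    """
    m = mean(counts)
    v = variance(counts)  # sample variance (n - 1 denominator)
    if m == 0:
        return 0.0
    return max(0.0, (v - m) / (m * m))

# Hypothetical replicate counts for one gene across six lanes.
replicates = [10, 30, 5, 55, 20, 60]
phi = nb_dispersion(replicates)
print(mean(replicates), variance(replicates), round(phi, 3))
```

Here the variance (530) is far above the mean (30), so the estimated dispersion is well above zero.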
Quantitative Analysis - Biases
• Not all regions represented equally
• GC rich regions represented more
• Independent of GC some chromosome regions represented more – Euchromatin bias
• Sequence initiation site biases
• ‘Mapability’ biases – some regions won’t have any uniquely mapped tags
Comparing RNA-Seq & Affy
Issues
• How replicable is RNA-Seq?
• How consistent are the two technologies?
• Which is better?
• Marioni et al, Genome Research, 2008
Model for Variation
• Poisson counts, hypergeometric comparison
• Make uniform p-values by adding random term
– Use lower tails only
$$p_0(x_1; x_2) = \frac{\binom{C_1}{x_1}\binom{C_2}{x_2}}{\binom{C_1 + C_2}{x_1 + x_2}}$$
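The hypergeometric comparison and the randomised lower tail can be sketched as follows. The reading of the symbols is an assumption: x1 and x2 are taken to be one gene's counts in two lanes and C1 and C2 the lane totals. The random term u is drawn uniformly on [0, 1) so that p-values are uniform under the null despite the discreteness of the counts.

```python
import math
import random

def log_comb(n, k):
    """log of the binomial coefficient, via log-gamma so large lane totals are fine."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def hyper_pmf(x1, x2, c1, c2):
    """P(x1, x2) given the total x1 + x2, under equal expression in both lanes."""
    return math.exp(log_comb(c1, x1) + log_comb(c2, x2) - log_comb(c1 + c2, x1 + x2))

def randomised_lower_tail(x1, x2, c1, c2, u=None):
    """P(X < x1) + u * P(X = x1), with u ~ Uniform[0, 1) if not supplied."""
    if u is None:
        u = random.random()
    total = x1 + x2
    below = sum(hyper_pmf(k, total - k, c1, c2)
                for k in range(max(0, total - c2), x1))
    return below + u * hyper_pmf(x1, x2, c1, c2)

# Hypothetical example: 3 vs 2 reads for a gene, 10 reads in each lane.
print(hyper_pmf(3, 2, 10, 10), randomised_lower_tail(3, 2, 10, 10, u=0.5))
```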
Normalization of RNA-Seq
• Robinson et al noticed that most genes appeared less expressed in liver
Fig 1 from Robinson & Oshlack, Genome Biology 2010
A Better Normalization for RNA-Seq - TMM
• Drop extremes of ratios
• Drop very high count genes
• Compute trimmed means of samples
• Center log-ratios between samples
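The four steps can be sketched in Python as below. This is a simplified illustration, not the published method in full: it takes an unweighted mean of the surviving log-ratios (the published TMM uses precision weights), and the trim fractions of 30% for log-ratios and 5% for average abundance are assumptions for the example.

```python
import math

def tmm_factor(sample, reference, trim_m=0.3, trim_a=0.05):
    """Simplified trimmed mean of M-values between two count vectors."""
    n_s, n_r = sum(sample), sum(reference)
    stats = []
    for ys, yr in zip(sample, reference):
        if ys > 0 and yr > 0:  # log-ratios are undefined for zero counts
            ps, pr = ys / n_s, yr / n_r
            m = math.log2(ps / pr)        # log-ratio of proportions
            a = 0.5 * math.log2(ps * pr)  # average log abundance
            stats.append((m, a))
    n = len(stats)
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    # Drop the extremes of the ratios and of the abundances (both tails).
    keep_m = set(sorted(range(n), key=lambda i: stats[i][0])[cut_m:n - cut_m])
    keep_a = set(sorted(range(n), key=lambda i: stats[i][1])[cut_a:n - cut_a])
    kept = [stats[i][0] for i in keep_m & keep_a]
    # Scaling factor that centres the remaining log-ratios on zero.
    return 2 ** (sum(kept) / len(kept))

# Hypothetical data: one very highly expressed gene soaks up half the reads,
# so every other gene looks two-fold lower by total-count scaling alone.
reference = [100] * 20
sample = [50] * 19 + [1050]
factor = tmm_factor(sample, reference)
print(factor)
```

In this toy case the factor comes out at about 0.5, which is the correction needed to make the 19 unchanged genes agree between samples.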
New Things to do with RNA-Seq
• Allele-specific expression
• Splice variation
– Between tissues
– In disease
• Alternate initiation sites
– Select 5’ capped RNA fragments
• Alternate termination
Allelic Comparison
• It is possible to compare allele-specific expression counts
• Sample from VCU
• Replicate samples
• P-values for binomial tests of equality
• About half show differential expression!
[Figure: histogram of 2 * pnorm(-abs(z)); x-axis: 2 * pnorm(-abs(z)) from 0.0 to 1.0; y-axis: Frequency, 0 to 350]
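The p-values in the histogram are two-sided normal tail areas, 2 * pnorm(-abs(z)). A minimal Python sketch of such a test for one heterozygous site follows; it assumes z is the normal approximation to the binomial test of equal allele counts, which the slide does not state explicitly.

```python
import math

def allelic_p_value(ref_count, alt_count):
    """Two-sided p-value for equal expression of two alleles.

    Normal approximation to Binomial(n, 0.5); equivalent to
    2 * pnorm(-abs(z)) in R.
    """
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this site")
    z = (ref_count - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 60 reads from one allele, 40 from the other (z = 2).
p = allelic_p_value(60, 40)
print(round(p, 4))
```

If the null hypothesis held for every site, a histogram of these p-values would be roughly flat; a spike near zero indicates sites with allelic imbalance.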
Detecting Splice Variation
• Deep sequencing shows up clear variation in exon usage
• Wang et al Nature 2008
Tissue Map of Splice Variation
• Brain is most distinctive
• Individuals seem to differ
• Cell lines seem to have distinct splice patterns
From Wang et al
Splicing is Complex
• Many different splice operations exist
• Only some of these characterized by counting exon reads
Issues in Detecting Splice Variants
• Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance
• Reads that bridge splice junctions would be definitive but mapping is very dubious with short (<40 base) reads
• Not all possible splice junctions are known
– Hard to even search through the known ones
Methodology for Splice Variants
• Count reads mapped to exons and compare ratios across samples – Wang et al, and most others
• Count reads that cross splice junctions
ChIP-Seq Workflow
• Cross-link proteins to DNA
• Fragment DNA
• Extract with antibody
• Reverse cross links
• Sequence fragments
• DO CONTROLS!
Peak Finding – Better
• Tags starting on opposite strands are likely to start at opposite ends
• Identifying the cross-over point leads to improved accuracy
The Value of Controls: ChIP vs. Control Reads
Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation
Cause of Variation in Read Density
• In study of FoxA1 binding, even control reads enriched near FoxA1 binding site!
• Probably due to open chromatin near FoxA1 binding site
Density of control-channel reads around FoxA1 site
Courtesy Shirley Liu
ChIP-Seq – MACS
Key Ideas
• Smart peak imputation estimate
– Uses read directions
– Empirical estimate of fragment length
• Local frequency estimate
– Using control, if available
– Using wide estimate, otherwise
– Not using sequence
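The local frequency idea can be sketched as a Poisson test whose rate is estimated locally rather than genome-wide. The sketch below is an illustration under stated assumptions: it takes the largest of a genome-wide rate and rates from wider windows around the candidate peak (windows from the control lane when there is one). The function names, window sizes and numbers are hypothetical, and this is not the MACS source code.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def local_lambda(genome_rate, window_counts, window_sizes, peak_width):
    """Expected reads in the peak region: the largest of several estimates."""
    candidates = [genome_rate * peak_width]
    for count, size in zip(window_counts, window_sizes):
        candidates.append(count * peak_width / size)
    return max(candidates)

# Hypothetical peak: 200 bp wide, 20 ChIP reads; background 0.01 reads/bp
# genome-wide, but 30 reads in the surrounding 1 kb and 60 in the surrounding 10 kb.
lam = local_lambda(0.01, [30, 60], [1_000, 10_000], 200)
print(lam, poisson_upper_tail(20, lam), poisson_upper_tail(20, 0.01 * 200))
```

Using the larger local rate makes the test more conservative in regions where even background reads pile up, such as open chromatin.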
Read Lengths and Directions
• Some clear clusters – even before stats
• Reads on opposite sides of peak map to opposite strands – Hence fragments have opposite directions
• Can estimate apparent fragment length
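For a single clear cluster, the apparent fragment length and the likely binding position can be sketched from the tag positions alone. The example assumes tag positions are the 5' ends of reads, with forward-strand tags piling up on the left of the site and reverse-strand tags on the right; using medians is a simplification chosen for the example.

```python
from statistics import median

def peak_summary(forward_starts, reverse_starts):
    """Estimate fragment length and peak centre from strand-specific tag positions."""
    f = median(forward_starts)
    r = median(reverse_starts)
    return {
        "fragment_length": r - f,  # offset between the two strand pile-ups
        "centre": (f + r) / 2,     # cross-over point between the strands
    }

# Hypothetical tags around one binding site.
forward = [940, 950, 960]
reverse = [1040, 1050, 1060]
summary = peak_summary(forward, reverse)
print(summary)
```

Shifting each tag by half the estimated fragment length towards the centre makes the two strand pile-ups coincide, which sharpens the peak.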
Fragment Lengths
• Puzzle: Fragments from sonication expected to be between 200 – 500 bp
• Estimated fragment size ~ 100 bp
• Shirley Liu’s explanation: preferential cutting near to TF ??
Methylation Assays
• Affinity purification: e.g. MeDIP-Seq (methylated DNA immunoprecipitation)
• Methylation-specific cleavage by endonucleases
– e.g. Methyl-Seq: Cleaves with HpaII to identify
• Bisulphite conversion
– WGBS (Whole-Genome Bisulphite Sequencing)
– RRBS (Reduced Representation Bisulphite Sequencing)
• Cleaves with MspI to reduce complexity
Issues with Affinity Methods
• Analysis essentially like ChIP-Seq
• BUT: Sequence count reflects both density of CpG’s and proportions of methylation
• No individual CpG-level information
• Advantages: no conversion so sequence tags are easily mappable
Methyl-Seq
• Use HpaII to cleave only at unmethylated CCGG sites
• Size-select fragments (50-300)
• Sequence fragment ends – Always starting at a CCGG
• Easy to map – few possible loci (<1M)
• Paired ends give actual fragment
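An in-silico digest shows why the set of possible loci is small. The sketch below finds every CCGG in a sequence, cuts between the two C's, and keeps fragments in the size-selected range. It assumes every site is unmethylated and that the 50-300 range is in base pairs; the toy sequence is invented.

```python
def ccgg_cut_positions(seq):
    """Positions where the enzyme would cut (C^CGG), assuming no methylation."""
    cuts = []
    start = seq.find("CCGG")
    while start != -1:
        cuts.append(start + 1)  # cut falls after the first C of the site
        start = seq.find("CCGG", start + 1)
    return cuts

def size_selected_fragments(seq, min_len=50, max_len=300):
    """(start, end) of fragments between neighbouring cuts that pass size selection."""
    cuts = ccgg_cut_positions(seq)
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if min_len <= b - a <= max_len]

# Toy sequence with three CCGG sites: one 100 bp fragment and one 6 bp fragment.
seq = "AA" + "CCGG" + "A" * 96 + "CCGG" + "TT" + "CCGG"
print(ccgg_cut_positions(seq), size_selected_fragments(seq))
```

Reads are then mapped only against the ends of such predicted fragments rather than the whole genome.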
Issues for Methyl Seq
• Computational problem to re-assemble actual proportions of methylation at each locus from counts
• Prone to false positives because of incomplete digestion (for reasons other than methylation of CCGG site)
– e.g. insufficient time …
– rates vary by 50-fold depending on sequence context
WGBS
• Bisulphite conversion, fragmentation and shotgun sequencing
• Requires very many reads!
• Use of capture arrays reduces work… BUT different sequences have different capture efficiencies!
Issues with WGBS
• Lose many C’s
• Hard to map to genome
• Strategy depends on less penalty for mapping T to C
• Too many loci!
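The reduced penalty for T-to-C can be sketched as a mismatch count in which a T in the read over a C in the reference costs nothing, since it may simply be a converted, unmethylated C. The sketch is a brute-force scan of the forward strand only and is meant to show the scoring rule, not to be a usable aligner; the sequences are invented.

```python
def bisulphite_mismatches(read, ref_window):
    """Count mismatches, treating read T over reference C as a free match."""
    mismatches = 0
    for r, g in zip(read, ref_window):
        if r == g:
            continue
        if r == "T" and g == "C":
            continue  # possible bisulphite conversion, not penalised
        mismatches += 1
    return mismatches

def best_hits(read, reference, max_mismatches=1):
    """All (position, mismatches) placements within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mm = bisulphite_mismatches(read, reference[pos:pos + len(read)])
        if mm <= max_mismatches:
            hits.append((pos, mm))
    return hits

# Reference fragment and a read from position 2 in which the CpG cytosine
# stayed C (methylated) and the other two cytosines were converted to T.
reference = "GGACGTCCATGA"
read = "ACGTTTAT"
hits = best_hits(read, reference, max_mismatches=0)
print(hits)
```

The cost of this tolerance is lower specificity: converted reads are closer to a three-letter alphabet, so more placements score equally well.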
RRBS
• Too many methylation sites in genome
• Cleave with MspI and size select in order to reduce number of fragments
• Convert C to T with bisulphite (not mC)
• Then sequence fragments
• 1.4 M fragments
Issues with RRBS
• Fairly broad but not complete coverage of ‘interesting’ regions of genome
• Bisulphite conversion of limited regions means mapping is fairly easy
• Bisulphite conversion not always complete
What is Meta-genomics?
• Sequencing random fragments of DNA from all microbial denizens of a community (and traces of a few others)
• Sometimes broadly used for surveys of microbial diversity based on sequencing all 16S rRNA genes present
Kinds of Questions
• What is out there?
– Most microbial species not known
• What metabolic fluxes in any environment?
• What microbes associated with specific conditions?
– Including disease or health
• Human Microbiome Project
Data Analysis Issues – 16S rRNA
• Identification of microbes – most are unknown and un-culturable
• Distinguishing errors in sequencing from novel microbes
• Biases in sequencing