High-Throughput Sequencing

Advanced Microarray Analysis

BIOS 691-803, 2008

Dr. Mark Reimers, VCU

Quantitative HTS - Outline

• Technology

• Preprocessing

• Quantitative analysis

• Applications
  – ChIP-Seq
  – RNA-Seq
  – Methyl-Seq

The Technology

• Most sequencing proceeds by addition of fluor-labeled bases

• Do this in parallel on a flat surface

• Capture each stage with good camera

• Align images

Roche - 454

• Parallel Pyrosequencing on beads

Mardis, Trends in Genetics

454 Sequencing Operation

Illumina - Solexa

ABI SOLiD

• Resequencing each fragment with different primers

• Reconstruct each fragment separately

Paired-End Reads

Issues

• Pre-processing
  – Base calling
  – Mapping reads
  – QA

• Quantitative analysis
  – Variation and noise
  – Biases
  – Models
  – Accuracy and validation

Pre-processing – Base Calling

• Not all steps completed properly

• Sequence can lag behind or skip ahead

• Hence most light spots are a mixture of different colors

• Simple rule: use brightest signal
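The simple rule above can be sketched in a few lines. This is only an illustration: the four-channel intensity dictionaries and the function names `call_base` and `call_read` are hypothetical, not part of any vendor pipeline.

```python
def call_base(intensities):
    """Return the base whose channel has the brightest signal.

    `intensities` maps each base (A, C, G, T) to its measured signal
    for one spot in one cycle (hypothetical input format).
    """
    return max(intensities, key=intensities.get)


def call_read(cycles):
    """Call one base per cycle and join the calls into a read."""
    return "".join(call_base(c) for c in cycles)


# Made-up intensities for a two-cycle read; each spot is a mixture of colors
example_cycles = [
    {"A": 0.90, "C": 0.10, "G": 0.20, "T": 0.10},
    {"A": 0.30, "C": 0.10, "G": 0.05, "T": 0.60},
]
read = call_read(example_cycles)
```

Because lagging and skipping mix the colors, the gap between the brightest and second-brightest channel is a natural quantity to inspect when judging how reliable each call is.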

Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased

[Figure: bar chart of tag counts (axis 0 to 800,000) by mismatch type. Categories: Insert G, Delete C, Insert C, Delete G, Insert T, Insert A, Delete A, Delete T, C>G, T>G, A>T, A>G, C>T, T>C, T>A, C>A, G>C, A>C, G>A, G>T, Any single. Courtesy Thierry-Mieg]

Typical Errors in Base-Calling

[Figure: position of single mismatch in uniquely mapped tags. x-axis: position of single mismatch (0 to 36); y-axis: count (0 to 60,000); series: sample 1, sample 2. Courtesy Thierry-Mieg]

Improving Base-Calling with SVM

Pre-processing – Mapping Reads

• Huge numbers (10M – 70M)

• BLAT (2002 high-speed method)

• Eland (proprietary Illumina)

• Other new methods: MAQ, SOAP

Quality Assessment

• Fraction of reads mapping to targets

• Typically 5-10M reads per lane and 60-80% map to targets

• Some repetitive sequence

Comparing Samples - A Simple Normalization

• Different numbers of counts per lane

• Divide counts in a region of interest (a genomic region, a gene or an exon) by all counts in the lane, scaled per million reads (TPM, total per million reads)

• To compare genomic regions of different lengths, also divide by the length of the region in kilobases (TPKM, total per kilobase per million)
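As a worked illustration of the two normalizations, here is a minimal sketch. The function names and the example numbers are made up for illustration; the arithmetic follows the definitions on this slide.

```python
def tpm(region_count, total_reads):
    """Counts in a region per million reads in the lane."""
    return region_count * 1e6 / total_reads


def tpkm(region_count, total_reads, region_length_bp):
    """Counts per kilobase of region per million reads in the lane."""
    return tpm(region_count, total_reads) / (region_length_bp / 1000)


# Hypothetical example: 500 reads in a 2 kb exon, 8 million mapped reads in the lane
exon_tpm = tpm(500, 8_000_000)          # 62.5
exon_tpkm = tpkm(500, 8_000_000, 2000)  # 31.25
```

Both quantities are ratios to the lane total, so a few very highly expressed genes can shift them for every other gene.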

Quant. Analysis - Variation

• Poisson model often used for random variation

• Most HTS data ‘over-dispersed’ relative to Poisson

• Negative Binomial often used
  – Parameter fitted
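One rough way to see over-dispersion in replicate counts is to compare the sample variance with the mean, since a Poisson variable has variance equal to its mean. The sketch below also gives a method-of-moments estimate of the extra parameter, assuming the common parameterization Var = mu + phi * mu^2; that parameterization, the function names and the counts are assumptions for illustration, not taken from the slides.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson counts, above 1 if over-dispersed."""
    return variance(counts) / mean(counts)


def nb_dispersion_moments(counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    Returns 0.0 when the counts are not over-dispersed relative to Poisson.
    """
    mu, var = mean(counts), variance(counts)
    return max(0.0, (var - mu) / mu ** 2)


# Made-up counts for one gene across five replicate lanes
replicate_counts = [90, 110, 75, 130, 95]
index = dispersion_index(replicate_counts)      # 437.5 / 100 = 4.375
phi = nb_dispersion_moments(replicate_counts)   # 337.5 / 10000 = 0.03375
```

With only a handful of replicates a per-gene estimate like this is very noisy, which is one reason the parameter is usually fitted with information shared across genes.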

Quantitative Analysis - Biases

• Not all regions represented equally

• GC rich regions represented more

• Independent of GC, some chromosome regions represented more
  – Euchromatin bias

• Sequence initiation site biases

• ‘Mapability’ biases – some regions won’t have any uniquely mapped tags

GC Bias

• Density of reads depends strongly on GC content of regions

[Figure: density of reads against GC content (%)]
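A quick check for this bias is to bin genomic windows by GC content and compare the mean read count per bin. The sketch below assumes each window is supplied as a (sequence, read count) pair; that input format and the function names are hypothetical.

```python
def gc_percent(seq):
    """GC content in percent over the A/C/G/T bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    if acgt == 0:
        return None
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def mean_reads_by_gc_bin(windows, bin_width=10):
    """Mean read count per GC bin; `windows` is a list of (sequence, read_count)."""
    totals = {}
    for seq, reads in windows:
        gc = gc_percent(seq)
        if gc is None:          # window of N's or other ambiguous bases only
            continue
        # 100% GC is folded into the top bin
        b = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        s, n = totals.get(b, (0, 0))
        totals[b] = (s + reads, n + 1)
    return {b: s / n for b, (s, n) in sorted(totals.items())}


# Toy windows with made-up read counts
toy = [("ATAT", 2), ("ATAA", 4), ("GGCC", 10)]
by_bin = mean_reads_by_gc_bin(toy)
```

A strong trend across bins in a control or randomly sheared sample suggests that counts should be compared within GC strata or corrected before regions are compared.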

Genomic Position Biases

• Count tags from randomly sheared DNA in red with GC content in blue

Start Position Bias

Consistent Start Position Bias

Counts per start site in lane 1 vs lane 2

RNA-Seq

RNA-Seq Data

From Marioni et al 2008

[Figure panels: Kidney Reads; Liver Reads; Gene Model]

Accuracy of Illumina RNA-Seq

Comparing RNA-Seq & Affy

Issues

• How replicable is RNA-Seq?

• How consistent are the two technologies?

• Which is better?

• Marioni et al, Genome Research, 2008

Comparing Fold-Changes

• D.E. by ILM
  – Red >250
  – Green <250

• Black: not D.E. by ILM

Model for Variation

• Poisson counts: hypergeometric comparison

• Make uniform p-values by adding random term
  – Use lower tails only

$$p_0(x_1; x_2) = \frac{\binom{C_1}{x_1}\binom{C_2}{x_2}}{\binom{C_1+C_2}{x_1+x_2}}$$
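A minimal sketch of the comparison described on this slide, under the assumption that x1 and x2 are the counts for one gene in two lanes and C1 and C2 are the lane totals. The lower-tail p-value is randomized by adding a uniform fraction of the probability of the observed count, which makes it uniform under the null despite the discreteness of counts. The function names are hypothetical.

```python
import math
import random


def hypergeom_pmf(x1, x2, c1, c2):
    """Probability of the split (x1, x2) given x1 + x2 reads and lane totals c1, c2."""
    return math.comb(c1, x1) * math.comb(c2, x2) / math.comb(c1 + c2, x1 + x2)


def randomized_lower_p(x1, x2, c1, c2, u=None):
    """Lower-tail p-value P(X < x1) + u * P(X = x1), with u uniform on [0, 1)."""
    if u is None:
        u = random.random()
    n = x1 + x2
    k_min = max(0, n - c2)  # smallest count lane 1 can hold given lane 2's size
    below = sum(hypergeom_pmf(k, n - k, c1, c2) for k in range(k_min, x1))
    return below + u * hypergeom_pmf(x1, x2, c1, c2)


# Toy example: 12 vs 8 reads for a gene in two lanes of 1000 reads each
p = randomized_lower_p(12, 8, 1000, 1000)
```

Between technical replicates these p-values should look uniform, which is what the QQ-plots are checking.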

False Positive Rates

• QQ-plots of p-values between tech. reps

Different Concentrations are NOT Comparable!

• QQ-plots of p-values between 3 pM and 1.5 pM

Normalization of RNA-Seq

• Robinson et al noticed that most genes appeared less expressed in liver

Fig 1 from Robinson & Oshlack, Genome Biology 2010

A Better Normalization for RNA-Seq - TMM

• Drop extremes of ratios

• Drop very high count genes

• Compute trimmed means of samples

• Center log-ratios between samples
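The steps above can be sketched as follows. This is a simplified, unweighted version for illustration only: the published TMM method uses a precision-weighted mean, and the trimming fractions here (30% of log-ratios, 5% of abundances from each end) are parameters of this sketch. Trimming on abundance drops the very high count genes and, in this sketch, the very low ones too.

```python
import math


def tmm_log2_factor(sample, ref, trim_m=0.30, trim_a=0.05):
    """Trimmed mean of log2 ratios of `sample` to `ref` (simplified, unweighted)."""
    n_s, n_r = sum(sample), sum(ref)
    m_vals, a_vals = [], []
    for y_s, y_r in zip(sample, ref):
        if y_s > 0 and y_r > 0:                         # logs need non-zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m_vals.append(math.log2(p_s / p_r))         # log-ratio (M)
            a_vals.append(0.5 * math.log2(p_s * p_r))   # mean log abundance (A)
    n = len(m_vals)

    def kept(values, trim):
        """Indices that survive trimming a fraction `trim` from each end."""
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept(m_vals, trim_m) & kept(a_vals, trim_a)
    if not keep:
        raise ValueError("no genes left after trimming")
    return sum(m_vals[i] for i in keep) / len(keep)


# Toy data: 20 genes, identical in both samples except one gene that is
# hugely expressed in `sample` and inflates its library size
ref = [100] * 20
sample = [10000] + [100] * 19
log2_factor = tmm_log2_factor(sample, ref)
effective_size = sum(sample) * 2 ** log2_factor
```

In the toy data the one dominant gene makes every other gene look less expressed when scaled by the raw total; the trimmed mean recovers an effective library size of about 2000, matching the reference.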

New Things to do with RNA-Seq

• Allele-specific expression

• Splice variation
  – Between tissues
  – In disease

• Alternate initiation sites
  – Select 5’ capped RNA fragments

• Alternate termination

Allelic Comparison

• It is possible to compare allele-specific expression counts

• Sample from VCU

• Replicate samples

• P-values for binomial tests of equality

• About half show differential expression!

[Figure: histogram of 2 * pnorm(-abs(z)); x-axis: p-value, 0.0 to 1.0; y-axis: frequency, 0 to 350]
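The histogram label suggests a normal approximation to the binomial test of equal allelic counts. A minimal sketch under that assumption follows; the function name is hypothetical, and `math.erfc(abs(z) / sqrt(2))` is the Python equivalent of R's `2 * pnorm(-abs(z))`.

```python
import math


def allelic_p_value(count_a, count_b):
    """Two-sided p-value for equal expression of two alleles (normal approximation)."""
    n = count_a + count_b
    if n == 0:
        raise ValueError("no reads cover this site")
    # Under equality, count_a ~ Binomial(n, 1/2): mean n/2, variance n/4
    z = (count_a - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up allele counts at one heterozygous site
p_equal = allelic_p_value(50, 50)    # z = 0
p_skewed = allelic_p_value(60, 40)   # z = 2
```

The approximation is poor at low counts, and over-dispersion or mapping bias toward the reference allele will produce small p-values without any true allelic difference, so a large fraction of apparent hits deserves scrutiny.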

Detecting Splice Variation

• Deep sequencing shows up clear variation in exon usage

• Wang et al Nature 2008

Tissue Map of Splice Variation

• Brain is most distinctive

• Individuals seem to differ

• Cell lines seem to have distinct splice patterns

From Wang et al

Splicing is Complex

• Many different splice operations exist

• Only some of these characterized by counting exon reads

Issues in Detecting Splice Variants

• Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance

• Reads that bridge splice junctions would be definitive but mapping is very dubious with short (<40 base) reads

• Not all possible splice junctions are known
  – Hard to even search through the known ones

Methodology for Splice Variants

• Count reads mapped to exons and compare ratios across samples
  – Wang et al, and most others

• Count reads that cross splice junctions
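The exon-counting approach can be illustrated with a log-ratio of an exon's share of its gene's reads in two samples. This is a sketch only: the function name and the pseudo-count (added to avoid taking the log of zero) are assumptions of this example, not the method of Wang et al.

```python
import math


def exon_log_ratio(exon_a, gene_a, exon_b, gene_b, pseudo=0.5):
    """log2 ratio of the exon's share of gene reads in sample A versus sample B.

    `exon_*` are reads mapped to the exon, `gene_*` reads mapped to the whole gene.
    """
    share_a = (exon_a + pseudo) / (gene_a + 2 * pseudo)
    share_b = (exon_b + pseudo) / (gene_b + 2 * pseudo)
    return math.log2(share_a / share_b)


# Made-up counts: the exon holds 30% of the gene's reads in A but 10% in B
ratio = exon_log_ratio(30, 100, 20, 200)
```

Dividing by the gene total removes overall expression differences, so a non-zero ratio points at exon usage; exon-specific biases cancel only if they are the same in both samples.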

Methodology for Finding Junctions

ChIP-Seq

Chromatin Immuno-precipitation

ChIP-Seq Workflow

• Cross-link proteins to DNA

• Fragment DNA

• Extract with antibody

• Reverse cross links

• Sequence fragments

• DO CONTROLS!

ChIP-Seq Data

• From Rozowsky et al, Nature Biotech 2009

ChIP-Seq vs ChIP-chip

Peak-Finding - Simple

• Extend tags and count overlap

• How much to extend?
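A minimal sketch of the simple approach: extend each tag from its 5' end in the direction of its strand and count how many extended tags overlap each base. The input format (5' position plus strand) and the function name are assumptions for illustration; the extension length is the free parameter the slide asks about.

```python
def coverage_from_tags(tags, extend, length):
    """Per-base count of overlapping extended tags on a region of `length` bases.

    `tags` is a list of (five_prime_position, strand), strand being '+' or '-'.
    Plus-strand tags extend rightwards, minus-strand tags leftwards.
    """
    diff = [0] * (length + 1)          # difference array: +1 at start, -1 past end
    for pos, strand in tags:
        if strand == "+":
            lo, hi = pos, pos + extend
        else:
            lo, hi = pos - extend + 1, pos + 1
        lo, hi = max(lo, 0), min(hi, length)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    coverage, running = [], 0
    for d in diff[:length]:
        running += d
        coverage.append(running)
    return coverage


# Two tags on opposite strands that come from the same 20-base fragment
cov = coverage_from_tags([(10, "+"), (29, "-")], extend=20, length=50)
```

If the extension matches the true fragment length, tags from both strands pile up over the binding site; too short an extension gives two separate bumps and too long an extension smears the peak.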

Peak Finding – Better

• Tags starting on opposite strands are likely to start at opposite ends

• Identifying the cross-over point leads to improved accuracy

The Value of Controls: ChIP vs. Control Reads

Red dots are windows containing ChIP peaks and black dots are windows containing control peaks used for FDR calculation

Cause of Variation in Read Density

• In study of FoxA1 binding, even control reads enriched near FoxA1 binding site!

• Probably due to open chromatin near FoxA1 binding site

Density of control-channel reads around FoxA1 site

Courtesy Shirley Liu

ChIP-Seq – MACS

Key Ideas

• Smart peak imputation estimate
  – Uses read directions
  – Empirical estimate of fragment length

• Local frequency estimate
  – Using control, if available
  – Using wide estimate, otherwise
  – Not using sequence
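The local frequency idea can be illustrated with a Poisson tail probability whose rate is the largest of several background estimates, so that a locally read-rich region (for example open chromatin) raises the bar for calling a peak. This is a sketch in the spirit of the bullet above, not a reproduction of the MACS code; the window sizes, function names and numbers are assumptions.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)   # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)


def local_lambda(background_counts, window_sizes, peak_width, genome_rate):
    """Most conservative expected count in a candidate peak of `peak_width` bases.

    `background_counts[i]` is the number of background reads (control, if
    available) in a window of `window_sizes[i]` bases centred on the candidate;
    `genome_rate` is the genome-wide number of reads per base.
    """
    rates = [genome_rate * peak_width]
    for count, size in zip(background_counts, window_sizes):
        rates.append(count * peak_width / size)
    return max(rates)


# Made-up numbers: 200 bp candidate peak, background windows of 1 kb and 5 kb
lam = local_lambda([10, 20], [1000, 5000], peak_width=200, genome_rate=0.001)
p_peak = poisson_sf(12, lam)   # chance of 12 or more reads under the background
```

Taking the maximum rather than the genome-wide rate alone is what protects against the enrichment of control reads seen near FoxA1 sites.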

Read Lengths and Directions

• Some clear clusters – even before stats

• Reads on opposite sides of peak map to opposite strands
  – Hence fragments have opposite directions

• Can estimate apparent fragment length
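One simple way to estimate the apparent fragment length is to find the shift that best lines up plus-strand tag starts with minus-strand tag starts. The sketch below counts matching pairs for each candidate shift; the function name, the brute-force search and the toy positions are assumptions for illustration.

```python
from collections import Counter


def estimate_shift(plus_starts, minus_starts, max_shift=500):
    """Shift d (in bases) maximizing pairs of a plus tag at p and a minus tag at p + d.

    Positions are the 5' ends of the tags, so the best shift approximates
    the apparent fragment length.
    """
    minus = Counter(minus_starts)
    best_d, best_score = 0, -1
    for d in range(max_shift + 1):
        score = sum(minus[p + d] for p in plus_starts)
        if score > best_score:      # ties keep the smallest shift
            best_d, best_score = d, score
    return best_d


# Toy tags around one binding site: minus-strand starts sit 100 bases downstream
shift = estimate_shift([100, 105, 110], [200, 205, 210])
```

On real data the count per shift is noisy, so the estimate is usually taken over many strong peaks rather than a single site.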

Fragment Lengths

• Puzzle: Fragments from sonication expected to be between 200 – 500 bp

• Estimated fragment size ~ 100 bp

• Shirley Liu’s explanation: preferential cutting near to TF ??

Comparison to ChIP-chip

• Broad correlation

• Not a dramatic improvement in precision!

Methyl-Seq

Methylation Assays

• Affinity purification: e.g. MeDIP-Seq (methylated DNA immunoprecipitation)

• Methylation-specific cleavage by endonucleases
  – e.g. Methyl-Seq: Cleaves with HpaII to identify

• Bisulphite conversion
  – WGBS (Whole-Genome Bisulphite Sequencing)
  – RRBS (Reduced Representation Bisulphite Sequencing)
     • Cleaves with MspI to reduce complexity

Affinity: MeDIP-Seq & MBD-Seq

Issues with Affinity Methods

• Analysis essentially like ChIP-Seq

• BUT: Sequence count reflects both density of CpG’s and proportions of methylation

• No individual CpG-level information

• Advantages: no conversion so sequence tags are easily mappable

Methyl-Seq

• Use HpaII to cleave only at unmethylated CCGG sites

• Size-select fragments (50-300)

• Sequence fragment ends
  – Always starting at a CCGG

• Easy to map – few possible loci (<1M)

• Paired ends give actual fragment

Schematic Here

Issues for Methyl Seq

• Computational problem to re-assemble actual proportions of methylation at each locus from counts

• Prone to false positives because of incomplete digestion (for reasons other than methylation of CCGG site)
  – e.g. insufficient time …
  – rates vary by 50-fold depending on sequence context

WGBS

• Bisulphite conversion, fragmentation and shotgun sequencing

• Requires very many reads!

• Use of capture arrays reduces work… BUT different sequences have different capture efficiencies!

WGBS Data (from capture array)

• top, CHP-SKN-1; bottom, MDA-MB-231

• NB. Inconsistent tag numbers

Issues with WGBS

• Lose many C’s

• Hard to map to genome

• Strategy depends on less penalty for mapping T to C

• Too many loci!
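The mapping strategy mentioned above can be illustrated by a mismatch count that does not penalize a read T aligned to a reference C, followed by a per-position methylation call. This sketch handles only reads from the forward strand with no indels, and it treats every remaining C as methylated, which ignores incomplete conversion; the function names are hypothetical.

```python
def bisulfite_mismatches(read, ref):
    """Mismatches between aligned read and reference, allowing read T at reference C."""
    return sum(
        1
        for r, g in zip(read, ref)
        if r != g and not (r == "T" and g == "C")
    )


def methylation_calls(read, ref):
    """(position, methylated?) for each reference C covered by a C or T in the read.

    A retained C is read as methylated, a T as unmethylated (converted).
    """
    return [
        (i, r == "C")
        for i, (r, g) in enumerate(zip(read, ref))
        if g == "C" and r in "CT"
    ]


# Toy alignment: the first C was converted, the second was protected
reference = "ACGTCG"
converted_read = "ATGTCG"
calls = methylation_calls(converted_read, reference)
```

Because converted reads are close to a three-letter alphabet, many more genomic positions match equally well, which is why mapping is hard even with this relaxed penalty.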

RRBS

• Too many methylation sites in genome

• Cleave with MspI and size select in order to reduce number of fragments

• Convert unmethylated C to T with bisulphite (mC is not converted)

• Then sequence fragments

• 1.4 M fragments

Issues with RRBS

• Fairly broad but not complete coverage of ‘interesting’ regions of genome

• Bisulphite conversion of limited regions means mapping is fairly easy

• Bisulphite conversion not always complete

Meta-Genomics

What is Meta-genomics?

• Sequencing random fragments of DNA from all microbial denizens of a community (and traces of a few others)

• Sometimes used more broadly for surveys of microbial diversity based on sequencing all 16S rRNA genes present

Kinds of Questions

• What is out there?
  – Most microbial species not known

• What metabolic fluxes in any environment?

• What microbes associated with specific conditions?
  – Including disease or health

• Human Microbiome Project

Environmental Meta-Genomics

Human Microbiome Project

Data Analysis Issues – 16S rRNA

• Identification of microbes – most are unknown and un-culturable

• Distinguishing errors in sequencing from novel microbes

• Biases in sequencing

Data Analysis Issues - Metagenomics

• Mapping and characterizing unknown protein sequences

• Usually assume conservation

• Full-coverage allows assembly of genomes

• Counting

• Biases probably smaller (Bork)
