Other applications of second-generation sequencing
Review
• We have covered, for second-generation sequencing:
– Overview of the technologies.
– Data and statistical issues.
– RNA-seq, ChIP-seq, and their analysis strategies.
• Today we will introduce some other applications of sequencing, mainly:
– DNA methylation: bisulfite sequencing (BS-seq).
– Hi-C for 3-dimensional chromatin structures.
Epigenetics
Heritable mechanisms that control gene expression without involving changes to the DNA sequence. Examples: DNA methylation, histone modifications.
(Image source: Wikipedia)
DNA methylation
• An epigenetic modification of the DNA sequence.
• Involves adding a methyl group to cytosine.
• Happens primarily at CpG sites (where a C is immediately followed by a G), although non-CG methylation exists.
• Mostly detected in higher organisms:
– In the human genome, most CpG sites are fully methylated (over 90%), except at CpG islands, where the methylation level is minimal.
– Methylation is detected in some plants, insects, and bacteria, but the levels are low.
http://www.delawareneuroscience.org/images/Investigator/DNA%20methylation_small.jpg
http://www.bio.miami.edu/dana/pix/cytosine.bmp
Function of DNA methylation
• Important in gene regulation: methylation at the TSS suppresses gene expression.
• Plays a crucial role in development and differentiation: helps cells establish identity.
• Believed to interact with environmental exposures, so it is used to explain GxE interactions.
• Often referred to as the “5th base”.
• Recent research has found other types of methylation, e.g., hydroxymethylation.
DNA Methylation regulates gene expression
http://www.spandidos-publications.com/article_images/or/31/2/OR-31-02-0523-g00.jpg
Detecting DNA methylation
• Capture based: MeDIP-seq (methylated DNA immunoprecipitation followed by sequencing).
– Same as ChIP-seq, but uses an antibody against methylated DNA.
– Analysis methods are the same as for ChIP-seq.
– Resolution is low: it can only roughly quantify the amount of DNA methylation within a few hundred bp.
• Bisulfite sequencing (BS-seq): bisulfite conversion of DNA followed by sequencing.
– Base-pair resolution: measures the methylation status of each cytosine.
Bisulfite sequencing
• Technology in a nutshell:
– First treat the DNA with bisulfite. As a result:
  • Unmethylated C is turned into T (via uracil, which is read as T after amplification).
  • Methylated C is protected and remains C.
  • Other bases are unchanged.
– Amplify, then sequence the treated DNA segments.
• At each cytosine position, the proportion of reads showing C rather than T measures the methylation level.
• Raw data: sequence reads, which no longer match the reference genome exactly.
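The conversion rule above can be mimicked in a few lines. This is a minimal sketch for a single top-strand sequence only; the function name, the toy sequence, and the choice of which C is methylated are all assumptions made for illustration.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence as read after bisulfite treatment and amplification.

    Unmethylated C becomes T; a methylated C (index listed in
    methylated_positions) stays C; all other bases are unchanged.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


reference = "ATACTTCGATGATCTCG"
# Hypothetical methylation state: only the C at index 6 is methylated.
read = bisulfite_convert(reference, methylated_positions=[6])
print(reference)
print(read)
```

Comparing the two printed lines position by position shows which cytosines were protected, which is the information BS-seq recovers.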
Bisulfite Sequencing
http://www.ecseq.com/services/EPIseq.html
Alignment of BS-seq
• Reads from BS-seq cannot be aligned directly to the reference genome.
– There are four different strands after bisulfite treatment and PCR.
– A T in a read could align to either a T or a C in the reference.
– The search space for alignment is bigger.
BMC Bioinformatics 2009, 10:232 http://www.biomedcentral.com/1471-2105/10/232
[...] and is still lacking in current short read alignment software.

A common approach to overcome these issues is to convert all Cs to Ts and map the converted reads to the converted reference; then, the alignment results are post-processed to count false-positive bisulfite C/T alignments as mismatches, where a C in the BS-read is aligned to a T in the reference [2]. Although this all-inclusive C/T conversion is effective for reads derived from the C-poor strands, it is not appropriate for reads derived from the G-poor strands, where all the Cs are actually transcribed from Gs by PCR amplification and thus could not be converted to Ts during bisulfite treatment. During shotgun sequencing, however, a bisulfite read is almost equally likely to be derived from either the C-poor or the G-poor strands. There is no precise way to determine the original strand a bisulfite read is derived from. Furthermore, by ignoring the C/T mapping asymmetry, this strategy generates a large number of false-positive bisulfite mappings and greatly increases the computational load in a quadratic manner with an increase in the size of the reference sequence. In order to accurately extract the true bisulfite mappings in the post-processing stage, all mapping locations have to be recorded, even the non-unique mappings. Therefore, this approach is only practical for small reference sequences, where only the C-poor strands are sequenced. For example, Meissner et al. used this mapping strategy for reduced representation bisulfite sequencing (RRBS) [2], where the genomic DNA was digested by the MspI restriction enzyme and 40–220 bp segments were selected for sequencing. The reference sequence (~27 M nt) is only about 1% of the whole mouse genome, covering 4.8% of the total CpG dinucleotides.
Figure 2. Mapping of bisulfite reads. 1) Multiple mapping: increased search space due to the cytosine-thymine conversion in the bisulfite treatment. 2) Mapping asymmetry: thymines in bisulfite reads can be aligned with cytosines in the reference (illustrated in blue) but not the reverse.
[Figure graphics not reproduced. Example sequences shown in the figure: bisulfite read ATTTCG; reference ATACTTCGATGATCTCGCAAGACTCCGGC.]
Alignment strategy
• Use existing alignment software (e.g., bowtie) as is:
– Problem: C-T mismatches prevent some reads from being aligned.
• Naïve method: convert all C’s to T’s in both the reference and the reads, then align.
– Problem: creates other mismatches, since a C in a read can end up aligned to a T in the reference, which bisulfite conversion cannot produce.
• Better ideas:
– Consider the methylation status during alignment: create multiple versions of the reference “seed” (there will be four sets of references at each location containing a C).
• Clever implementations are needed.
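To make the naïve method and its pitfall concrete, here is a minimal sketch of a brute-force exact-match mapper, assuming reads come from the converted top strand only. It is not how any real aligner is implemented; the function names are hypothetical. It first matches in the collapsed three-letter alphabet, then rejects hits where a C in the read sits on a T in the reference.

```python
def c_to_t(seq):
    """Collapse the alphabet by turning every C into T."""
    return seq.replace("C", "T")


def map_bisulfite_read(read, reference):
    """Naive exact-match mapper for a read from the converted top strand.

    Step 1: compare the fully C->T converted read and reference.
    Step 2: drop hits where a C in the read sits on a T in the reference,
            which bisulfite conversion cannot produce.
    Returns the list of accepted 0-based start positions.
    """
    conv_read, conv_ref = c_to_t(read), c_to_t(reference)
    hits = []
    for start in range(len(reference) - len(read) + 1):
        if conv_ref[start:start + len(read)] != conv_read:
            continue
        window = reference[start:start + len(read)]
        if any(r == "C" and g == "T" for r, g in zip(read, window)):
            continue  # false-positive C/T alignment
        hits.append(start)
    return hits


reference = "ATACTTCGATGATCTCGCAAGACTCCGGC"
hits = map_bisulfite_read("ATTTCG", reference)
print(hits)
```

With this toy reference the read ATTTCG is accepted at three positions, which illustrates the enlarged search space, while a read such as ATTCTG matches three positions in the collapsed alphabet but survives the asymmetry check at only one.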
Alignment tools
• See a list of available BS-seq aligners at http://www.mi.fu-berlin.de/w/ABI/ExistingBisulfiteMappers.
• Performance-wise, they are usually slower than ordinary aligners:
– on the order of a few hundred reads per second.
Data after alignments
• Special software is needed to process the alignment file.
• At each C position, report the total number of reads covering that site, and the number of reads with T:

chr1  3010874  22  18
chr1  3010894  31  27
chr1  3010922  12  10
chr1  3010957   7   6
chr1  3010971   6   6
chr1  3011025   7   5

• These are usually the inputs for downstream BS-seq analysis.
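A per-site summary like the one above is easy to turn into proportions. The sketch below assumes four whitespace-separated fields per line (chromosome, position, total reads, count); that layout and the function name are assumptions, not a documented file format. Whether count/total is the methylated or the unmethylated fraction depends on which reads the last column counts.

```python
def parse_site(line):
    """Parse 'chr pos total count' and add the fraction count/total."""
    chrom, pos, total, count = line.split()
    total, count = int(total), int(count)
    fraction = count / total if total > 0 else float("nan")
    return chrom, int(pos), total, count, fraction


lines = [
    "chr1 3010874 22 18",
    "chr1 3010894 31 27",
    "chr1 3010922 12 10",
]
sites = [parse_site(line) for line in lines]
for chrom, pos, total, count, fraction in sites:
    print(chrom, pos, total, count, round(fraction, 3))
```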
BS-seq data analysis
• Compared with ChIP-seq and RNA-seq, still at a relatively early stage.
• Questions include:
– Single dataset analysis:
  • Segment the genome according to methylation status.
– Comparison of multiple datasets:
  • Differential methylation (DM) analysis.
Single BS-seq dataset analysis
• Detecting methylated loci/regions:
– Estimate the “methylation density” (the percentage of cells that are methylated) at each C position, which is simply #methylated/#total at each CpG site, but:
  • Background error rates need to be considered.
  • Spatial correlation among nearby CpG sites can be used to improve estimation.
– Methylated regions (or states) can be determined by smoothing-based methods (e.g., moving average, HMM) using the estimated percentages as input.
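As a simple illustration of borrowing information from nearby CpG sites, the sketch below pools counts within a fixed genomic window, which amounts to a coverage-weighted moving average. The window size, the function name, and the toy counts are assumptions; it is a quadratic-time illustration rather than a practical implementation.

```python
def moving_average(positions, methylated, totals, half_window=500):
    """Coverage-weighted moving average of methylation levels.

    For each site, pool the counts of all sites within half_window bp
    and return pooled methylated reads / pooled total reads.
    """
    smoothed = []
    for center in positions:
        m_sum = n_sum = 0
        for pos, m, n in zip(positions, methylated, totals):
            if abs(pos - center) <= half_window:
                m_sum += m
                n_sum += n
        smoothed.append(m_sum / n_sum if n_sum > 0 else float("nan"))
    return smoothed


positions = [100, 150, 300, 5000]
methylated = [9, 1, 8, 0]
totals = [10, 2, 10, 10]
smoothed = moving_average(positions, methylated, totals, half_window=250)
print(smoothed)
```

The site at position 150 has only two reads, so its raw estimate (1/2) is unreliable; pooling with its neighbours pulls it toward the local level, while the isolated site at 5000 is left unchanged.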
An HMM approach
• Stadler et al. (2012) Nature:
– Use the estimated percentages as input to fit a 3-state HMM: fully methylated regions (FMR), low-methylated regions (LMR), and unmethylated regions (UMR).
[...] DNase-I-hypersensitive sites (DHS), a unique chromatin state that depends on DNA-binding factors [10–12]. In fact, at least 80% of LMRs and 90% of UMRs overlap with DHS (Fig. 2 and Supplementary Fig. 2). LMRs are unlikely novel promoters as we find only weak signal for RNA polymerase II (Fig. 2 and Supplementary Fig. 3) and no RNA signal above what we observe at methylated regions even when using a strand-specific protocol that does not require polyadenylation (Supplementary Fig. 3). Next, we explored if LMRs could represent distal regulatory regions, such as enhancers. Indeed, LMRs are strongly enriched for chromatin features such as high H3K4 monomethylation (H3K4me1) signal relative to H3K4 trimethylation (H3K4me3) and the presence of p300 histone acetyltransferase, which are predictive features of enhancers [13] (Fig. 2). This indicates that a subset of LMRs are enhancers that, in light of the absence of H3K27me3 and the presence of H3K27ac, are presumably active [14] (Fig. 2b). Transgenic assays further show that individual LMRs increase the activity of a linked promoter and experimentally function as enhancers (Supplementary Fig. 4). We thus conclude that many LMRs, identified solely by their DNA methylation pattern, represent active regulatory regions.

To investigate LMR features further, we combined newly generated and published data sets for several DNA-binding factors and additional histone modifications (Supplementary Table 1, Fig. 2b and Supplementary Figs 5 and 6). LMRs and UMRs are depleted for the heterochromatic histone modification H3K9me2 in agreement with the absence of this mark at active chromatin [6]. Most DNA-binding factors show enrichment not only at UMRs, which are mostly promoters, but also at LMRs. Factors enriched at LMRs in stem cells include pluripotency transcription factors such as Nanog, Oct4 and Klf4, but also structural DNA-binding factors such as the insulator protein CTCF [15] and members of the cohesin complex (Fig. 2b and Supplementary Fig. 5), both of which bind promoters and distal regulatory regions [16]. Notably, not all factors occupy distal and proximal regulatory regions with equal preferences. Smad1 binds to neither LMRs nor UMRs, whereas some bind primarily at UMRs, such as KDM2A and Zfx, and others such as Nanog and Esrrb show higher enrichment at LMRs (Fig. 2b and Supplementary Fig. 5). In summary, several lines of evidence including genomic position, conservation, chromatin state, regulatory activity and transcription factor occupancy support the hypothesis that LMRs are indeed active distal regulatory regions.

Interestingly LMRs show a strong presence of 5-hydroxymethylcytosine (5hmC), consistent with recent reports of 5hmC presence at enhancer regions [17–19]. One candidate protein responsible for catalysing 5hmC, Tet1 (refs 20, 21), is enriched at both UMRs and LMRs (Fig. 2b).

To ask if LMRs are also present in other mammals we performed HMM segmentation of a human stem cell methylome [3], which also identifies LMRs with similar features, indicating that these are a general characteristic of mammalian methylomes (Supplementary Fig. 7).

Transcription factor binding creates LMRs
To determine how LMRs are formed, we investigated the DNA-binding protein CTCF, which binds to regulatory regions including promoters, enhancers and insulators [22,23]. We determined the genome-wide binding of CTCF by chromatin immunoprecipitation followed by sequencing (ChIP-seq) (Supplementary Fig. 8), revealing high occupancy at both UMRs and LMRs (Fig. 2b and Supplementary Fig. 5). A composite view of DNA methylation shows an average methylation of 20% at CTCF binding sites with increasing methylation adjacent to it (Supplementary Fig. 9), in line with a previous report in primates [24]. If reduced methylation is a general feature of CTCF-occupied sites, inclusion of DNA methylation data should improve prediction of CTCF binding.
[Figure 2 graphics not reproduced. Panel a: composite profiles for FMR, UMR and LMR against position around segment middle (kb, −3 to 3), showing mean conservation and enrichment (log2) of DNase I, H3K4me3, Pol II, H3K4me1 and p300. Panel b: heat map over FMR, UMR and LMR with rows Methylation, DNase I, H3K4me1, H3K4me2, H3K4me3, Pol II, p300, H3K9me2, H3K27me3, H3K27ac, CTCF, Nipbl, Smc1, Smc3, Oct4, Nanog, Klf4, Esrrb, E2f1, KDM2A, Zfx, n-Myc, STAT3, Smad1, 5hmC.CMS, 5hmC.GLIB and Tet1; color scales are Methylation (%) and Enrichment.]
Figure 2 | General features of LMRs. Composite profiles 3 kb around segment midpoints. a, Evolutionary conservation based on multi-species alignments (upper left). Enrichment of DNase I tags (lower left). Chromatin features that predict enhancer function are enriched at LMRs (middle and right). b, Heat map of methylation levels, histone modifications and protein binding (H3K4me1 signal rescaled for visibility).
[Figure 1 graphics not reproduced. Axis labels recoverable from the panels: a, fraction of CpGs versus methylation (%) in 10% bins; b, methylation (%) along chr5 (120 to 120.15 Mbp) with tracks for CGI, genes (Tbx3) and LMR; c, methylation (%) versus position around middle (kb); d, density versus distance to TSS (log2 nt); e, overlap with CpG islands (n = 15,974) for FMR, UMR and LMR; f, composition of FMR (2,485.0 Mbp), UMR (27.9 Mbp) and LMR (12.0 Mbp) by promoter, exon, intron, repeat and intergenic sequence.]
Figure 1 | Features of the mouse ES cell methylome. a, Distribution of CpG methylation frequency for all CpGs with at least tenfold coverage. Of all cytosines, 4.1% show intermediate methylation levels. b, Representative genomic region. Computational segmentation identifies UMRs (blue pentagons), LMRs (red triangles) and FMRs (unmarked). Each dot represents one CpG (CpG islands marked in green). Included is an independently verified LMR upstream of Tbx3. Mbp, million base pairs. c, Composite profile of CpG methylation for all three groups. kb, kilobases. d, Distances to TSS. e, f, Distribution of all three classes among genome features. e, A small percentage of LMRs overlap with CpG islands. Numbers indicate observed percentage of overlaps per group (expected percentage in parentheses). f, Distribution of the regions throughout the genome.
Source: Nature, vol. 480, 22/29 December 2011, p. 491. ©2012 Macmillan Publishers Limited. All rights reserved.
Smoothing method
• One can smooth the percentages directly, but that ignores the uncertainty in the percentage estimates.
• A better approach: the BSmooth model (Hansen et al. 2012, Genome Biology).
– Assumes the true methylation level is a smooth curve along genomic coordinates.
– The observed counts follow a binomial distribution.
BSmooth smoothing
• Notation at position j:
– $N_j$, $M_j$: total/methylated reads.
– $\pi_j$: underlying true methylation level.
– $l_j$: location.
• Model:

$$M_j \sim \mathrm{Bin}(N_j, \pi_j)$$

$$\log\frac{\pi_j}{1-\pi_j} = \beta_0 + \beta_1 l_j + \beta_2 l_j^2$$

• Fitting: weighted GLM in each 2 kb window, where the weights depend on the variances of the estimated $\pi_j$.
BSmooth Bioconductor package: bsseq
• Mainly provides functions for smoothing and some visualization.
• Implemented with parallel computing to speed up the calculation.

library(bsseq)
# Toy data: 3 CpG sites (rows) x 3 samples (columns)
M <- matrix(0:8, 3, 3)      # methylated read counts
Cov <- matrix(1:9, 3, 3)    # total coverage
BS1 <- BSseq(chr = c("chr1", "chr2", "chr1"), pos = c(1, 2, 3),
             M = M, Cov = Cov, sampleNames = c("A", "B", "C"))
BS1 <- BSmooth(BS1)
Differential methylation analysis
• Comparing methylation profiles under different biological conditions is of great interest.
– Results from such an analysis are differentially methylated loci (DML) or regions (DMR).
• Strategy to detect DML:
– Hypothesis testing at each CpG site.
• Strategy to detect DMR:
– Need to combine data from nearby CpG sites because of the spatial correlation.
DML detection based on 2x2 table
• At each CpG site, summarize the counts from two samples into a 2x2 table:

            Total   Methylated
Sample 1    40      2
Sample 2    25      19

• Chi-square or Fisher’s exact test can be applied.
• bsseq has the function fisherTests for this:

fisherTests(BSobj, group1, group2)
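For a single CpG site the test can be worked out by hand. The sketch below is a plain-Python two-sided Fisher's exact test applied to the table from this slide (2 of 40 reads methylated in sample 1, 19 of 25 in sample 2). It is an illustration of the test itself, not the bsseq implementation, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(meth1, total1, meth2, total2):
    """Two-sided Fisher's exact test for a 2x2 methylation table.

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    n = total1 + total2      # all reads at this site
    k = meth1 + meth2        # all methylated reads at this site
    denom = comb(n, total1)

    def prob(x):
        # P(sample 1 has x methylated reads | margins fixed)
        return comb(k, x) * comb(n - k, total1 - x) / denom

    lo = max(0, total1 - (n - k))
    hi = min(k, total1)
    p_obs = prob(meth1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


p_value = fisher_exact_two_sided(2, 40, 19, 25)
print(p_value)
```

The methylation levels here are 5% versus 76%, so the p-value is very small. With replicates this per-site test ignores biological variation, which is what motivates the Wald-test approaches.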
Wald-test based
• Can handle data with replicates.
• The key is to estimate the within-group variances.
• BSmooth approach (for two-group comparison):
– Denote the group assignment of the ith sample by $X_i$.
– The numbers of replicates in the two groups are $n_1$ and $n_2$.
– Frame the smoothed estimates of $\pi_{ij}$ in a two-group testing framework: $\pi_{ij} = a(l_j) + b(l_j) X_i + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma_j^2)$.
– Use a SAM-like method to estimate $\sigma_j^2$, then do a Wald test.
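Under the two-group model, testing $b(l_j) = 0$ reduces to a two-sample statistic. A sketch, assuming $X_i$ is coded 0 for group 1 and 1 for group 2, a common within-group standard deviation $\sigma_j$, and writing $\bar{\pi}_{1j}$ and $\bar{\pi}_{2j}$ (notation introduced here) for the group means of the smoothed estimates:

$$\hat{b}(l_j) = \bar{\pi}_{2j} - \bar{\pi}_{1j}, \qquad t_j = \frac{\hat{b}(l_j)}{\hat{\sigma}_j \sqrt{1/n_1 + 1/n_2}}$$

Large values of $|t_j|$ point to a difference between groups at position $j$. With few replicates $\hat{\sigma}_j$ is unstable, which is why its estimation is the key step.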
Shrinkage based method (Feng et al. 2014, NAR)
• As in RNA-seq DE analysis, BS-seq data can be modeled with a beta-binomial distribution.
• The beta distribution is parameterized by mean and dispersion, and a log-normal prior is imposed on the dispersions.
• A Wald test procedure can be derived.
Materials and methods

The Bayesian hierarchical model

To characterize the data, we propose the following Bayesian hierarchical model, based on the beta-binomial distribution. Notation for our model is as follows: at the ith CpG site, jth group and kth replicate, $X_{ijk}$ is the number of reads that show methylation, $N_{ijk}$ is the total number of reads that cover this position, and $p_{ijk}$ is the underlying “true” methylation proportion.

$$X_{ijk} \mid p_{ijk}, N_{ijk} \sim \mathrm{Binomial}(N_{ijk}, p_{ijk})$$

$$p_{ijk} \sim \mathrm{Beta}(\mu_{ij}, \phi_{ij})$$

Since sequencing is, from a statistical perspective, a random sampling process, the model assumes that $X_{ijk}$ given $p_{ijk}$ and $N_{ijk}$ follows a binomial distribution. Since the true methylation proportions among replicates can be anywhere between 0 and 1, it is a natural choice to assume that the proportions follow a beta distribution, as it is the most flexible distribution supported on the interval between 0 and 1 and is applicable to a wide variety of disciplines. Here the beta distribution is parameterized by mean (denoted by $\mu_{ij}$) and dispersion (denoted by $\phi_{ij}$). Compared with the traditional parameterization of the Beta($\alpha$, $\beta$) distribution, the parameters have the following relationship:
$$\mu_{ij} = \frac{\alpha}{\alpha + \beta}, \qquad \phi_{ij} = \frac{1}{\alpha + \beta + 1}$$
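The relationship between the two parameterizations can be inverted directly. The sketch below does that and also evaluates the standard beta-binomial variance, which shows how the dispersion inflates the binomial variance. Function names and the example values are assumptions made for illustration.

```python
import math


def beta_shape_from_mean_dispersion(mu, phi):
    """Convert (mean, dispersion) to the usual Beta(alpha, beta) shapes.

    Inverts mu = alpha / (alpha + beta) and phi = 1 / (alpha + beta + 1).
    """
    if not (0 < mu < 1 and 0 < phi < 1):
        raise ValueError("mu and phi must lie strictly between 0 and 1")
    precision = 1.0 / phi - 1.0          # alpha + beta
    return mu * precision, (1.0 - mu) * precision


def beta_binomial_variance(n, mu, phi):
    """Variance of a beta-binomial count with n reads.

    Equals the binomial variance n*mu*(1-mu) times 1 + (n - 1)*phi.
    """
    return n * mu * (1.0 - mu) * (1.0 + (n - 1) * phi)


alpha, beta = beta_shape_from_mean_dispersion(0.3, 0.1)
print(alpha, beta)
print(beta_binomial_variance(20, 0.3, 0.1))
```

As the dispersion goes to 0 the variance approaches the binomial variance, so the dispersion can be read as the extra variation among replicates beyond sampling noise.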
Here, the biological variation among replicates is captured by the beta distribution and the variation due to the random sampling of DNA segments during sequencing is captured by the binomial distribution. The dispersion parameter $\phi_{ij}$ captures the variation of a CpG site’s methylation proportion relative to the group mean. Each CpG site within a single condition (e.g., within cases or controls) is assumed to have its own dispersion. This is a flexible assumption because it allows either different or common dispersions for both conditions.

To combine information across all CpG sites, based on the observed distribution of dispersion from a publicly available RRBS dataset on mouse embryogenesis (27), we assumed the following prior on $\phi_{ij}$:

$$\phi_{ij} \sim \text{log-normal}(m_{0j}, r_{0j}^2)$$

where $m_{0j}$ and $r_{0j}^2$ are mean and variance parameters that can be estimated from the data. We based our choice of a log-normal distribution on the observed distribution of dispersion from a publicly available RRBS dataset on mouse embryogenesis [24]. For each CpG site in this dataset, we applied a MOM estimator to estimate the dispersion parameters. As shown in Figure 6, the genome-wide distribution of the logarithm of the dispersion parameter estimates is approximately Gaussian with mean = -3.39 and SD = 1.08, suggesting that the dispersion parameters can be well described by a log-normal distribution. Note that simulations in which the dispersions are drawn from different distributions show that our proposed method is robust to violations of the log-normal assumption (Supplementary Figure 2).
Parameter estimation
Simulation results
• The Wald test with shrunk dispersion performs favorably compared with other methods.
[Figure graphics not reproduced. Two panels plot the percentage of top-ranked CpG sites that are true DM (y-axis, 30 to 90) against the number of top-ranked CpG sites (x-axis, 200 to 1000) for five methods: 1 = t-test; 2 = Fisher; 3 = Adj. Chisq; 4 = Wald test, naive dispersion; 5 = Wald test, shrunk dispersion.]
Things to consider in DMR calling
• Coverage depth:
– Should one filter out sites with shallow coverage?
• With biological replicates:
– CpG-specific biological variances.
– Small-sample estimates of the variance.
• Spatial correlation of methylation levels among nearby CpG sites:
– Is smoothing appropriate?
– What if the data have low spatial correlation, as with 5hmC?
Existing methods for DML/DMR detection
• BSmooth (Hansen et al. 2012, GB):
– Smoothing, then a two-group t-test on the smoothed values.
• methylKit (Akalin et al. 2012, GB):
– Logistic regression or Fisher’s exact test.
– Recently implemented the DSS Wald test approach.
• BiSeq (Hebestreit et al. 2013, Bioinformatics):
– Smoothing, then a beta GLM on the smoothed values.
• DSS (Feng et al. 2014, NAR):
– Based on a beta-binomial model, with empirical Bayes estimates of dispersions and a Wald test.
– Spatial correlations are ignored.
• MOABS (Sun et al. 2014, GB):
– Based on a beta-binomial model to define CDIF, the lower bound of the CI for the methylation difference between two groups.
– Spatial correlations are ignored.
• methylSig (Park et al. 2014, Bioinformatics):
– Based on a beta-binomial model, with MLE-based dispersion estimation.
– Likelihood ratio test.
• DSS-single (Wu et al. 2015, NAR):
– Works for data with a single replicate per group, using nearby CpG sites as “pseudo-replicates”.
• RADMeth (Dolzhenko et al. 2014, BMC Bioinformatics):
– Based on a beta-binomial GLM; works for multi-factor designs.
Useful bioc packages - bsseq
• First create BSseq objects.
• Use the BSmooth function to smooth.
• fisherTests performs Fisher’s exact test when there are no replicates.
• BSmooth.tstat performs a t-test when there are replicates.
• dmrFinder calls DMRs based on BSmooth.tstat results.

library(bsseq)
BSobj <- BSmooth(BSobj)
# With replicates (three per group here), compute t-statistics, then call DMRs
tstat <- BSmooth.tstat(BSobj, group1 = c("C1", "C2", "C3"),
                       group2 = c("N1", "N2", "N3"))
dmr <- dmrFinder(tstat)
Useful bioc packages - DSS
• Input data has the same format as bsseq.
• DMLtest performs a Wald test at each CpG.
• callDML/callDMR calls DMLs or DMRs.
• More options in DML/DMR calling.

library(DSS)
# Wald test at each CpG, with smoothing over a 500 bp span
dmlTest <- DMLtest(BSobj, group1 = c("C1", "C2", "C3"),
                   group2 = c("N1", "N2", "N3"),
                   smoothing = TRUE, smoothing.span = 500)
dmrs <- callDMR(dmlTest)
Another paradigm: single-read BS-seq analysis
• So far we have focused on “marginal” methylation levels (information aggregated over all reads).
• Sometimes the data at each single read provide additional information.
• Useful reading:
– Xie et al. (2011) NAR.
– Landan et al. (2012) Nat. Genetics.
Single read information
• Methylation entropy or polymorphism.
[...] linkers. Bisulfite modification of genomic DNA was performed with EZ DNA Methylation Gold kit (Zymo Research, Orange, CA, USA) according to the manufacturer’s instructions.

PCR cloning, sequencing and multiple sequence alignments

PCR reactions were performed with Qiagen Hotstart PCR master kit (Qiagen). For each reaction, a 50 µl PCR mixture was prepared with 2 µl (100 ng) bisulfite treated DNA, 50 pmol each forward and reverse primers. The primers used in the PCR runs for genomic locus 1 (chr9:139174924-139175041) are 5′-GGT TAT TTT TTT TTT AGT TTT GGT TTA GAT ATG A-3′ and 5′-TTT CTC CAA TCT TAA CTT AAA CAT AAT TCC-3′. The primers used in the PCR runs for genomic locus 2 (chr10:134480046-134480230) are 5′-AAA TAT AAT TTA GAA GGT ATT GTA GAT GTA AAT G-3′ and 5′-CAT AAC TTA AAA AAT ATT ACA AAT ATA AAT ACC AAC-3′. The PCR products with appropriate size were gel-purified and cloned with TOPO vectors (Invitrogen). Sequencing reactions for colonies were conducted at the Sequencing Core Facility of the Children’s Memorial Research Center of Northwestern University’s Feinberg School of Medicine. To ensure an accurate calculation of the fidelity of inheritance of DNA methylation, sequence reads containing unconverted cytosines at non-CpG sites, due to incomplete bisulfite conversion, were discarded. After the removal of vector and primer sequences, the sequence reads obtained were subjected to multiple alignments together with a reference sequence for the corresponding genomic locus. Multiple sequence alignments were performed with ClustalW (24).

Statistical analysis of the association between methylation entropies and DNA related attributes

The statistical analyses were conducted as previously described (22). Briefly, we compiled a comprehensive list of attributes that can be linked directly to the genomic regions of interest. The data for most of these attributes were calculated based on the UCSC Genome Annotation Database (25). The attributes for DNA sequence features were directly calculated based on the DNA sequence extracted from the human genome. All the attributes are either in numerical form or boolean form (such as present in gene or not). The non-parametric Wilcoxon rank-sum test and the chi-square test were performed for each attribute in numerical form or boolean form, respectively. Significance thresholds were adjusted for multiple testing using the highly conservative Bonferroni method, and the family-wise error rate was set to be <1%.

RESULTS

The definition and statistical assessment of methylation entropy

Traditionally, DNA methylation data analysis is based on the determination of the average methylation level (the percentage of methylated CpG) of one or more contiguous CpG sites. Such a conventional approach is unable to dissect DNA methylation patterns, which are herein defined as the combination of methylation statuses of contiguous CpG dinucleotides in a DNA strand. In order to better decode epigenetic data, we defined ‘methylation entropy’ and exploited it to assess the variability of DNA methylation pattern that might be observed for a given genomic locus in a cell population. The concept of entropy was first introduced by Rudolf Clausius as a thermodynamic property and later modified as Shannon entropy in information theory to measure the degree of uncertainty associated with a stochastic event (26).

$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$

An important variable in the entropy equation is the probability P(x) for a given event x. A frequently used example to interpret the concept of Shannon entropy is tossing a coin, which has two possible outcomes. Since it is a random event, the probability for heads or tails would be 0.5. Similarly, the methylation status (methylated or unmethylated) of a CpG dinucleotide could be considered as heads or tails but may not be random. Thus, the concept of entropy could be modified to quantitatively assess the variation in DNA methylation patterns.

To calculate methylation entropy, the following parameters were introduced to the original entropy formula: (i) number of CpG sites in a given genomic locus; (ii) number of sequence reads generated for a genomic locus and (iii) frequency of each distinct DNA methylation pattern observed in a genomic locus, calculated based upon the sequence reads that were generated for the locus (Figure 1A). The probability of a given event in the Shannon entropy equation was replaced with the frequency of a [...]
$$ME = \frac{e}{b} \sum_{i} \left( -\frac{n_i}{N} \log \frac{n_i}{N} \right)$$

ME: methylation entropy
e: entropy for code bit
b: number of CpG sites
n_i: observed occurrence of methylation pattern i
N: total number of sequence reads generated

[Figure graphics not reproduced. Panels B–E show example loci with ME = 0, ME = 0, ME = 0.1875 and ME = 1.]
Figure 1. The formula of methylation entropy and the examples for genomic loci with various methylation entropies in a cell population. (A) The formula of methylation entropy. The determination of methylation entropy requires three parameters: the number of CpG sites, the total number of sequence reads generated and the occurrence of each methylation pattern. (B–E) Genomic loci with various methylation entropies.
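The methylation entropy formula is straightforward to compute from per-read patterns. The sketch below assumes a base-2 logarithm and e = 1, which are assumptions chosen so that ME runs from 0 (one pattern only) to 1 (all patterns equally frequent). The function name and the '0'/'1' string encoding of reads are also hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def methylation_entropy(reads, e=1.0):
    """Methylation entropy of a locus from per-read methylation patterns.

    reads: equal-length strings over {'0', '1'}, one character per CpG
           site (1 = methylated). e is the entropy-per-code-bit constant.
    """
    if not reads:
        raise ValueError("need at least one read")
    b = len(reads[0])                       # number of CpG sites
    if b == 0 or any(len(r) != b for r in reads):
        raise ValueError("all reads must cover the same CpG sites")
    total = len(reads)                      # N
    counts = Counter(reads)                 # n_i for each observed pattern
    entropy = sum(-(n / total) * log2(n / total) for n in counts.values())
    return e / b * entropy


uniform_reads = ["1111"] * 16                                # one pattern only
mixed_reads = ["".join(p) for p in product("01", repeat=4)]  # all 16 patterns once
print(methylation_entropy(uniform_reads))
print(methylation_entropy(mixed_reads))
```

Both a fully methylated and a fully unmethylated locus give ME = 0 because every read carries the same pattern. The average methylation level cannot distinguish a locus with half fully methylated and half fully unmethylated reads from one with random patterns, but ME can.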
Nucleic Acids Research, 2011 3
Xie et al. (2011) NAR
What single reads tell us
• Comparison of methylation entropy/polymorphism among different samples.
• Sample deconvolution:
– Zheng et al. (2014) GB: MethylPurify.
– Estimates the proportion of cell types in a mixed sample (such as a tumor), as well as calling DMRs.
MethylPurify
[...] with the smallest parameter variance in the 50 sampling and uses the mode of their α1 estimate as the α1 for the whole tumor sample (Figure 1e,f). With the sample α1, a few EM iterations in each bin could quickly converge on the m1 and m2 estimates and read assignment across the genome. To avoid local maxima of EM, MethylPurify starts from two distinct initial values of m1 and m2 in each bin, representing α1 component being hyper- and hypo-methylated, and the convergence point with higher likelihood is selected as the final prediction (see Methods section for details).

The output of MethylPurify will report the mixing ratio of the two components (α1 : 1 − α1) in the whole sample and the methylation level of each component (m1 and m2) in each qualifying bin across the genome. MethylPurify could also detect differentially methylated regions (DMRs) as consecutive differentially methylated bins (DMBs).

Inference of mixing ratio from simulated mixture of bisulfite reads from tumor and normal cell lines
To validate MethylPurify in estimating the mixing ratio, we used simulated mixture of whole genome bisulfite sequencing data from two separate breast cell lines [22]. HCC1954 cell line (thereafter refer to as HCC) is derived from an estrogen receptor (ER)/progesterone receptor (PR) negative and ERBB2 positive breast tumor, and human mammary epithelial cell line (HMEC) is immortalized from normal breast epithelial cells. Bisulfite sequencing for the two cell lines have slightly different read lengths (approximately 70 to 100 bp) and sequencing coverage (27-fold and 20-fold, respectively). We randomly sampled bisulfite reads from the two cell lines at 20-fold total coverage with varying mixing ratios from 0:1 (all HMEC) to 1:0 (all HCC) with a step of 0.05.

We first examined how the parameter estimation varies with changing inputs. At different mixing ratios, the average variance (of all qualifying bins by bootstrapping) of the minor component percentage α1 is very small and stable (Figure 2a). The variance of α1 initially increases with the mean of α1, but is suppressed as α1 approaches 0.5 since α1 is designated as the minor component to be always ≤0.5 in our model. In contrast, the estimated methylation level of the minor component m1 is the most variable. This is reasonable because at low α1 (close to 0), the minor component has very little read coverage; at high α1 (close to 0.5), it is sometimes difficult to determine which component is minor so m1 could fluctuate depending on whether MethylPurify assigns the methylated or unmethylated reads to the minor component.

Since m1 is the most variable among the three parameters and dominates the sum of the variances, MethylPurify later only uses the standard deviation (stdev) of m1 from bootstrapping to rank all qualifying bins. Indeed, [...]

Figure 1. Overview of MethylPurify. (a) A differentially methylated region (DMR) between tumor and normal cells. Solid and hollow red circles represent methylated and unmethylated cytosines, respectively. (b) Short reads from two cell populations after bisulfite treatment and sonication. (c) A library of bisulfite reads in a mixture of two cell populations. (d) EM algorithm iteratively estimates three parameters: the minor composition (α1) and the methylation level of each population (m1, m2) in M step, and assigns reads to each population in E step. (e) Among all 300 bp bins, the parameters estimated from informative bins converge on a final mixing ratio estimate. (f) Top, density plot of predicted minor component from selected informative bins. Bottom, separated methylation level of tumor and normal cells based on the predicted mixing ratio, and DMRs are detected as consecutive differentially methylated bins (DMBs).

Zheng et al. Genome Biology 2014, 15:419. http://genomebiology.com/2014/15/8/419
Conclusion on BS-seq analyses
• Be careful with alignments.
• Data modeling is different from ChIP/RNA-seq: those use Poisson/NB models, whereas BS-seq uses binomial models.
• DMR calling needs to consider spatial correlation, coverage, and biological variances.
• Single-read analysis could be very useful.
• There is a lot of room for method development.
Detecting long-range interactions
• So far we have assumed the genome is a long line.
• In reality, chromosomes fold into complicated structures in the nucleus. Implications:
– Genomic loci far apart on a chromosome can be spatially close because of chromosome folding.
– This is important for studying gene regulatory mechanisms, e.g., detecting enhancers.
• Traditional lower-throughput methods:
– 3C: Chromosome Conformation Capture.
– 5C: Carbon-Copy Chromosome Conformation Capture.
• High-throughput: Hi-C.
Hi-C experimental procedures

[...] (12, 13). Interestingly, chromosome 18, which is small but gene-poor, does not interact frequently with the other small chromosomes; this agrees with FISH studies showing that chromosome 18 tends to be located near the nuclear periphery (14).

We then zoomed in on individual chromosomes to explore whether there are chromosomal regions that preferentially associate with each other. Because sequence proximity strongly influences contact probability, we defined a normalized contact matrix M* by dividing each entry in the contact matrix by the genome-wide average contact probability for loci at that genomic distance (10). The normalized matrix shows many large blocks of enriched and depleted interactions, generating a plaid pattern (Fig. 3B). If two loci (here 1-Mb regions) are nearby in space, we reasoned that they will share neighbors and have correlated interaction profiles. We therefore defined a correlation matrix C in which c_ij is the Pearson correlation between the ith row and jth column of M*. This process dramatically sharpened the plaid pattern (Fig. 3C); 71% of the resulting matrix entries represent statistically significant correlations (P ≤ 0.05).
The plaid pattern suggests that each chromo-some can be decomposed into two sets of loci(arbitrarily labeled A and B) such that contactswithin each set are enriched and contacts betweensets are depleted.We partitioned each chromosome
Fig. 1. Overview of Hi-C. (A) Cells are cross-linked with formaldehyde, resulting in covalent links between spatially adjacent chromatin segments (DNA fragments shown in dark blue, red; proteins, which can mediate such interactions, are shown in light blue and cyan). Chromatin is digested with a restriction enzyme (here, HindIII; restriction site marked by dashed line; see inset), and the resulting sticky ends are filled in with nucleotides, one of which is biotinylated (purple dot). Ligation is performed under extremely dilute conditions to create chimeric molecules; the HindIII site is lost and an NheI site is created (inset). DNA is purified and sheared. Biotinylated junctions are isolated with streptavidin beads and identified by paired-end sequencing. (B) Hi-C produces a genome-wide contact matrix. The submatrix shown here corresponds to intrachromosomal interactions on chromosome 14. (Chromosome 14 is acrocentric; the short arm is not shown.) Each pixel represents all interactions between a 1-Mb locus and another 1-Mb locus; intensity corresponds to the total number of reads (0 to 50). Tick marks appear every 10 Mb. (C and D) We compared the original experiment with results from a biological repeat using the same restriction enzyme [(C), range from 0 to 50 reads] and with results using a different restriction enzyme [(D), NcoI, range from 0 to 100 reads].
Fig. 2. The presence and organization of chromosome territories. (A) Probability of contact decreases as a function of genomic distance on chromosome 1, eventually reaching a plateau at ~90 Mb (blue). The level of interchromosomal contact (black dashes) differs for different pairs of chromosomes; loci on chromosome 1 are most likely to interact with loci on chromosome 10 (green dashes) and least likely to interact with loci on chromosome 21 (red dashes). Interchromosomal interactions are depleted relative to intrachromosomal interactions. (B) Observed/expected number of interchromosomal contacts between all pairs of chromosomes. Red indicates enrichment, and blue indicates depletion (range from 0.5 to 2). Small, gene-rich chromosomes tend to interact more with one another, suggesting that they cluster together in the nucleus.
9 OCTOBER 2009 VOL 326 SCIENCE www.sciencemag.org290
REPORTS
Hi-C data
• Paired-end sequencing: each read pair corresponds to a pair of interacting regions.
• The counts are usually summarized into a 2D matrix:
  – First cut the genome into N equal-sized bins (the bin size depends on sequencing depth).
  – Summarize the read counts into an N×N matrix. Element (i, j) is the number of pairs with one end in the ith bin and the other end in the jth bin.
  – The counts represent the strength of interaction.
  – The counts on and near the diagonal are usually the largest, because loci that are close on the linear genome contact each other most often.
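The binning step can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the function name `contact_matrix`, the single-chromosome coordinates, and the toy read pairs are all assumptions made for the example.

```python
def contact_matrix(pairs, genome_length, bin_size):
    """Count read pairs per (bin_i, bin_j).

    pairs: iterable of (pos1, pos2), 0-based positions on one chromosome
           (hypothetical input format chosen for this sketch).
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # mirror the count so the matrix stays symmetric
    return matrix


# Toy example: a 5 Mb chromosome cut into 1 Mb bins.
pairs = [
    (100_000, 150_000),      # both ends in bin 0 (diagonal)
    (200_000, 1_200_000),    # bins 0 and 1
    (4_100_000, 300_000),    # bins 4 and 0 (long range)
    (2_500_000, 2_600_000),  # both ends in bin 2 (diagonal)
]
m = contact_matrix(pairs, genome_length=5_000_000, bin_size=1_000_000)
for row in m:
    print(row)
```

With real data the positions would come from the aligned read pairs, and a smaller bin size is only worthwhile when the sequencing depth is high enough to fill the matrix.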
Visualize Hi-C data in a heatmap
in this way by using principal component analysis. For all but two chromosomes, the first principal component (PC) clearly corresponded to the plaid pattern (positive values defining one set, negative values the other) (fig. S1). For chromosomes 4 and 5, the first PC corresponded to the two chromosome arms, but the second PC corresponded to the plaid pattern. The entries of the PC vector reflected the sharp transitions from compartment to compartment observed within the plaid heatmaps. Moreover, the plaid patterns within each chromosome were consistent across chromosomes: the labels (A and B) could be assigned on each chromosome so that sets on different chromosomes carrying the same label had correlated contact profiles, and those carrying different labels had anticorrelated contact profiles (Fig. 3D). These results imply that the entire genome can be partitioned into two spatial compartments such that greater interaction occurs within each compartment rather than across compartments.
The Hi-C data imply that regions tend to be closer in space if they belong to the same compartment (A versus B) than if they do not. We tested this by using 3D-FISH to probe four loci (L1, L2, L3, and L4) on chromosome 14 that alternate between the two compartments (L1 and L3 in compartment A; L2 and L4 in compartment B) (Fig. 3, E and F). 3D-FISH showed that L3 tends to be closer to L1 than to L2, despite the fact that L2 lies between L1 and L3 in the linear genome sequence (Fig. 3E). Similarly, we found that L2 is closer to L4 than to L3 (Fig. 3F). Comparable results were obtained for four consecutive loci on chromosome 22 (fig. S2, A and B). Taken together, these observations confirm the spatial compartmentalization
Fig. 3. The nucleus is segregated into two compartments corresponding to open and closed chromatin. (A) Map of chromosome 14 at a resolution of 1 Mb exhibits substructure in the form of an intense diagonal and a constellation of large blocks (three experiments combined; range from 0 to 200 reads). Tick marks appear every 10 Mb. (B) The observed/expected matrix shows loci with either more (red) or less (blue) interactions than would be expected, given their genomic distance (range from 0.2 to 5). (C) Correlation matrix illustrates the correlation [range from –1 (blue) to +1 (red)] between the intrachromosomal interaction profiles of every pair of 1-Mb loci along chromosome 14. The plaid pattern indicates the presence of two compartments within the chromosome. (D) Interchromosomal correlation map for chromosome 14 and chromosome 20 [range from –0.25 (blue) to 0.25 (red)]. The unalignable region around the centromere of chromosome 20 is indicated in gray. Each compartment on chromosome 14 has a counterpart on chromosome 20 with a very similar genome-wide interaction pattern. (E and F) We designed probes for four loci (L1, L2, L3, and L4) that lie consecutively along chromosome 14 but alternate between the two compartments [L1 and L3 in (compartment A); L2 and L4 in (compartment B)]. (E) L3 (blue) was consistently closer to L1 (green) than to L2 (red), despite the fact that L2 lies between L1 and L3 in the primary sequence of the genome. This was confirmed visually and by plotting the cumulative distribution. (F) L2 (green) was consistently closer to L4 (red) than to L3 (blue). (G) Correlation map of chromosome 14 at a resolution of 100 kb. The PC (eigenvector) correlates with the distribution of genes and with features of open chromatin. (H) A 31-Mb window from chromosome 14 is shown; the indicated region (yellow dashes) alternates between the open and the closed compartments in GM06990 (top, eigenvector and heatmap) but is predominantly open in K562 (bottom, eigenvector and heatmap). The change in compartmentalization corresponds to a shift in chromatin state (DNAseI).
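The observed/expected matrix of panel (B) and the correlation matrix of panel (C) can be sketched as follows. This is a simplified illustration under stated assumptions: the expected value at each genomic distance is taken as the plain mean of that diagonal within one chromosome, the function names are invented for the example, and the toy matrix (compartment pattern "AABBAABB") is synthetic. The PCA step that turns the correlation matrix into compartment labels is not shown.

```python
import math


def observed_over_expected(m):
    """Divide each entry by the mean of all entries at the same |i - j|."""
    n = len(m)
    expected = []
    for d in range(n):
        diag = [m[i][i + d] for i in range(n - d)]
        expected.append(sum(diag) / len(diag))
    return [
        [m[i][j] / expected[abs(i - j)] if expected[abs(i - j)] > 0 else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(m_star):
    """c[i][j] = Pearson correlation between row i and column j of m_star."""
    n = len(m_star)
    cols = [[m_star[r][j] for r in range(n)] for j in range(n)]
    return [[pearson(m_star[i], cols[j]) for j in range(n)] for i in range(n)]


# Synthetic example: counts decay with distance and are 4x higher when
# both bins carry the same (made-up) compartment label.
labels = "AABBAABB"
n = len(labels)
m = [
    [100.0 / (1 + abs(i - j)) * (4.0 if labels[i] == labels[j] else 1.0)
     for j in range(n)]
    for i in range(n)
]
m_star = observed_over_expected(m)
c = correlation_matrix(m_star)
print(round(c[0][4], 3), round(c[0][2], 3))  # same label vs. different label
```

In this toy case the distance decay cancels out of the observed/expected matrix, so only the compartment pattern remains; bins 0 and 4 (same label) are positively correlated, and bins 0 and 2 (different labels) are negatively correlated.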
Overlay with other 1-D data
share this feature of classical insulators. A classical boundary element is also known to stop the spread of heterochromatin. Therefore, we examined the distribution of the heterochromatin mark H3K9me3 in humans and mice in relation to the topological domains (refs 12, 13). Indeed, we observe a clear segregation of H3K9me3 at the boundary regions that occurs predominately in differentiated cells (Fig. 2d, e and Supplementary Fig. 11). As the boundaries that we analysed in Fig. 2d are present in both pluripotent cells and their differentiated progeny, the topological domains and boundaries appear to pre-mark the end points of heterochromatic spreading. Therefore, the domains do not seem to be a consequence of the formation of heterochromatin. Taken together, the above observations strongly suggest that the topological domain boundaries correlate with regions of the genome displaying classical insulator and barrier element activity, thus revealing a
[Figure panels omitted. Tracks in panel a: CTCF, H3K4me3, RNA PolII, p300, H3K4me1, DI, HMM state, Domains.]
Figure 1 | Topological domains in the mouse ES cell genome. a, Normalized Hi-C interaction frequencies displayed as a two-dimensional heat map overlayed on ChIP-seq data (from Y. Shen et al., manuscript in preparation), directionality index (DI), HMM bias state calls, and domains. For both directionality index and HMM state calls, downstream bias (red) and upstream bias (green) are indicated. b, Schematic illustrating topological domains and resulting directional bias. c, Distribution of the directionality index (absolute value, in blue) compared to random (red). d, Mean interaction frequencies at all genomic distances between 40 kb to 2 Mb. Above 40 kb, the intra- versus inter-domain interaction frequencies are significantly different (P < 0.005, Wilcoxon test). e, Box plot of all interaction frequencies at 80-kb distance. Intra-domain interactions are enriched for high-frequency interactions. f–i, Diagram of intra-domain (f) and inter-domain FISH probes (g) and the genomic distance between pairs (h). i, Bar chart of the squared inter-probe distance (from ref. 6) FISH probe pairs. mESC, mouse ES cell. Error bars indicate standard error (n = 100 for each probe pair).
[Figure panels omitted. Panel c labels: all CTCF sites 31,968; boundary associated 4,846; non-boundary associated 27,122.]
Figure 2 | Topological boundaries demonstrate classical insulator or barrier element features. a, Two-dimensional heatmap surrounding the Hoxa locus and CS5 insulator in IMR90 cells. b, Enrichment of CTCF at boundary regions. c, The portion of CTCF binding sites that are considered ‘associated’ with a boundary (within ±20-kb window is used as the expected uncertainty due to 40-kb binning). d, Heat maps of H3K9me3 at boundary sites in human and mouse. e, UCSC Genome Browser shot showing heterochromatin spreading in the human ES cells (hESC) and IMR90 cells. The two-dimensional heat map shows the interaction frequency in human ES cells. f, Heat map of LADs (from ref. 14) surrounding the boundary regions. Scale is the log2 ratio of DNA adenosine methylation (Dam)–lamin B1 fusion over Dam alone (Dam–laminB1/Dam).
RESEARCH LETTER
2 | N A T U R E | V O L 0 0 0 | 0 0 M O N T H 2 0 1 2
Macmillan Publishers Limited. All rights reserved©2012
Data analysis
• Normalization.
• An easier one: defining domains (regions with a higher level of self-interaction).
• A harder one: finding long-range interactions.
• Others: inferring 3D structures.
• Barely touched: comparison (differential domains).
Normalization
• Consider the distance between read pairs, GC content, mappability, etc. to create a baseline of counts (the expected number of reads in each element of the matrix).
• Subtract the baseline from the observed counts (or divide the observed counts by it) to get the signal.
• A couple of approaches:
  – Yaffe and Tanay (2011) Nature Genetics: likelihood based.
  – Imakaev et al. (2012) Nature Methods: assume equal visibility at all loci and apply a median-polish type of correction (iteratively divide by the row/column sums).
• Results: usually improved correlation among replicates.
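The "equal visibility" idea can be illustrated with a small iterative-correction loop. This is a sketch of the general technique (repeatedly dividing a symmetric matrix by its relative row sums on both sides), not the published implementation: the function name, the fixed iteration count, and the toy matrix with made-up per-bin biases are assumptions, and real pipelines also filter low-coverage bins first.

```python
def iterative_correction(m, n_iter=100):
    """Rescale a symmetric contact matrix so all rows have equal sums.

    Returns (corrected_matrix, bias), where bias[i] is the accumulated
    per-bin factor that was divided out.
    """
    n = len(m)
    w = [list(map(float, row)) for row in m]  # working copy
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in w]
        mean = sum(sums) / n
        # Relative visibility of each bin; empty rows are left untouched.
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                w[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return w, bias


# Toy example: a "true" matrix with equal row sums, distorted by
# hypothetical per-bin biases b[i] * b[j].
true = [[4, 2, 1, 1], [2, 4, 1, 1], [1, 1, 4, 2], [1, 1, 2, 4]]
b = [1.0, 2.0, 0.5, 1.5]
observed = [[b[i] * b[j] * true[i][j] for j in range(4)] for i in range(4)]

corrected, bias = iterative_correction(observed)
print([round(sum(row), 6) for row in corrected])  # row sums are now equal
```

On this toy input the corrected matrix is proportional to the undistorted one, and the recovered bias vector is proportional to `b`.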
Domain detection
• The genome is organized into different “domains”.
• They can be seen as the blocks along the diagonal of the heatmap.
• To detect them, use the fact that interactions are higher within a domain and lower across domains.
• Still an open statistical problem.
Domain detection by HMM (Dixon et al. 2012, Nature)
• Compute the directionality index (DI).
• Run a 2-state HMM on the DI, assuming Gaussian emissions.
• Define domains based on the HMM results: a domain starts at the beginning of an “up” region and ends at the end of the next “down” region.
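The first step, the directionality index, can be sketched as below. The formula is the chi-square-like statistic usually given for DI: with A the reads from a bin to its upstream window, B the reads to its downstream window, and E = (A + B)/2, DI = sign(B − A) · [(A − E)²/E + (B − E)²/E]. The window size, the function name, and the toy two-domain matrix are assumptions for illustration, and the HMM step is not shown.

```python
def directionality_index(m, window):
    """DI per bin from a symmetric contact matrix m.

    window: number of bins to look upstream and downstream (an assumption
    here; windows are simply truncated at the chromosome ends).
    """
    n = len(m)
    di = []
    for i in range(n):
        a = sum(m[i][j] for j in range(max(0, i - window), i))          # upstream
        b = sum(m[i][j] for j in range(i + 1, min(n, i + window + 1)))  # downstream
        e = (a + b) / 2.0
        if a == b or e == 0:
            di.append(0.0)
            continue
        sign = 1.0 if b > a else -1.0
        di.append(sign * ((a - e) ** 2 / e + (b - e) ** 2 / e))
    return di


# Toy example: two domains (bins 0-3 and 4-7); 10 reads within a domain,
# 1 read between domains.
n = 8
m = [[10 if (i < 4) == (j < 4) else 1 for j in range(n)] for i in range(n)]
di = directionality_index(m, window=3)
print([round(x, 2) for x in di])
```

In the toy output the DI is positive in the upstream half of each domain and negative in the downstream half, so the switch from negative (bin 3) to positive (bin 4) marks the boundary.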
W W W. N A T U R E . C O M / N A T U R E | 3 1
SUPPLEMENTARY INFORMATION RESEARCH
[Figure panels omitted. Panel a: histograms of Domain Size (bp) versus Frequency, titled Lamina Associated Domains and Topological Domains and annotated Median ~ 454 kb and Median ~ 880 kb. Panel b: browser view of chr12:101000000-104000000 with Normalized Interacting Counts, Directionality Index, and log(Lamin B1 DamID / DamID) tracks.]
Supplementary Figure 12. Comparison of Topological Domains with Lamina Associated Domains (LADs). a, Histogram showing the size distribution of the topological domains and the LADs. Generally, LADs are smaller in size than topological domains. b, Genome browser shot showing a region on chromosome 12 with multiple topological domains, one of which appears to be entirely lamina-associated, with the remainder being non-lamina associated.
Detecting long-range interactions
• The interactions can be seen on the heatmap as bright, off-diagonal spots.
• A harder problem, partly because there are not enough reads.
• Still an open statistical problem. A simple method is a Poisson test, with the baseline rates computed from all the data.
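A minimal sketch of such a Poisson test is shown below. The choice of baseline (the mean count of all bin pairs at the same genomic distance) is one simple option and an assumption of this example, as are the function names and the synthetic matrix. The upper tail is computed as 1 − CDF, which loses precision for extremely small p-values.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the pmf up to k - 1."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for x in range(1, k):
        term *= lam / x    # P(X = x) from P(X = x - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def interaction_pvalues(m):
    """One-sided Poisson p-value for every bin pair i < j.

    The baseline rate for a pair is the mean count over all pairs at the
    same distance j - i (a deliberately simple baseline for this sketch).
    """
    n = len(m)
    expected = [
        sum(m[i][i + d] for i in range(n - d)) / (n - d) for d in range(n)
    ]
    return {
        (i, j): poisson_upper_tail(m[i][j], expected[j - i])
        for i in range(n)
        for j in range(i + 1, n)
    }


# Synthetic example: counts depend only on distance, plus one planted
# long-range interaction between bins 0 and 4.
base = {0: 50, 1: 20, 2: 10}
n = 10
m = [[base.get(abs(i - j), 5) for j in range(n)] for i in range(n)]
m[0][4] = m[4][0] = 30

pvals = interaction_pvalues(m)
print(pvals[(0, 4)], pvals[(1, 5)])
```

The planted pair (0, 4) gets a very small p-value, while an ordinary pair at the same distance such as (1, 5) does not. With a full matrix the number of tests is large, so some multiple-testing adjustment would be needed in practice.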
of the genome inferred from Hi-C. More generally, a strong correlation was observed between the number of Hi-C reads mij and the 3D distance between locus i and locus j as measured by FISH [Spearman’s r = –0.916, P = 0.00003 (fig. S3)], suggesting that Hi-C read count may serve as a proxy for distance.
Upon close examination of the Hi-C data, we noted that pairs of loci in compartment B showed a consistently higher interaction frequency at a given genomic distance than pairs of loci in compartment A (fig. S4). This suggests that compartment B is more densely packed (15). The FISH data are consistent with this observation; loci in compartment B exhibited a stronger tendency for close spatial localization.
To explore whether the two spatial compartments correspond to known features of the genome, we compared the compartments identified in our 1-Mb correlation maps with known genetic and epigenetic features. Compartment A correlates strongly with the presence of genes (Spearman’s r = 0.431, P < 10^–137), higher expression [via genome-wide mRNA expression, Spearman’s r = 0.476, P < 10^–145 (fig. S5)], and accessible chromatin [as measured by deoxyribonuclease I (DNAseI) sensitivity, Spearman’s r = 0.651, P negligible] (16, 17). Compartment A also shows enrichment for both activating (H3K36 trimethylation, Spearman’s r = 0.601, P < 10^–296) and repressive (H3K27 trimethylation, Spearman’s r = 0.282, P < 10^–56) chromatin marks (18).
We repeated the above analysis at a resolution of 100 kb (Fig. 3G) and saw that, although the correlation of compartment A with all other genomic and epigenetic features remained strong (Spearman’s r > 0.4, P negligible), the correlation with the sole repressive mark, H3K27 trimethylation, was dramatically attenuated (Spearman’s r = 0.046, P < 10^–15). On the basis of these results we concluded that compartment A is more closely associated with open, accessible, actively transcribed chromatin.
We repeated our experiment with K562 cells, an erythroleukemia cell line with an aberrant karyotype (19). We again observed two compartments; these were similar in composition to those observed in GM06990 cells [Pearson’s r = 0.732,
Fig. 4. The local packing of chromatin is consistent with the behavior of a fractal globule. (A) Contact probability as a function of genomic distance averaged across the genome (blue) shows a power law scaling between 500 kb and 7 Mb (shaded region) with a slope of –1.08 (fit shown in cyan). (B) Simulation results for contact probability as a function of distance (1 monomer ~ 6 nucleosomes ~ 1200 base pairs) (10) for equilibrium (red) and fractal (blue) globules. The slope for a fractal globule is very nearly –1 (cyan), confirming our prediction (10). The slope for an equilibrium globule is –3/2, matching prior theoretical expectations. The slope for the fractal globule closely resembles the slope we observed in the genome. (C) (Top) An unfolded polymer chain, 4000 monomers (4.8 Mb) long. Coloration corresponds to distance from one endpoint, ranging from blue to cyan, green, yellow, orange, and red. (Middle) An equilibrium globule. The structure is highly entangled; loci that are nearby along the contour (similar color) need not be nearby in 3D. (Bottom) A fractal globule. Nearby loci along the contour tend to be nearby in 3D, leading to monochromatic blocks both on the surface and in cross section. The structure lacks knots. (D) Genome architecture at three scales. (Top) Two compartments, corresponding to open and closed chromatin, spatially partition the genome. Chromosomes (blue, cyan, green) occupy distinct territories. (Middle) Individual chromosomes weave back and forth between the open and closed chromatin compartments. (Bottom) At the scale of single megabases, the chromosome consists of a series of fractal globules.
Comparison, e.g., differential interaction
• Barely touched (people still struggle with domains and interactions).
• Conceptually, one wants to compare the interactions between different samples, e.g., locus A interacts with locus B in normal cells but not in cancer cells.
• For an element in the matrix, can we take the counts and then use RNA-seq DE test methods?
  – No! The backgrounds could be different. This is similar to the ChIP-seq differential binding problem.
  – Also, neighboring elements in the matrix need to be combined to make inferences (as in ChIP-seq, but combined in 2D), so some (kernel) smoothing is needed.
Construct 3D structure
• BACH (Bayesian 3D constructor for Hi-C data), Hu et al. (2013) PLoS Computational Biology
  – The read counts reflect the physical distances between pairs of loci on the genome: loci that are closer in space tend to have more reads.
  – Given these distances, the 3D structure can be estimated.
  – Based on a Poisson model, and with some constraints, the 3D coordinates of each bin on the genome can be estimated.
  – The estimation procedure is based on MCMC.
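The idea of linking counts to 3D coordinates through a Poisson model can be illustrated with a simplified log-likelihood. This is not the BACH implementation: it assumes a bare power-law link, count_ij ~ Poisson(exp(beta0) · d_ij^beta1), with no bias covariates, no constraints and no MCMC, and the function name, parameter values and toy coordinates are all made up for the example.

```python
import math


def poisson_loglik(counts, coords, beta0, beta1):
    """Log-likelihood of counts[i][j] ~ Poisson(exp(beta0) * d_ij ** beta1),

    where d_ij is the Euclidean distance between 3D points coords[i] and
    coords[j]. Only pairs i < j are used.
    """
    ll = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            log_rate = beta0 + beta1 * math.log(d)
            k = counts[i][j]
            ll += k * log_rate - math.exp(log_rate) - math.lgamma(k + 1)
    return ll


# Toy example: four bins on a straight line, one unit apart.
beta0, beta1 = math.log(100.0), -1.0  # rate = 100 / distance (hypothetical)
true_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
n = len(true_coords)
counts = [
    [0 if i == j else round(100.0 / abs(i - j)) for j in range(n)]
    for i in range(n)
]

# A wrong structure: bins 1 and 3 swapped in space.
swapped_coords = [true_coords[0], true_coords[3], true_coords[2], true_coords[1]]

ll_true = poisson_loglik(counts, true_coords, beta0, beta1)
ll_swapped = poisson_loglik(counts, swapped_coords, beta0, beta1)
print(ll_true > ll_swapped)
```

The structure that generated the counts scores a higher likelihood than the scrambled one; a sampler such as MCMC would explore coordinates (and the beta parameters) under a likelihood of this general kind.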
Conclusion on Hi-C data
• A technology to detect chromosomal interactions using sequencing.
• Usually requires more reads.
• Analysis methods are still in their infancy. A lot of room for development.
A grand overview of the class
• The technologies and statistical methods for:
  – Gene expression microarrays and a little bit of ChIP-chip.
  – Second-generation sequencing: ChIP-seq and RNA-seq.
• Bioconductor tools for analyzing genomic data, including:
  – Biostrings, BSgenome, GenomicRanges, GenomicFeatures for general genomic data.
  – A little bit of Rsamtools for sequencing data.
  – Several Bioconductor packages for DE/DM analyses in:
    • microarray: siggenes, limma.
    • RNA-seq: DESeq, edgeR, DSS.
    • BS-seq: bsseq, DSS.
• Some software tools for analyzing sequence data:
  – bowtie: alignment.
  – samtools: for manipulating SAM/BAM files.