
A method to recognize universal patterns in genome structure using Hi-C

Neil Chowdhury Phillips Exeter Academy

Mentor: Sameer Abraham

Massachusetts Institute of Technology

Introduction to Hi-C


Figure labels: Hi-C matrix of the Rao et al. (2014) GM12878 genome; chromatin structure in the cell nucleus; Region 1; Region 2; Hi-C process

• Hi-C: a process for finding contact probabilities between every pair of regions in a DNA strand
• Graphed as a square adjacency matrix (rows and columns are regions, cells are probabilities)
• A contact probability measures the interaction frequency of two regions

Patterns in Hi-C

Datasets used: Schwarzer et al. (2017) NIPBL mutant, Rao et al. (2014) GM12878

Large scale Cis interactions

chr2 chr3

Compartmentalization


cis trans

• Checkerboard pattern
• Seen in both cis and trans interactions
• Pattern captured by the sign of the eigenvector
• It is clear that there are subcompartments within the A and B compartments
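The idea that the sign of an eigenvector captures the checkerboard can be sketched on a toy matrix. This is a minimal illustration, not the pipeline used in the talk: the 6-bin matrix, the function names, the simple mean-centering step, and the choice to orient the eigenvector so that bin 0 is labelled A are all assumptions made for the example. Real analyses typically normalize the Hi-C map more carefully before the eigendecomposition.

```python
import math

def center(matrix):
    """Subtract the overall mean so the checkerboard becomes +/- deviations."""
    n = len(matrix)
    mean = sum(sum(row) for row in matrix) / (n * n)
    return [[value - mean for value in row] for row in matrix]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    return vec

# Toy data (hypothetical): bins in the same compartment contact more often.
truth = "AABBAB"
contacts = [[2.0 if a == b else 1.0 for b in truth] for a in truth]

eigvec = leading_eigenvector(center(contacts))
# The overall sign of an eigenvector is arbitrary; here bin 0 is assumed to be A.
if eigvec[0] < 0:
    eigvec = [-x for x in eigvec]
compartments = "".join("A" if x > 0 else "B" for x in eigvec)
print(compartments)
```

The sign only separates two groups, which is why finding subcompartments requires a different tool such as clustering.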

eigenvector

Methodology and Challenges

• Methodology: Use existing clustering techniques from data science to find subcompartments in Hi-C

• Challenges:

  • Process Hi-C data into a form amenable to existing clustering algorithms
  • Assess the quality of clusters
  • Assess the number of clusters


What is clustering?


(MathWorks)

• Clusters are tightly packed groups of points in space
• Clustering is the process of algorithmically finding these groups of points

Clustering algorithms

• K-means
  • k centroids
  • each point is in the cluster with the nearest centroid (mean)
  • minimizes the variance (squared Euclidean distance) within each cluster
• Agglomerative
  • start with each point in its own cluster
  • repeatedly merge a pair of clusters by some linkage criterion (single, Ward, average) until k clusters are reached
• Spectral
  • create an affinity matrix, compute its eigenvectors, and use k-means to cluster the points in eigenvector coordinates
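The K-means description above can be sketched as Lloyd's algorithm in plain Python. This is an illustrative sketch only: the function names, the random initialization from data points, and the six toy points are assumptions for the example, and an analysis on real Hi-C data would more likely rely on an established library implementation.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100, seed=0):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Two well-separated toy groups (hypothetical data).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.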


(images from Wikipedia)

K-means algorithm

Agglomerative algorithm

(finding k clusters in a set of n points in space)

Example of Clusters (Wikipedia)

Creating a matrix for clustering

“odd-even matrix”

Even chromosomes

chr1 trans

chr4 …


• Problem with cis interactions: the distance-dependent scaling of contact probability and TADs interfere with the compartmentalization signal
• Construct a matrix of all odd chromosomes vs. all even chromosomes (all trans interactions). This technique was used in Rao et al. (2014).
• Clustering rows: treat rows as points in thousand-dimensional space and columns as dimensions
• Clustering columns: treat columns as points and rows as dimensions
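A minimal sketch of the odd-even construction, under stated assumptions: `contacts` is a hypothetical genome-wide bin-by-bin matrix, `bin_chroms` gives the (numeric) chromosome of each bin, odd chromosomes are placed on the rows and even chromosomes on the columns, and sex chromosomes are ignored. None of these names or layout choices come from the slides.

```python
def odd_even_matrix(contacts, bin_chroms):
    """Keep only rows from odd chromosomes and columns from even ones,
    so that every remaining entry is a trans (inter-chromosomal) contact."""
    odd_bins = [i for i, c in enumerate(bin_chroms) if c % 2 == 1]
    even_bins = [j for j, c in enumerate(bin_chroms) if c % 2 == 0]
    matrix = [[contacts[i][j] for j in even_bins] for i in odd_bins]
    return matrix, odd_bins, even_bins

# Toy genome (hypothetical): six bins on chromosomes 1, 1, 2, 2, 3, 4.
bin_chroms = [1, 1, 2, 2, 3, 4]
# Symmetric toy values chosen so each entry encodes its own bin pair.
contacts = [[10 * min(i, j) + max(i, j) for j in range(6)] for i in range(6)]

matrix, odd_bins, even_bins = odd_even_matrix(contacts, bin_chroms)

# Clustering rows: each row is one point, each column one dimension.
row_points = [list(row) for row in matrix]
# Clustering columns: transpose, so each column becomes one point.
column_points = [list(col) for col in zip(*matrix)]
print(row_points)
print(column_points)
```

Because rows and columns come from different chromosomes, they are clustered separately and receive separate label sets.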

Visualization of cluster labels

Column cluster labels (sorted)


1. Run clustering procedure on rows and columns
2. Sort matrix by row and column according to cluster label
3. Plot matrix
4. Add clustering labels to the top and left
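The sorting step in this procedure can be sketched as follows. The function name and the 4x4 toy matrix are assumptions for illustration, and the plotting steps are left out because they depend on the plotting library in use.

```python
def sort_by_labels(matrix, row_labels, col_labels):
    """Reorder rows and columns so that members of the same cluster are adjacent."""
    row_order = sorted(range(len(row_labels)), key=lambda i: row_labels[i])
    col_order = sorted(range(len(col_labels)), key=lambda j: col_labels[j])
    sorted_matrix = [[matrix[i][j] for j in col_order] for i in row_order]
    sorted_rows = [row_labels[i] for i in row_order]
    sorted_cols = [col_labels[j] for j in col_order]
    return sorted_matrix, sorted_rows, sorted_cols

# Toy checkerboard (hypothetical): an entry is 1 when the row and column
# belong to the same cluster, 0 otherwise.
row_labels = [0, 1, 0, 1]
col_labels = [1, 0, 1, 0]
matrix = [[int(r == c) for c in col_labels] for r in row_labels]

sorted_matrix, sorted_rows, sorted_cols = sort_by_labels(matrix, row_labels, col_labels)
for row in sorted_matrix:
    print(row)
```

After sorting, the interleaved checkerboard turns into solid blocks, which is what makes the cluster structure visible in the plotted matrix.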

Eigendecomposition vs. clustering (k=2)

Panels: Agglomerative (average linkage), K-means, Spectral


Labels obtained from clustering

Labels obtained from eigenvector sign

• Sort by cluster label, but show eigenvector sign on bottom and right (A: +/red, B: -/blue)

• Each clustering method is able to find the two compartments

Stability

Diagram: cluster the original data and a perturbed copy of it (specify k), then compare the two sets of clusters with the adjusted Rand index (ARI).

• 5% Gaussian noise
• 75% subsample

Stability metric from von Luxburg (2010).
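A minimal sketch of such a stability check, with several assumptions stated up front: "75% subsample" is read as keeping a random 75% of the points and comparing labels on the retained points, "5% Gaussian noise" is read as noise with a standard deviation of 5% of the data's standard deviation, and `threshold_cluster` is a deliberately trivial stand-in for a real clustering method. The function names are hypothetical.

```python
import random
from collections import Counter
from math import comb
from statistics import mean, pstdev

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same items (1.0 means identical grouping)."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / comb(n, 2)
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        return 1.0
    return (same_both - expected) / (maximum - expected)

def threshold_cluster(points):
    """Stand-in clusterer (assumption): split on the mean of the first coordinate."""
    cut = mean([p[0] for p in points])
    return [int(p[0] > cut) for p in points]

def stability(points, cluster_fn, rng, trials=20, keep=0.75, noise=0.05):
    """Mean ARI between the original clustering and clusterings of perturbed data."""
    base = cluster_fn(points)
    scale = pstdev([x for p in points for x in p])
    sub_scores, noise_scores = [], []
    for _ in range(trials):
        # Subsampling: recluster a random subset, compare on the kept points.
        kept = sorted(rng.sample(range(len(points)), int(keep * len(points))))
        sub_labels = cluster_fn([points[i] for i in kept])
        sub_scores.append(adjusted_rand_index([base[i] for i in kept], sub_labels))
        # Noise: recluster after adding Gaussian noise to every coordinate.
        noisy = [[x + rng.gauss(0.0, noise * scale) for x in p] for p in points]
        noise_scores.append(adjusted_rand_index(base, cluster_fn(noisy)))
    return mean(sub_scores), mean(noise_scores)

# Two well-separated toy groups (hypothetical data).
points = [[0.1 * i, 0.0] for i in range(8)] + [[10 + 0.1 * i, 1.0] for i in range(8)]
subsample_score, noise_score = stability(points, threshold_cluster, random.Random(0))
print(subsample_score, noise_score)
```

Any function that maps a list of points to a list of labels can be passed in place of `threshold_cluster`, so the same check applies to K-means, agglomerative, or spectral clustering at a chosen k.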

K-means


Panels: k=2, k=3, k=4, k=5, k=6, k=7

• k<=4 is most stable

• k=6, k=7 clusters look very similar

subsampling

Agglomerative

• Linkage measures the distance between clusters
• Average linkage is the most stable
• Stability increases as k gets larger
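To make the role of linkage concrete, here is an unoptimized sketch of agglomerative clustering with average linkage, where the distance between two clusters is the mean of all pairwise point distances. The function names and toy points are assumptions for the example; it is not the implementation used for the results on the slides.

```python
import math

def average_linkage(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(points, k):
    """Start with singleton clusters and merge the closest pair until k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = [
            (a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        a, b = min(
            pairs,
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Two well-separated toy groups (hypothetical data).
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = agglomerative(points, k=2)
print(labels)
```

Swapping `average_linkage` for the minimum pairwise distance would give single linkage, which is the only change needed to compare linkage criteria in this sketch.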

Plots: subsampling, noise

Agglomerative (average linkage)


Panels: k=5, k=10, k=20, k=30, k=40

• k >= 20: four clusters found, similar to K-means with k=4

Spectral


Panels: k=5, k=4

• k=2 to 7 most stable
• A/B compartments not replicated

subsampling

Dimensionality reduction

subsampling

noise

• Dimensionality reduction using principal component analysis (PCA) improves stability of K-means

• Parallel coordinates plot: 4 clusters have distinct positions in 25-dimensional space (each zig-zag line represents a point)
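A rough sketch of PCA as a preprocessing step before clustering, written with power iteration and deflation so that it needs only the standard library. The function name, the number of components, and the toy points are assumptions for illustration; the slides do not say how PCA was computed, and a numerical library would normally be used on real data.

```python
import math

def pca(points, n_components, iterations=500):
    """Project points onto their top principal components."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - means[j] for j in range(d)] for p in points]
    # Sample covariance matrix of the centered data (d x d).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    components = []
    for _ in range(n_components):
        vec = [1.0 / math.sqrt(d)] * d
        for _ in range(iterations):
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(
            vec[a] * sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)
        )
        components.append(vec)
        # Deflation: remove the direction just found before the next search.
        cov = [
            [cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(d)]
            for a in range(d)
        ]
    projected = [
        [sum(row[j] * comp[j] for j in range(d)) for comp in components]
        for row in centered
    ]
    return projected, components

# Toy 3-D points (hypothetical) that vary mostly along the direction (1, 1, 0).
points = [[t, t, z] for t, z in zip([-2, -1, 0, 1, 2], [0.1, -0.1, 0.1, -0.1, 0.1])]
projected, components = pca(points, n_components=1)
print(projected)
```

The reduced points in `projected` would then be passed to the clustering step in place of the original high-dimensional rows.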

K-means (4)

Negative (B) Positive (A)


• Supported by stability analysis
• Replicates the results of agglomerative clustering

Eigenvector:

ChIP-seq Hi

Repli-seq


• A2, B1, B2 deficient in H4K20me1
• B regions generally deficient in all proteins

Histones (Merriam-Webster)

• Replication order: A1, A2, B1 = B2

ChromHMM states


• ChromHMM (hidden Markov model) run on ChIP-seq data classifies DNA into 15 states (promoters, enhancers, quiescent, transcription start sites, etc.)

• B1 mainly made up of Quies state (light gray)

ChromHMM states

Conclusion

• Demonstrated that there are four nuclear subcompartments with distinctive features

• Framework for clustering Hi-C data

Framework diagram labels: odd-even matrix, clusters, # of clusters, stability, visual inspection, eigenvector, ChIP-seq, ChromHMM, Repli-seq

Future Work

• Compare to physics models of chromatin
• Write a software pipeline to find the compartments (A1, A2, B1, B2) given any Hi-C matrix
• More sophisticated graph-based clustering techniques (simulated annealing)
• More fine-grained analysis at resolutions finer than 1 Mb


Acknowledgements

• ENCODE project
  • Hi-C matrices, ChIP-seq, Repli-seq, ChromHMM
• Sameer Abraham for mentoring me
• Martin Falk and Prof. Leonid Mirny for inviting me to MIT MirnyLab and providing me with computational resources
• Dr. Slava Gerovitch and Prof. Srini Devadas for inviting me to MIT PRIMES-CS


ChIP-seq/Repli-seq enrichments compared to Rao et al. clusters


Rao et al. (2014)

Our subcompartments

Distribution of logged trans contact probabilities


Adjusted Rand Index (ARI)

• X and Y are two clusterings of the same set. Each X_i and each Y_j represents a cluster.

• ARI measures similarity between the two clusterings


Contingency table (Wikipedia)

ARI formula (Wikipedia)
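For reference, the standard definition of the ARI can be written out from the contingency table. The notation is an assumption chosen to match the usual presentation: n_ij is the number of items shared by cluster X_i and cluster Y_j, a_i and b_j are the row and column sums of the table, and n is the total number of items.

```latex
n_{ij} = |X_i \cap Y_j|, \qquad
a_i = \sum_j n_{ij}, \qquad
b_j = \sum_i n_{ij}

\mathrm{ARI} =
\frac{\displaystyle \sum_{i,j} \binom{n_{ij}}{2}
      - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}
     {\displaystyle \frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right]
      - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] \Big/ \binom{n}{2}}
```

The subtracted term is the value expected for random labelings with the same cluster sizes, so the ARI is 1 for identical clusterings and close to 0 for unrelated ones.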

NIPBL vs. Untreated (UNTR)

• Mouse liver cells
• NIPBL is a cohesin-loading protein thought to be responsible for loop extrusion


ChromHMM


Specify number of chromatin states

(Wikipedia)

Observations: ChIP-seq tracks

Agglomerative (single linkage, k=5)


Comparison with human cells


GM12878 (Human cell line) NIPBL factor removed

Mouse liver

Effect of Dimensionality Reduction
