Supervised Learner for the Prediction of Hi-C Interaction Counts and Determination of Influential Features
Tyler Derr @ Yue Lab
tsd5037@psu.edu
Background
● Hi-C is a chromosome conformation capture (3C) based technology that outputs the number of interactions between loci at genome-wide scale. [3]
● Recent 3D prediction software such as BACH [1] and PASTIS [5] can use Hi-C data to produce 3D genome structures.
○ BACH: “It utilizes a Poisson model that better fits the count data generated from Hi-C experiments than the Gaussian model used in MCMC5C[4]...”[1]
○ PASTIS: [5] presents 4 methods, 2 of which are based on a Poisson model.
● Thanks to the recent efforts of the ENCODE and Roadmap Epigenomics projects, we have access to the following data per region (40kb resolution):
○ GC content, mappability, number of HindIII cut sites, Pol II, and 6 histone modifications such as H3K36me3
Basis of our research
● Current software such as BACH [1] and PASTIS [5] that predicts 3D genome structures from Hi-C data has trouble dealing with the biases induced by the data collection techniques.
○ Hi-C data collection is time consuming and expensive, and it has known biases.
○ Dr. Ming Hu (the creator of BACH [1]) appears to have addressed the biases by taking into account the enzyme cutting frequency, GC content, and sequence uniqueness when making his 3D predictions.
○ However, Dr. Hu has recently stated that, according to a recent Nature paper, the Poisson distribution assumption (which is crucial to BACH) is not appropriate for Hi-C data, which would invalidate any approach that relies on it. [2]
● Can we use machine learning techniques not only to alleviate the bias, but perhaps also to predict the Hi-C data?
Predicting Hi-C
We present two methods:
Method 1: Using the entire Hi-C matrix as training data for a single Random Forest (RF) and also a single Artificial Neural Network (ANN)
Method 2: Creating a separate RF for each diagonal of the matrix (i.e., any given RF is trained only on region pairs at a fixed distance; e.g., RF_2 is trained on all region pairs that are 2 regions apart, or 80kb)
The reasoning behind Method 2 is that it shows which features are more meaningful for prediction at different distances.
We use mESC mm9 chromosomes to train and validate our models
Data Features used to Learn Hi-C: 10 for each 40kb region
GC content, number of HindIII cut sites, mappability, H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, Pol II, and CTCF
Method 1: Using RF and ANN
● The training input for predicting the interaction of two regions rI and rJ consists of the 10 features of both regions
○ plus an additional feature for the distance between the regions
○ [rI.GC, rI.HindIII, … , rI.CTCF, rJ.GC, rJ.HindIII, … , rJ.CTCF, distance]
● We use the above features to predict the Hi-C interaction value between rI and rJ for all pairs of regions in the chromosome.
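The Method 1 input layout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the feature key names, the choice to express distance in kb, the inclusion of i = j pairs, and the toy data are all hypothetical and not taken from the slides.

```python
from typing import Dict, List, Tuple

# Order of the 10 per-region features listed on the slide (key names are hypothetical).
FEATURES = ["GC", "HindIII", "mappability", "H3K4me1", "H3K4me3",
            "H3K27ac", "H3K27me3", "H3K36me3", "PolII", "CTCF"]
BIN_SIZE_KB = 40  # region size stated on the slides


def pair_vector(regions: List[Dict[str, float]], i: int, j: int) -> List[float]:
    """Return [rI features, rJ features, distance] for regions i and j."""
    r_i = [regions[i][name] for name in FEATURES]
    r_j = [regions[j][name] for name in FEATURES]
    # Assumption: distance is expressed in kb; the slides do not give the unit.
    distance_kb = float(abs(j - i) * BIN_SIZE_KB)
    return r_i + r_j + [distance_kb]


def build_training_set(regions: List[Dict[str, float]],
                       hic: List[List[float]]) -> Tuple[List[List[float]], List[float]]:
    """Pair every region pair (i <= j) with its Hi-C interaction value."""
    X, y = [], []
    n = len(regions)
    for i in range(n):
        for j in range(i, n):
            X.append(pair_vector(regions, i, j))
            y.append(hic[i][j])
    return X, y


# Toy data, made up for illustration: 3 regions and a symmetric 3x3 Hi-C matrix.
toy_regions = [{name: float(k + 1) for name in FEATURES} for k in range(3)]
toy_hic = [[9.0, 4.0, 1.0],
           [4.0, 8.0, 3.0],
           [1.0, 3.0, 7.0]]
X, y = build_training_set(toy_regions, toy_hic)
```

Each row of X has 21 values (10 + 10 + distance). X and y could then be passed to any regressor, for example scikit-learn's RandomForestRegressor, although the slides do not say which implementation was used.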
Method 2: Using RF
● We use the above features to predict the Hi-C interaction value between rI and rJ for all pairs of regions at a specific distance in the chromosome.
● e.g., to train model RF_2 we use all pairs of regions that are 80kb apart
● The input for predicting the interaction of two regions rI and rJ consists of the 10 features of both regions (the distance is not used)
○ [rI.GC, rI.HindIII, …, rI.CTCF, rJ.GC, rJ.HindIII, …, rJ.CTCF] ➔ Interaction of rI, rJ
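A minimal sketch of how the per-diagonal training sets for Method 2 might be assembled. The function name, feature keys, and toy data are hypothetical. Offset k stands for region pairs that are k regions apart, so k = 2 corresponds to RF_2 (80kb at 40kb resolution).

```python
from typing import Dict, List, Tuple

# Hypothetical key names for the 10 per-region features listed on the slide.
FEATURES = ["GC", "HindIII", "mappability", "H3K4me1", "H3K4me3",
            "H3K27ac", "H3K27me3", "H3K36me3", "PolII", "CTCF"]


def training_sets_by_offset(regions: List[Dict[str, float]],
                            hic: List[List[float]],
                            max_offset: int) -> Dict[int, Tuple[list, list]]:
    """Build one (X, y) training set per diagonal offset k = j - i."""
    sets = {}
    n = len(regions)
    for k in range(1, max_offset + 1):
        X, y = [], []
        for i in range(n - k):
            j = i + k
            r_i = [regions[i][name] for name in FEATURES]
            r_j = [regions[j][name] for name in FEATURES]
            X.append(r_i + r_j)   # 20 values; no distance feature
            y.append(hic[i][j])   # interaction of rI, rJ
        sets[k] = (X, y)
    return sets


# Toy data, made up for illustration: 4 regions and a symmetric 4x4 Hi-C matrix.
toy_regions = [{name: float(k + 1) for name in FEATURES} for k in range(4)]
toy_hic = [[9.0, 5.0, 2.0, 1.0],
           [5.0, 9.0, 4.0, 3.0],
           [2.0, 4.0, 9.0, 6.0],
           [1.0, 3.0, 6.0, 9.0]]
by_offset = training_sets_by_offset(toy_regions, toy_hic, max_offset=3)
X_rf2, y_rf2 = by_offset[2]  # training data for RF_2 (pairs 80kb apart)
```

One model would then be fit per key of by_offset, which lets feature importances be compared across distances.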
What we have so far...
● Method 1: Training on all pairs of regions from chr1 and testing our model with all pairs from chr2
○ RMSE=2.309 & R-squared=0.869
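For reference, the two reported metrics can be computed as below. This is a sketch under an assumption: the slides do not say which definition of R-squared was used, and the coefficient of determination is taken here. The example values are made up.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between real and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up example values, not data from the slides.
real = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
example_rmse = rmse(real, predicted)      # sqrt(1/3)
example_r2 = r_squared(real, predicted)   # 1 - 1/2
```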
What we have planned for the near future:
○ Performing a leave-one-out cross validation using all the mESC mm9 chromosomes
○ Using higher resolution 1kb region sizes
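One way to read the planned cross validation is leave-one-chromosome-out: train on all chromosomes but one, then test on the held-out one. The sketch below only generates the folds. That reading, the function name, and the restriction to the 19 mm9 autosomes are all assumptions.

```python
from typing import Iterator, List, Tuple


def leave_one_chromosome_out(chromosomes: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training chromosomes, held-out chromosome) for each fold."""
    for held_out in chromosomes:
        train = [c for c in chromosomes if c != held_out]
        yield train, held_out


# Assumption: the 19 mouse autosomes; the slides do not say whether chrX/chrY are used.
mm9_chromosomes = [f"chr{k}" for k in range(1, 20)]
folds = list(leave_one_chromosome_out(mm9_chromosomes))
```

Each fold would train a model on the pairs from the training chromosomes and report RMSE and R-squared on the held-out chromosome.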
Scatter Plot of Real vs Predicted Hi-C Data
[Figure: scatter plot; x-axis: Predicted Interaction Values, y-axis: Real Interaction Values; axis ticks at 300 and 600]
3D Structure of mESC mm9 chr2
[Figure: two panels, Using Predicted Hi-C and Using raw Hi-C]
3D models generated using the PASTIS (MDS) 3D prediction software [5]
Coloration corresponds to the distance from the starting point of the chr (blue, cyan, green, yellow, orange, red) [2]
Hi-C Heatmaps of mm9 Chr2 (Entire Chr)
[Figure: two panels, Predicted Data and Real Data]
Hi-C Heatmaps of mm9 Chr2 (0 - 40Mbp)
[Figure: two panels, Predicted Data and Real Data]
Feature Importances
Another part of our project is to determine which of the 10 features are more meaningful in determining the interaction between loci.
Question: Are there differences in which features are more significant for the Hi-C values of paired regions that are close together compared to those that are far apart?
40kb:
H3k36me3_norm = 0.3571
HindIII = 0.2871
Map = 0.1062
H3k27ac_norm = 0.0505
POL2_norm = 0.0453
H3k4me1_norm = 0.0359
H3k27me3_norm = 0.0358
GC = 0.0295
CTCF_norm = 0.0269
H3k4me3_norm = 0.0258
totals = 100.03%

2Mbp:
HindIII = 0.238
Map = 0.1686
H3k27me3_norm = 0.0944
POL2_norm = 0.0862
GC = 0.0794
H3k36me3_norm = 0.0721
CTCF_norm = 0.0711
H3k4me3_norm = 0.0642
H3k4me1_norm = 0.0632
H3k27ac_norm = 0.0606
totals = 99.78%
Using Method 2: Feature importances (in sorted order) for predicting the interaction between regions that are 40kb vs 2Mbp apart
Note: These values are obtained by analyzing the Decision Trees in a Random Forest model.
The feature importances are calculated by randomly permuting the values of a single feature among the training instances. The larger the change in prediction accuracy between using the correct feature values and the permuted values, the more meaningful/important the feature is for the prediction.
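The permutation idea described in the note can be sketched independently of any particular model. In this illustration the error measure (mean squared error), the number of repeats, the toy model, and the toy data are all assumptions. The slides do not state how the reported scores were normalized, so raw error increases are returned here.

```python
import random
from typing import Callable, List, Sequence


def mse(predict: Callable[[List[float]], float],
        X: Sequence[List[float]], y: Sequence[float]) -> float:
    """Mean squared error of predict over the rows of X."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[List[float]], float],
                           X: Sequence[List[float]], y: Sequence[float],
                           n_repeats: int = 10, seed: int = 0) -> List[float]:
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Copy X with only this column replaced by its shuffled values.
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled)]
            increases.append(mse(predict, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row: List[float]) -> float:
    """Hypothetical model that depends on feature 0 only."""
    return 2.0 * row[0]


# Made-up data: feature 0 drives the target, feature 1 is irrelevant.
toy_X = [[float(i), float((i * 7) % 5)] for i in range(20)]
toy_y = [2.0 * row[0] for row in toy_X]
scores = permutation_importance(toy_predict, toy_X, toy_y)
```

Here shuffling feature 0 raises the error while shuffling feature 1 leaves it unchanged, so feature 0 receives the larger score.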
Future Work Idea:
Use data mining techniques to learn more about the correlation of features (and also pairs of features) with the Hi-C interaction values
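As a simple starting point for this idea, the Pearson correlation between a feature and the interaction values could be computed as below. The slides do not name a specific technique, so Pearson correlation is a stand-in chosen for illustration, and the values are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values: one feature per region pair and the matching Hi-C value.
pair_feature = [0.1, 0.4, 0.5, 0.9]
interaction = [1.0, 4.0, 5.0, 9.0]
r = pearson(pair_feature, interaction)
```

The same function could be applied to a derived column, such as the product of two features, to look at pairs of features.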
Thank you
References
[1] Hu, Ming, et al. "Bayesian inference of spatial organizations of chromosomes." PLoS Computational Biology 9.1 (2013): e1002893.
[2] Kuang, Simon. 2014 Google Science Fair Poster.
[3] Lieberman-Aiden, Erez, et al. "Comprehensive mapping of long-range interactions reveals folding principles of the human genome." Science 326.5950 (2009): 289-293.
[4] Rousseau, Mathieu, et al. "Three-dimensional modeling of chromatin structure from interaction frequency data using Markov chain Monte Carlo sampling." BMC Bioinformatics 12.1 (2011): 414.
[5] Varoquaux, Nelle, et al. "A statistical approach for inferring the 3D structure of the genome." Bioinformatics 30.12 (2014): i26-i33.