Journal report: High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions

Paper by: Phaedra Agius, Aaron Arvey, William Chang, William Stafford Noble, Christina Leslie

Memorial Sloan-Kettering Cancer Center, NY

Presented by Yaron Orenstein for ACGT group meeting, 19 January 2011

Introduction – Biological Background

• Gene regulatory programs are orchestrated by transcription factors (TFs).

• These proteins usually bind to binding sites (BSs) in the promoter region and enable or impede transcription of the gene.

• Accurately modeling the DNA sequence preferences of TFs is a key piece in unraveling the regulatory code.

Modeling BSs: PSSM model

• The most popular model to represent binding sites is the PSSM: position specific scoring matrix.

• These motifs may match thousands of sites in intergenic regions, producing an unreliable list of potential TF target genes.

Position   1     2     3     4     5     6
A          0.1   0.8   0     0.7   0.2   0
C          0     0.1   0.5   0.1   0.4   0.6
G          0     0     0.5   0.1   0.4   0.1
T          0.9   0.1   0     0.1   0     0.3
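
To make the PSSM idea concrete, here is a minimal Python sketch that scores a candidate site with the example matrix above. It assumes the entries are per-position base probabilities (each column sums to 1) and uses a plain product of probabilities rather than log-odds; the function names and the test sequence are illustrative and not taken from the paper.

```python
# Position-specific probabilities copied from the slide's example matrix
# (rows: bases, columns: motif positions 1-6).
PSSM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
MOTIF_LENGTH = 6


def site_probability(site):
    """Product of per-position probabilities for a site of motif length."""
    if len(site) != MOTIF_LENGTH:
        raise ValueError("site must have the same length as the motif")
    prob = 1.0
    for position, base in enumerate(site):
        prob *= PSSM[base][position]
    return prob


def best_site(sequence):
    """Slide the motif along the sequence; return (best probability, offset)."""
    windows = range(len(sequence) - MOTIF_LENGTH + 1)
    return max((site_probability(sequence[i:i + MOTIF_LENGTH]), i) for i in windows)


# Hypothetical inputs: one exact-length site and one longer sequence.
consensus_probability = site_probability("TACACC")
best_probability, best_offset = best_site("GGTACACCTT")
```

Because each position is scored independently, a short matrix like this one assigns a non-zero score to many sequences, which is one way to see why PSSM scans produce long lists of candidate sites.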

All possible 8-mers model

• This model contains a list of all possible 8-mers ranked by the TF preference.

• This information can be obtained, for example, from PBM data by calculating an enrichment score for each 8-mer.

• The disadvantage is clearly its large size and uninterpretability. In addition, the sequence similarities between 8-mers are not considered.

Protein Binding Microarray data

• A PBM array contains ~41,000 probe sequences of length 35 bp each, covering all possible DNA 10-mers.

• For each probe the binding intensity is reported.

Support vector regression

• Motivation: predict real values based on a feature set.

• Given a training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$, find a function f which best predicts y.

• For example, if f is linear, then f(x) = <w,x>+b, where w is the set of feature weights.

• $\frac{1}{2}\|w\|^2$ is minimized under some error constraints.
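
The slide does not spell out the error constraints, so the following Python sketch assumes the standard epsilon-insensitive SVR formulation: a linear predictor f(x) = <w,x>+b and an objective that adds a weight-norm penalty to the errors falling outside a tube of width epsilon. The parameter names `epsilon` and `c` and the toy data are assumptions for illustration.

```python
def predict(w, b, x):
    """Linear model f(x) = <w, x> + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def svr_objective(w, b, xs, ys, epsilon=0.1, c=1.0):
    """0.5 * ||w||^2 plus c times the summed epsilon-insensitive losses."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    losses = sum(epsilon_insensitive_loss(y, predict(w, b, x), epsilon)
                 for x, y in zip(xs, ys))
    return regularizer + c * losses


# Toy training set with two features per example (made-up values).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.5]
w, b = [2.0, -1.0], 0.0
objective = svr_objective(w, b, xs, ys)
```

A solver would search for the w and b that minimize this objective; the sketch only evaluates it for one fixed choice.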

Example for SVR

• A simple way to predict binding intensity from PBM data based on 8-mer features.

• Use indicator features for each 8-mer:
  – 1 if sequence x contains the 8-mer.
  – 0 if it does not.
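
A small Python sketch of this indicator encoding follows. It uses k = 3 so that the feature vector stays readable, ignores reverse complements, and uses a made-up sequence; all of these are simplifying assumptions.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Every possible k-mer over the alphabet, in lexicographic order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]


def indicator_features(sequence, kmers):
    """1 if the sequence contains the k-mer, 0 if it does not."""
    k = len(kmers[0])
    present = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
    return [1 if kmer in present else 0 for kmer in kmers]


# k = 3 keeps the demo small (64 features); 8-mers would give 4**8 = 65,536.
kmers = all_kmers(3)
features = indicator_features("ACGTAC", kmers)
```

The resulting 0/1 vector is what a linear SVR would take as x, with one learned weight per k-mer.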

An overview

Methods

• They developed a training strategy for the SVR model that involves three key components:

1. The choice of kernel.

2. The sampling procedure for selecting the most informative training sequences.

3. The feature selection method.

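The di-mismatch kernel

• Let $\Phi = \{\phi_1, \dots, \phi_p\}$ be a set of unique k-mers that occur in the set of training sequences.

• Define the set of substrings of length k in s (of length N): $\{s_i : i = 1, \dots, N-k+1\}$.

• Then s is represented by the feature vector $\left(\sum_i f(s_i, \phi_1), \dots, \sum_i f(s_i, \phi_p)\right)$, where the score $f(s_i, \phi_j)$ is derived from $d(s_i, \phi_j)$.

• And $d(s_i, \phi_j)$ counts the number of matching dinucleotides between $s_i$ and $\phi_j$.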
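
The feature map can be sketched in Python as below. The dinucleotide-match count follows the slide; the cutoff rule (a substring contributes only if at most m of its k-1 dinucleotides mismatch) is an assumption, since the slide does not state how the count is turned into a score. The k-mers and the sequence are toy values.

```python
def matching_dinucleotides(a, b):
    """Count positions i where both a[i] == b[i] and a[i+1] == b[i+1]."""
    return sum(1 for i in range(len(a) - 1)
               if a[i] == b[i] and a[i + 1] == b[i + 1])


def di_mismatch_features(s, kmers, m):
    """One entry per k-mer: summed dinucleotide-match scores over substrings of s.

    ASSUMPTION: a substring contributes only if at most m of its k-1
    dinucleotides mismatch; the slide does not state the cutoff.
    """
    k = len(kmers[0])
    substrings = [s[i:i + k] for i in range(len(s) - k + 1)]
    features = []
    for kmer in kmers:
        total = 0
        for sub in substrings:
            d = matching_dinucleotides(sub, kmer)
            if (k - 1) - d <= m:
                total += d
        features.append(total)
    return features


# Toy example with k = 5 and at most one mismatching dinucleotide allowed.
features = di_mismatch_features("ACGTACGT", ["ACGTA", "TTTTT"], m=1)
```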

Example for the di-mismatch kernel

• Two non-consecutive pairs of mismatches lead to a dinucleotide mismatch count of 6.

• Four consecutive mismatches lead to a count of 5.
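
These two counts can be checked with a short Python sketch. The 13-mer and the mutated positions are hypothetical; the only requirement is that the mismatches sit in the interior of the k-mer and that the two pairs are separated by matching bases.

```python
def dinucleotide_mismatches(a, b):
    """Number of adjacent position pairs (i, i+1) with a mismatch at i or i+1."""
    return sum(1 for i in range(len(a) - 1)
               if a[i] != b[i] or a[i + 1] != b[i + 1])


def mutate(kmer, positions):
    """Return a copy of kmer with a different base at each given position."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[base] if i in positions else base
                   for i, base in enumerate(kmer))


reference = "ACGTACGTACGTA"  # hypothetical 13-mer, so 12 dinucleotides
two_pairs = mutate(reference, {2, 3, 8, 9})      # two separated mismatch pairs
four_in_a_row = mutate(reference, {4, 5, 6, 7})  # four consecutive mismatches

count_two_pairs = dinucleotide_mismatches(reference, two_pairs)
count_four_in_a_row = dinucleotide_mismatches(reference, four_in_a_row)
```

Both mutated k-mers differ from the reference at four bases, yet the clustered mismatches disrupt fewer dinucleotides, so the kernel treats that k-mer as the closer match.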

Sampling PBM data to obtain an informative training set

• They selected the set of “positive” training probes to be those sequences associated with normalized binding intensities Z ≥ 3.5.

• If there were more than 500, they selected the top 500 ranked by their binding signals.

• The same number of “negative” training probes was selected from the other end of the distribution.
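
The sampling rule above can be sketched in Python as follows. The probe names and z-scores are made up, and the function signature is an illustration rather than the authors' code.

```python
def select_training_probes(probes, z_threshold=3.5, max_per_class=500):
    """Split (sequence, z_score) pairs into positive and negative training sets.

    Positives: probes with z >= z_threshold, capped at the top max_per_class.
    Negatives: the same number of probes from the low end of the ranking.
    """
    ranked = sorted(probes, key=lambda probe: probe[1], reverse=True)
    positives = [p for p in ranked if p[1] >= z_threshold][:max_per_class]
    n = len(positives)
    negatives = ranked[-n:] if n else []
    return positives, negatives


# Toy data: hypothetical probe names with made-up z-scores.
probes = [("p1", 5.2), ("p2", 4.0), ("p3", 3.4),
          ("p4", 0.1), ("p5", -1.3), ("p6", -2.0)]
positives, negatives = select_training_probes(probes)
```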

Feature Selection

• They selected the feature set to be those k-mers that are over-represented in either the “positive” or the “negative” probe class.

• They computed the mean di-mismatch score for each k-mer in each class and ranked features by the difference between these means.

• They used at most 4000 k-mers.
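
A minimal Python sketch of this ranking step is given below. It assumes the absolute difference between class means is used, so that k-mers enriched in either class rank highly; the k-mers and scores are toy values.

```python
def select_features(pos_scores, neg_scores, max_features=4000):
    """Rank k-mers by the gap between class means and keep the top ones.

    pos_scores / neg_scores map each k-mer to its list of di-mismatch scores
    over the positive / negative probes.
    ASSUMPTION: the absolute difference is used, so k-mers enriched in either
    class rank highly.
    """
    def mean(values):
        return sum(values) / len(values)

    gaps = {kmer: abs(mean(pos_scores[kmer]) - mean(neg_scores[kmer]))
            for kmer in pos_scores}
    ranked = sorted(gaps, key=gaps.get, reverse=True)
    return ranked[:max_features]


# Toy scores for three hypothetical k-mers over two probes per class.
pos_scores = {"ACGTA": [4, 4], "TTTTT": [0, 1], "GGGCC": [2, 2]}
neg_scores = {"ACGTA": [0, 1], "TTTTT": [3, 4], "GGGCC": [2, 1]}
selected = select_features(pos_scores, neg_scores, max_features=2)
```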

Results

• First, they tested how well they predict the ranking of probe sequences of one PBM array based on learning from another PBM array.

• They used the Top 100 metric: how many of the 100 probes with the highest measured intensity were also ranked in the top 100 by the model.

• They compared to PSSM and E-Score (full 8-mers list) models.
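
The Top 100 metric amounts to a set overlap between two rankings. A small Python sketch, using n = 3 and made-up probe values so the result is easy to follow:

```python
def top_n_overlap(measured, predicted, n=100):
    """How many of the n probes with the highest measured intensity are also
    among the n probes with the highest predicted score.

    Both arguments map probe id -> value.
    """
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:n])

    return len(top(measured) & top(predicted))


# Toy data: hypothetical probe ids with made-up intensities and predictions.
measured = {"p1": 9.0, "p2": 8.0, "p3": 7.0, "p4": 1.0, "p5": 0.5}
predicted = {"p1": 0.9, "p2": 0.2, "p3": 0.8, "p4": 0.7, "p5": 0.1}
overlap = top_n_overlap(measured, predicted, n=3)
```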

• The left scatter plot shows the detection of the top 100 probes using maximum E-scores (x-axis) and the SVR model (y-axis) in the prediction of in vitro TF binding preferences. Each point corresponds to one TF.

• The right panel is similar to the left, but compares the SVR versus PBM-derived PSSMs for the 114 mouse TFs.

Prediction of in-vivo occupancy

• They computed the binding occupancy using a sliding 36-mer window for scoring.

• They compared to:

1. PSSM. Log-odds scores were used.

2. E-score over a fixed threshold.

3. E-score based occupancy (using the median probe intensity of PBM probes containing the highest-scoring 8-mer pattern).

• Predicted binding profile for:
  – (left) yeast TF Ume6 along IGR iYFL022C
  – (right) yeast TF Gal4 along IGR iYFR026C

• They computed the detection of the top 200 intergenic regions (IGRs) by the top 200 predictions, where the top 200 “bound” IGRs were determined by their p-value ranking.

• Prediction of in vivo binding is weak to very poor (due to indirect and competitive binding as well as other factors).

• Still, in 8 out of 9 examples the SVR method outperforms the occupancy score method of Zhu et al. (2009).

• Against the PSSM model the result was 6 wins, 1 tie, 2 losses.

Testing on ChIP-seq data

• They selected 1000 confident peak regions (60bp each) and 1000 “negative” regions from flanking sequences (60bp regions 300bp away from the peaks).

• Model performance was measured by the area under the ROC curve (AUC), using the maximum SVR prediction score (over 36-mer windows) to rank ChIP-seq 60-mers.

• ROC = true positive rate vs. false positive rate.
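
The evaluation can be sketched in Python as below: each region is scored by its best window, and the AUC is computed as the fraction of (positive, negative) pairs that are ranked correctly, which equals the area under the ROC curve. The `gc_count` scorer is a hypothetical stand-in for the trained SVR, and the sequences and the window length of 4 are toy values.

```python
def max_window_score(sequence, score_window, window=36):
    """Score every window of the given length and keep the maximum."""
    return max(score_window(sequence[i:i + window])
               for i in range(len(sequence) - window + 1))


def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative
    (ties count one half); this equals the area under the ROC curve."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def gc_count(window_seq):
    """Hypothetical stand-in for the trained SVR: counts G and C bases."""
    return sum(base in "GC" for base in window_seq)


peaks = ["GGCCGGAT", "ATGCGCAT"]   # toy "bound" regions
flanks = ["ATATATAT", "AATGCTTA"]  # toy "unbound" regions
pos = [max_window_score(s, gc_count, window=4) for s in peaks]
neg = [max_window_score(s, gc_count, window=4) for s in flanks]
area = auc(pos, neg)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of peaks from flanking regions.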

SVRs trained on PBM arrays are able to capture ChIP-seq peaks better than PSSMs or the occupancy score.

Support Vector Machines

• Here we want to classify the data into two classes, i.e. the training set is $\{(x_1, y_1), \dots, (x_n, y_n)\}$ with labels $y_i \in \{-1, +1\}$.

Training discriminative models on ChIP-seq data

• Trained SVMs using the (13,5) parameters on 60-mer ChIP-seq peaks (positive sequences) and flanking negative sequences.

• Evaluation by computing AUCs on the same test sets of 1000 ChIP-seq peaks and 1000 flanking negative sequences using 10-fold cross-validation.

• Tested against Weeder and MDscan, which determine overrepresented k-mer and PSSM motifs, respectively.

• SVMs trained on ChIP-seq data capture sequence information from the genomic context of ChIP-seq peaks and improve in vivo prediction performance.

• There was no advantage to training regression models on ChIP-seq peaks labeled with real-valued occupancy.

PBM experiments may capture in vivo preference

• To investigate how some PBM-derived models capture two different binding motifs, they:

1. Clustered k-mer features based on their co-occurrence in the training sequences.

2. Projected highly weighted k-mers into 2 dimensions using principal component analysis (PCA).

• Two clusters were found, each representing a different motif.

• The SVR was trained on the features of each motif separately and the AUCs were 0.75 and 0.54.

• K-mers contributing to the (left) Oct4 PBM model and (right) Sox2 ChIP model, where each point represents a 13-mer and is colored according to its model weight. Star and circle point styles indicate different clusters.

• For the PBM-derived model, the clusters represent the primary and secondary binding motifs.

• For the ChIP-derived model, the clusters correspond to the motifs for Sox2 and its cofactor Oct4.

Summary

• A flexible new discriminative framework for learning TF binding models from high resolution in vitro and in vivo data.

• The SVR/SVM models better predict binding affinity and thus are more suitable for representing complex regulatory regions.

Possible directions to continue

1. Training jointly on PBM and ChIP-seq data for the same TF.

2. Develop multi-task training strategies for modeling the binding preferences of a class of structurally related TFs using features of the amino acid sequence.

3. Combine in vivo TF sequence preference models with data on chromatin state to predict TF target genes in new cell types.