Conference Report: Recomb Satellite NYC, Nov 2010


DREAM, Systems Biology and Regulatory Genomics


DREAM reverse engineering challenges

• The first part was DREAM, which is a reverse engineering competition.

• It was one day in which the best performers of the different challenges presented their solutions.

• I was invited to present as best performer of the bonus round of challenge 2.

• Compared with the other best performer, our method was equally accurate and much faster.

Reminder of the challenge

Our predictions against the gold standard

TF number   Gold Standard   ACGT team answer
TF_1        Ar              Ar
TF_2        Dbp             Ces2
TF_3        Foxo6           Foxf2
TF_4        Klf12           Klf16
TF_5        Klf8            Klf7
TF_6        Klf9            Mybl2
TF_7        Mlx             Myc
TF_8        Mzf1            Nfkb1
TF_9        Mzf1            Nfkb2
TF_10       Nfil3           Nfil3
TF_11       Nr2f6           Nr2f1
TF_12       Nr4a2           Pparg
TF_13       Pou2f1          Pou2f1
TF_14       Mypop           Pou3f2
TF_15       Pou1f1          Pou3f4
TF_16       Prdm11          Pou5f1
TF_17       Rorb            Rora
TF_18       Sox10           Sox2
TF_19       Sox3            Sox5
TF_20       Sox6            Sox9
TF_21       Srebf1          Srebf1
TF_22       Tbx2            Tbx1
TF_23       Tbx20           Tbx21
TF_24       Tbx4            Tbx5
TF_25       Tbx5            Tbx6
TF_26       Tcfec           Usf26
TF_27       Xbp1            Xbp1
TF_28       Zfp202          Zfp281
TF_29       Zfp263          Zfp691
TF_30       Zfp3            Zfpm1
TF_31       Zfx             Zfx
TF_32       Zkscan1         Zscan4e
TF_33       Zscan10         Zscan4f

TF number   Gold Standard   ACGT team answer
TF_34       Ahctf1          Arid3a
TF_35       Atf3            Atf1
TF_36       Atf4            Atf2
TF_37       Dnajc21         Cfl2
TF_38       Dmrtc2          Dmrt2
TF_39       Egr3            Egr1
TF_40       Esrrb           Esr2
TF_41       Esrrg           Esrra
TF_42       Foxc2           Foxj1
TF_43       Foxg1           Foxl1
TF_44       Gata4           Gata5
TF_45       Mybl2           Mybl1
TF_46       Nhlh2           Nhlh1
TF_47       Nkx2-9          Nkx3-2
TF_48       Nr2e1           Nr1h2
TF_49       Nr2f1           Ppard
TF_50       Nr5a2           Pparg
TF_51       Pou1f1          Prrx2
TF_52       Rarg            Rara
TF_53       Rfx7            Rfx4
TF_54       Rora            Rxra
TF_55       Sdccag8         Tbp
TF_56       Snai1           Tgif1
TF_57       Sp140           Tgif2
TF_58       Tbx1            Tgif2lx1
TF_59       Zbtb1           Tgif2lx2
TF_60       Zfp300          Xbp1
TF_61       Zfp637          Zfp128
TF_62       Zic5            Zic1
TF_63       Zkscan5         Zic2
TF_64       Zfp740          Zic3
TF_65       Zscan10         Zic4
TF_66       Zscan10         Mzf1

Systems Biology and Regulatory Genomics

• DREAM was followed by:

1. Systems Biology
   • Pathway inference and reverse engineering of cellular networks.
   • Cellular signatures of biological responses and disease states.
   • Phosphorylation, metabolic fluxes, systematic phenotyping.
   • Mathematical modeling and simulation of biological systems.

2. Regulatory Genomics
   • Modeling and recognition of regulatory motifs and modules.
   • Chromatin state establishment, maintenance, and role in development.
   • Post-transcriptional regulation and small regulatory RNAs.
   • Regulatory networks, metabolic networks, proteomic networks.

Computational identification of specific cis-regulatory elements using sequence and expression data

Rahul Karnik, Michael Beer

Department of Biomedical Engineering, Johns Hopkins University School of Medicine

Introduction

• Current approaches to motif finding typically consist of two steps:

1. Identification of sets of co-regulated genes based on their expression patterns, usually by clustering.

2. Searching for overrepresented sequence motifs in the upstream sequences of each set of related genes by Gibbs sampling or expectation maximization.

[Figure: Motif discovery: the two-step pipeline. A co-regulated gene set is defined by clustering gene expression microarrays (Cluster I, Cluster II, Cluster III), by location analysis (ChIP-chip, …), or by a functional group (e.g., GO term); the promoter/3'UTR sequences of that set are then passed to motif discovery.]

• The new algorithm, Inspector, integrates upstream sequence and expression data to find co-expressed genes with a sequence motif that is specific to that group of genes.

• Inspector addresses two limitations of the current approach:

1. An integrated model reduces the effect of noise in expression data.

2. Optimizing for specificity prevents the identification of ubiquitous sequence motifs as determinants of expression.

Algorithm

• Inspector is an iterative Gibbs sampling algorithm whose objective function is the specificity of the sequence motif to the genes in the current search set, i.e. the genes having similar expression profiles.

• Consider N total sequences, of which s1 have the motif, s2 are similarly expressed, and x are in the intersection of these two sets.

• The specificity score is the hypergeometric tail, i.e. the probability that at least x of the s1 motif-containing genes fall in the intersection by chance.

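[Figure: Venn diagram of the N sequences, showing the set S1 (sequences with the motif) and the set S2 (similarly expressed sequences) overlapping in ≥ x sequences.]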
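The hypergeometric tail described above can be written as P(X ≥ x) = Σ_{k=x}^{min(s1,s2)} C(s2,k) · C(N−s2, s1−k) / C(N, s1). The following is a minimal sketch of that computation using only the Python standard library. The function name and the example numbers are ours, chosen for illustration; this is not Inspector's actual code.

```python
from math import comb

def hypergeom_tail(N, s1, s2, x):
    """P(overlap >= x) when s1 of N sequences carry the motif and
    s2 of N sequences are similarly expressed (hypergeometric tail)."""
    upper = min(s1, s2)                      # the overlap cannot exceed either set
    favourable = sum(
        comb(s2, k) * comb(N - s2, s1 - k)   # ways to get exactly k in the overlap
        for k in range(max(x, 0), upper + 1)
    )
    return favourable / comb(N, s1)

# Illustrative numbers only (not from the talk): a lower value means a more specific motif.
print(hypergeom_tail(5000, 60, 50, 20))
```

A small value means the observed overlap is unlikely by chance, which is why the score is minimized.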

• The integrated model has two components:

1. The sequence model, a position weight matrix derived from candidate motif instances.

2. The expression model, the mean expression profile of the genes currently in the model.

• Sequence and expression thresholds are adjusted at regular intervals to minimize the specificity score (a tail probability, so a lower value indicates a more specific motif).

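[Figure: example of the integrated model.]

Expression component: 8.4 7.2 3.2 45 12 75 98 2.3 1.2 3.4 3.9 1.5 1.3 2.7 5.6

Sequence component (PWM):

Position  6    5    4    3    2    1
A         0    0.2  0.7  0    0.8  0.1
C         0.6  0.4  0.1  0.5  0.1  0
G         0.1  0.4  0.1  0.5  0    0
T         0.3  0    0.1  0    0.1  0.9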

The integrated model, which is composed of a sequence component and an expression component, is iteratively refined to maximize the objective function, the specificity of the motif.

• Initialization: a random gene and position are picked; the k-mer at that position gives the initial PWM, and the expression profile of the gene gives the initial expression model.

• Several initialization values are tried.

• The process halts when the specificity is no longer improved.

• The new model in each iteration is the average of the PWM and the expression profile.

• For expression profile similarity, they use the Pearson correlation coefficient.
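For reference, the Pearson correlation coefficient between two profiles u and v is r = Σ(u_i − ū)(v_i − v̄) / sqrt(Σ(u_i − ū)² · Σ(v_i − v̄)²). The following is a minimal standard-library sketch. The function name is ours, and the two example profiles are the mean and noisy profiles shown on the synthetic-data slide.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(u) != len(v) or len(u) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_u * norm_v)

mean_profile = [8, 7, 3, 45, 12, 75, 98, 2, 1, 3, 3, 1, 3, 2, 5]
gene_profile = [8.4, 7.2, 3.2, 45, 12, 75, 98, 2.3, 1.2, 3.4, 3.9, 1.5, 1.3, 2.7, 5.6]
print(pearson(mean_profile, gene_profile))
```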

The PWM match score

• The PWM match score is taken from AlignACE (Hughes J. et al. 98), except that the background model is a 3rd- to 5th-order Markov model (instead of 0th order).

• The score S is defined for a site Q, whose sequence as a function of position is given by q(p), for example AAAACCGTTCAGTCAGGTCATAGC, and a matrix M (next slide):

S ≈ log ∏ (frequency of q(p) in the PWM)

Fp,b is the number of bases of type b aligned at position p, N is the number of aligned sites, and pb is the genomic background nucleotide frequency for base b.

The first term corresponds to the log of the frequency of a given base at a particular position in the motif alignment, estimated with a Bayesian prior distribution corresponding to the genomic mononucleotide frequencies and a total pseudocount of 1.
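The first term described above can be sketched as follows: with a total pseudocount of 1 distributed according to the background frequencies, the estimated frequency of base b at position p is (Fp,b + pb) / (N + 1). This is only our reading of that one term. The remaining terms of the AlignACE score and the higher-order Markov background are not reproduced here, and the uniform background and the aligned sites are hypothetical.

```python
from math import log

BASES = "ACGT"

def count_matrix(sites):
    """F[p][b]: number of aligned sites with base b at position p."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all aligned sites must have the same length")
    counts = [{b: 0 for b in BASES} for _ in range(width)]
    for site in sites:
        for p, b in enumerate(site):
            counts[p][b] += 1
    return counts

def log_frequency_score(site, counts, n_sites, background):
    """Sum over positions of log((F[p][q(p)] + p_b) / (N + 1))."""
    return sum(
        log((counts[p][b] + background[b]) / (n_sites + 1))
        for p, b in enumerate(site)
    )

# Hypothetical aligned sites and a uniform background, for illustration only.
sites = ["ACG", "ACG", "ACT"]
background = {b: 0.25 for b in BASES}
F = count_matrix(sites)
print(log_frequency_score("ACG", F, len(sites), background))
```

Because of the pseudocount, a base never seen at a position still gets a small non-zero frequency, so the logarithm is always defined.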

[Figure: the example PWM. Each entry = Fp,b / N.]

Position  6    5    4    3    2    1
A         0    0.2  0.7  0    0.8  0.1
C         0.6  0.4  0.1  0.5  0.1  0
G         0.1  0.4  0.1  0.5  0    0
T         0.3  0    0.1  0    0.1  0.9

Synthetic Datasets Used

• The synthetic sequence dataset consisted of 5000 sequences divided into 80 sets of varying size.

• All the sequences in a set were seeded with one common functional motif and four ubiquitous motifs (picked randomly from 20 false motifs), with non-motif sequence having the same nucleotide frequencies as yeast intergenic sequence.

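[Figure: an example synthetic sequence, ACGTCAGTGCGATACGATGCTGAGCCTGGAAAAACCGTTCAGTCAGGTCATAGC, with segments labeled as drawn from the pool of false motifs and one segment labeled as the real motif.]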

• Every set of genes was assigned a mean expression profile across 50 conditions, corresponding to regulatory control by one functional motif.

• Each gene in a set was then assigned an expression profile around this mean profile with Gaussian noise. For example:

mean profile: 8 7 3 45 12 75 98 2 1 3 3 1 3 2 5
+ additive Gaussian noise
= gene profile: 8.4 7.2 3.2 45 12 75 98 2.3 1.2 3.4 3.9 1.5 1.3 2.7 5.6
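The construction of per-gene profiles can be sketched in a few lines. This is a minimal illustration, assuming independent zero-mean Gaussian noise at each condition. The noise level `sigma`, the number of genes, and the function name are hypothetical, since the talk did not state them.

```python
import random

def noisy_profiles(mean_profile, n_genes, sigma, seed=None):
    """Return n_genes expression profiles scattered around mean_profile
    by adding independent Gaussian noise (std sigma) to every condition."""
    rng = random.Random(seed)   # a seeded generator keeps the example reproducible
    return [
        [m + rng.gauss(0.0, sigma) for m in mean_profile]
        for _ in range(n_genes)
    ]

mean_profile = [8, 7, 3, 45, 12, 75, 98, 2, 1, 3, 3, 1, 3, 2, 5]
for profile in noisy_profiles(mean_profile, n_genes=3, sigma=0.5, seed=0):
    print([round(v, 1) for v in profile])
```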

Results

• Inspector performs better at detecting motifs in synthetic sequence and expression datasets than the combination of k-means clustering and AlignACE.

• The sequence dataset was created to mimic the base-pair composition and length of yeast intergenic sequence, while the expression data match the pairwise correlation characteristics of real yeast expression datasets.

• 1 - Specificity = false positive rate

• Sensitivity = true positive rate
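As a reminder of how these two quantities are computed from a confusion matrix, here is a small sketch. The counts in the example are made up for illustration and are not results from the talk.

```python
def roc_point(tp, fp, tn, fn):
    """Return (1 - specificity, sensitivity) for one confusion matrix."""
    sensitivity = tp / (tp + fn)            # true positive rate
    false_positive_rate = fp / (fp + tn)    # equals 1 - specificity
    return false_positive_rate, sensitivity

# Made-up counts: 8 of 10 true motifs found, 1 of 10 false motifs reported.
print(roc_point(tp=8, fp=1, tn=9, fn=2))
```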

Real Datasets Used

• Saccharomyces cerevisiae datasets
  • The sequence dataset was the upstream sequence for all yeast ORFs.
  • The expression dataset was a combination of three different original datasets (Brauer08, Gasch00, Spellman98) and profiled all yeast ORFs over 292 conditions.

• Caenorhabditis elegans datasets
  • The sequence dataset consisted of up to 2 kb of upstream sequence for 5691 genes.
  • The expression dataset was the same as that used by Beer and Tavazoie (2004). It contains 255 conditions.

• Inspector detects more known motifs than the combination of k-means clustering and AlignACE.

• There were 97 known motifs in total (Harbison 2004). A CompareACE score of 0.75 or greater was considered a match. ChIP target sets (Harbison04) were considered a match if the hypergeometric p-value for overlap was less than 10^-7.

The first is a known motif. The two others are new motifs in C. elegans, which are candidates for experimental validation.

• Inference of binding specificity from protein binding domains (work from the group of Tim Hughes at University of Toronto, presented by Matt Weirauch)

• This is an ambitious study to infer binding specificity of TFs in eukaryotes using protein domain similarity.

• It is well known that similar TFs (i.e. from the same TF family) have similar binding sites and binding specificities.

• The goal is to infer the binding specificity according to the binding domain of the TF and its similarity to other TFs whose binding motifs are known.

Studies in the field of motif finding

• Their aims are three-fold:

1. Use PBM data to refine and test rules for inference of TF sequence specificity.

2. Generate the data needed to produce accurate "Pfam-wide" inferences of sequence specificity for as many eukaryotic DNA-binding domain classes as possible.

3. Construct a database to house both known and inferred sequence preferences for eukaryotes with available genomic sequences.
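To make the idea of inferring specificity from domain similarity concrete, here is a toy nearest-neighbour sketch: a TF with an unknown motif inherits the motif of the most similar characterized TF, provided their DNA-binding domains are similar enough. This is our illustration only, not the rules developed by the Hughes group. The identity threshold, the pre-aligned toy domain sequences, the motif strings, and the function names are all hypothetical.

```python
def percent_identity(a, b):
    """Fraction of identical positions in two pre-aligned, equal-length domains."""
    if len(a) != len(b) or not a:
        raise ValueError("domains must be pre-aligned, non-empty, and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def infer_motif(query_domain, known, min_identity=0.7):
    """known maps TF name -> (aligned domain sequence, motif).
    Returns (tf_name, motif, identity) for the closest TF, or None if
    no characterized domain reaches min_identity (a hypothetical cutoff)."""
    best = None
    for name, (domain, motif) in known.items():
        identity = percent_identity(query_domain, domain)
        if identity >= min_identity and (best is None or identity > best[2]):
            best = (name, motif, identity)
    return best

# Toy data, invented for the example.
known = {
    "TF_A": ("ACDEFGHIKV", "TGACGTCA"),
    "TF_B": ("MCDEWGHIKV", "CACGTG"),
}
print(infer_motif("ACDEFGHIKL", known))
```

Returning None below the cutoff reflects the point that an inference is only trusted when the binding domains are close.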

• Discriminative motif finding (work from the group of Ziv Bar Joseph at CMU, presented by Shan Zhong)

• The algorithm looks for a motif that best discriminates between a positive set of sequences and a negative set (in which sequences are supposed not to contain the motif).

• They use a generative mixture model for k-mer distributions that can be viewed as a 0th-order HMM.

• The user specifies a motif length k.

• The method extracts all k-mers from the positive and negative sequences, and then searches for a position weight matrix that maximizes a discriminative target function.

• This function represents the difference in the expected number of times that the mixture component was used in the HMM to generate the positive and negative sequences.

• The running time is independent of the input sequence size (depends only on k).
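The general shape of such a method can be sketched as follows: count all k-mers once in each set, then score a candidate PWM using only the count tables. The target function below (the difference between the PWM-weighted k-mer frequencies of the two sets) is a simplified stand-in that we made up for illustration. It is not the authors' mixture-model objective, and all function names are hypothetical. Sequences are assumed to contain only A, C, G, and T.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def pwm_prob(kmer, pwm):
    """Probability of a k-mer under a PWM given as a list of {base: prob} dicts."""
    prob = 1.0
    for p, base in enumerate(kmer):
        prob *= pwm[p][base]
    return prob

def discriminative_score(pwm, pos_counts, neg_counts):
    """PWM-weighted k-mer frequency in the positive set minus the negative set."""
    pos_total, neg_total = sum(pos_counts.values()), sum(neg_counts.values())
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both sets must contain at least one k-mer")
    return sum(
        (pos_counts[kmer] / pos_total - neg_counts[kmer] / neg_total) * pwm_prob(kmer, pwm)
        for kmer in set(pos_counts) | set(neg_counts)
    )

def consensus_pwm(consensus, p=0.97):
    """Helper for the example: a PWM sharply peaked on a consensus string."""
    return [{b: (p if b == c else (1 - p) / 3) for b in "ACGT"} for c in consensus]

pos = kmer_counts(["ACGTACGT"], 3)
neg = kmer_counts(["TTTTTTTT"], 3)
print(discriminative_score(consensus_pwm("ACG"), pos, neg))
```

After the one-off counting pass, each evaluation of a candidate PWM touches only the count tables, whose size is bounded by 4^k. That is consistent with the running-time remark above.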

• Motif finding in mRNA UTRs (work from the group of Tim Hughes at University of Toronto, presented by Quaid Morris)

• The main contribution here is that a motif is not only represented by its sequence, but by its structural parameters as well.

• RNA has specific structure, which affects the protein's binding to it. The novelty here is that those structural features are incorporated in the model that represents the motif.

• MalaRKey is a new motif finding method that uses a feature-based product model to represent RBP binding affinity for a given site.

• The structural features are:

1. The site is in a hairpin loop.

2. The 1st base in the site is paired and the rest is in a hairpin loop.

3. Tendency of particular subsequences to share the same secondary structure context.

• A nice result they showed is that when the motif sequence is of length 4 there is a preference for binding to specific RNA structures; as the motif length increases, the preference decreases, and by length 7 there is almost no structural preference. This means that some of the information is encoded in the structure together with the sequence.
