Transcription factor binding motifs (part I) 10/17/07


Steps of gene transcription

[Figure: steps of gene transcription, with labels for an activator and Pol II]

The term “transcription factor” (TF) usually means an activator or repressor.

Understand Regulation

• Which TFs are involved in the regulation?

• Does a TF enhance / repress gene expression?

• Which genes are regulated by this TF?

• Are there binding partners or competitors for the TF?

• Why does disease arise when a TF goes wrong?


Sequence specificity of TF binding

Motif representation

• Consensus: GCGAA

• PWM (position weight matrix): alignment matrix, frequency matrix

• Logo
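The representations listed above can be derived from one another. The following is a minimal sketch, assuming a handful of made-up aligned sites (the `sites` list and all function names are hypothetical, not from the lecture): it counts bases per position (alignment matrix), normalizes the counts (frequency matrix), and reads off a consensus.

```python
from collections import Counter

BASES = "ACGT"

def alignment_matrix(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    matrix = []
    for i in range(width):
        column = Counter(site[i] for site in sites)
        matrix.append([column[b] for b in BASES])
    return matrix

def frequency_matrix(counts, pseudocount=0.0):
    """Turn per-position counts into per-position base frequencies."""
    freqs = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        freqs.append([(c + pseudocount) / total for c in row])
    return freqs

def consensus(counts):
    """Most frequent base per position (ties go to the first base in ACGT order)."""
    return "".join(BASES[row.index(max(row))] for row in counts)

# Hypothetical aligned binding sites, made up for illustration only.
sites = ["GCGAA", "GCGAT", "GCCAA", "ACGAA"]
counts = alignment_matrix(sites)
freqs = frequency_matrix(counts)
print(consensus(counts))
```

A pseudocount above zero is often useful in practice so that no frequency is exactly zero, which matters once logarithms of the frequencies are taken.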

Objectives of motif finding

• Known motif mapping
– Given a known motif, find all the matches over a query sequence.

• De novo motif discovery
– Both motif patterns and match positions are unknown
– Much harder

Known Motif Mapping

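• The matching score for a new sequence x = x_1 … x_w is given by

$$
\log_2 \frac{\Pr(x \mid \text{motif model})}{\Pr(x \mid \text{background model})}
= \log_2 \frac{\Pr(x \mid \theta_m)}{\Pr(x \mid \theta_0)},
\qquad
\Pr(x \mid \theta_m) = \prod_{i=1}^{w} p_{i, x_i}
$$

where $\theta_m = (p_{ij})$, $i = 1, \dots, w$; $j = A, C, G, T$, denotes the entries in the frequency matrix, and $\theta_0$ is the background model: p0(A), …, p0(T), or it can be a third-order Markov model (see next slide).

• Calculate the matching score for all genomic sequences. Motif sites correspond to the highest scores.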
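The log-odds matching score can be sketched in a few lines. This is an illustrative sketch only, assuming a 0th-order (independent-base) background and a made-up 3-column frequency matrix; `log_odds_score`, `scan`, and the numbers are hypothetical.

```python
import math

BASES = "ACGT"

def log_odds_score(segment, freqs, background):
    """log2 of Pr(segment | motif) / Pr(segment | 0th-order background)."""
    score = 0.0
    for i, base in enumerate(segment):
        j = BASES.index(base)
        score += math.log2(freqs[i][j] / background[base])
    return score

def scan(sequence, freqs, background):
    """Score every window of motif width; return (start, score) pairs."""
    w = len(freqs)
    return [(start, log_odds_score(sequence[start:start + w], freqs, background))
            for start in range(len(sequence) - w + 1)]

# Hypothetical frequency matrix (rows: positions; columns: A, C, G, T).
# All entries are above zero so the logarithm is always defined.
freqs = [[0.1, 0.1, 0.7, 0.1],
         [0.1, 0.7, 0.1, 0.1],
         [0.1, 0.1, 0.7, 0.1]]
background = {b: 0.25 for b in BASES}

hits = scan("ATGCGTA", freqs, background)
best_start, best_score = max(hits, key=lambda h: h[1])
print(best_start, best_score)
```

Summing per-position log ratios is equivalent to taking the log of the ratio of the two products, and avoids multiplying many small probabilities.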

Third-order Markov model

• The probability of generating a new base is dependent on the previous three bases.

3rd-order Markov dependency: $p(x_i \mid x_{i-3}\, x_{i-2}\, x_{i-1})$
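One way to estimate such a background model is to count, over a set of background sequences, how often each base follows each three-base context. The sketch below assumes a simple add-one pseudocount and toy training data; `train_markov3` and `sequence_probability` are hypothetical names.

```python
BASES = "ACGT"

def train_markov3(sequences, pseudocount=1.0):
    """Estimate p(next base | previous three bases) from background sequences."""
    counts = {}
    for seq in sequences:
        for i in range(3, len(seq)):
            row = counts.setdefault(seq[i - 3:i], dict.fromkeys(BASES, 0))
            row[seq[i]] += 1

    def prob(context, base):
        # Unseen contexts fall back to the pseudocounts alone (uniform).
        row = counts.get(context, dict.fromkeys(BASES, 0))
        total = sum(row.values()) + 4 * pseudocount
        return (row[base] + pseudocount) / total

    return prob

def sequence_probability(seq, prob):
    """Product of conditional probabilities for every base after the first three."""
    p = 1.0
    for i in range(3, len(seq)):
        p *= prob(seq[i - 3:i], seq[i])
    return p

# Toy background sequence, made up for illustration.
p3 = train_markov3(["ACGTACGTACGT"])
print(p3("ACG", "T"), p3("AAA", "T"))
```

The first three bases of a sequence have no full context here; a fuller implementation would score them with lower-order models.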

De novo motif discovery

• Statistical approach
– Identify sequence patterns that occur more frequently than expected at random.
– Target regions:
  • Promoter regions of co-regulated genes
  • Promoter regions of differentially expressed genes
  • Experimentally identified TF binding sites
– Very common

• Biophysical approach
– Calculate protein-DNA binding affinities from first principles.
– See Roider et al. 2006 for an example.

Methods

• PWM modeling
– MEME, GMS, AlignACE, BioProspector

• Word enumeration
– YMF, MDscan

• Use negative control
– REDUCE, Motif Regressor

• Comparative genomics
– MCS, CompareProspector, PhyloCon

• ChIP-chip (will discuss later)

The challenges

• No motif sites

• Multiple motif sites

• Variable relative positions

• Variable sequence pattern

MEME (Bailey and Elkan 1994)

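• Input
– A set of sequences: Y = {Yi}
– For a fixed length w, partition Y into overlapping w-mers: X = {Xi}
– An alphabet: A = {aj} = {A, C, G, T}

• Mixture model
– Motif model: $\theta_m = (p_{ij})$, $i = 1, \dots, w$; $j = A, C, G, T$
– Background model: $\theta_0$, 0th- or 3rd-order Markov
– Each w-mer is drawn from the mixture: $X \sim \lambda\,\theta_m + (1 - \lambda)\,\theta_0$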

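• Missing data: Z = {Zi}, where Zi = 1 if Xi is a motif site and Zi = 0 otherwise

• The log-likelihood is

$$
\log L(\theta, \lambda \mid X, Z) = \sum_i \Big[ Z_i \log\big(\lambda \Pr(X_i \mid \theta_m)\big) + (1 - Z_i) \log\big((1 - \lambda) \Pr(X_i \mid \theta_0)\big) \Big]
$$

• Select θ = (θ_m, θ_0) and the mixing proportion λ to maximize the log-likelihood, but how?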

Expectation-Maximization (EM)

• Iteratively update hidden states and parameter values. Commonly used in bioinformatics research.

• E-step:
– Under the current estimates of the model parameters (θ, λ) and the observed data, evaluate the expected value of the log-likelihood over the values of the missing data Z.

• M-step:
– Update the parameters so that the expected log-likelihood is maximized.

• Iterate the E- and M-steps until convergence.
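The E- and M-steps can be sketched for a two-component motif/background mixture over w-mers. This is a simplified illustration, not the MEME implementation: it assumes a fixed 0th-order background, treats overlapping w-mers as independent, starts the motif matrix from a seed w-mer, and runs a fixed number of iterations. All names and the toy sequences are hypothetical.

```python
BASES = "ACGT"

def all_wmers(sequences, w):
    """Partition the sequences into overlapping w-mers."""
    return [s[i:i + w] for s in sequences for i in range(len(s) - w + 1)]

def motif_likelihood(x, pwm):
    p = 1.0
    for i, base in enumerate(x):
        p *= pwm[i][BASES.index(base)]
    return p

def background_likelihood(x, background):
    p = 1.0
    for base in x:
        p *= background[base]
    return p

def em_motif(sequences, w, seed, background, lam=0.1, iters=50, pseudo=0.1):
    X = all_wmers(sequences, w)
    # Initial guess: most of the weight on the seed base at each position.
    pwm = [[0.7 if b == s else 0.1 for b in BASES] for s in seed]
    z = []
    for _ in range(iters):
        # E-step: posterior probability that each w-mer is a motif site.
        z = []
        for x in X:
            pm = lam * motif_likelihood(x, pwm)
            p0 = (1.0 - lam) * background_likelihood(x, background)
            z.append(pm / (pm + p0))
        # M-step: re-estimate the mixing proportion and the motif matrix
        # from the expected (fractional) counts.
        lam = sum(z) / len(z)
        counts = [[pseudo] * 4 for _ in range(w)]
        for x, zi in zip(X, z):
            for i, base in enumerate(x):
                counts[i][BASES.index(base)] += zi
        pwm = [[c / sum(row) for c in row] for row in counts]
    return pwm, lam, z

# Toy input: four made-up sequences, each carrying the word TGACGT.
seqs = ["AACCTGACGTGGA", "TGACGTCCAATAG", "GGATTGACGTCAA", "CATAGTGACGTTC"]
bg = {b: 0.25 for b in BASES}
pwm, lam, z = em_motif(seqs, 6, "TGACGT", bg)
learned = "".join(BASES[row.index(max(row))] for row in pwm)
print(learned, round(lam, 3))
```

The pseudocount keeps every matrix entry above zero, so a base never seen at a position does not force the motif likelihood of a w-mer to exactly zero.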

Issues with the EM algorithm

• Can get trapped in a local maximum of the likelihood

• Results depend on the initial guess

• Often need to do multiple runs starting with different initial guesses, then pick the best one.

Gibbs sampling

• Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables.

• Gibbs sampling is applicable when the joint distribution is not known explicitly, but the conditional distribution of each variable is known.

• The sequence of samples comprises a Markov chain.

• As the iteration number goes to infinity, the distribution of the samples approaches the underlying joint distribution.
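A textbook illustration of these points, unrelated to motifs, is sampling a standard bivariate normal with correlation rho using only its two conditional distributions, x | y ~ N(rho·y, 1 − rho²) and y | x ~ N(rho·x, 1 − rho²). The sketch below assumes that target; the function name and the burn-in length are arbitrary choices.

```python
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) pairs by alternately drawing each variable given the other."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x, y = 0.0, 0.0
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)  # draw x from p(x | y)
        y = rng.gauss(rho * x, sd)  # draw y from p(y | x)
        if t >= burn_in:
            samples.append((x, y))
    return samples

def correlation(pairs):
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    cov = sum((p[0] - mx) * (p[1] - my) for p in pairs) / n
    vx = sum((p[0] - mx) ** 2 for p in pairs) / n
    vy = sum((p[1] - my) ** 2 for p in pairs) / n
    return cov / (vx * vy) ** 0.5

samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(round(correlation(samples), 2))
```

Successive samples are correlated because each draw depends on the previous state, which is why early iterations are discarded as burn-in.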

Key differences between EM and Gibbs sampling

EM                                     Gibbs sampling

Maximum likelihood                     Posterior

Deterministic                          Stochastic

Frequentist                            Bayesian

Initialize seed for the parameters     Initialize prior for the parameters

Gibbs Motif Sampler

(Lawrence et al. 1993; Liu et al. 1995)

Assume each sequence contains one motif site, but the site positions and the motif frequency matrix are unknown.

Gibbs Motif Sampler

• Take out one sequence with its sites from the current motif

• Score each possible segment of this sequence

• Sample a new segment to put the sequence back

[Figure: segment scores of Sequence 1 plotted against the starting position of the segment (1 to 20); the matrix is built without Sequence 1, and the example label reads "Segment (2-7): 3"]
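The three steps (take one sequence out, score its segments, sample a new segment) can be sketched as follows. This is a simplified illustration rather than the published sampler: it assumes exactly one site per sequence, a uniform background, a fixed pseudocount, and a fixed number of sweeps. The function name and toy sequences are hypothetical.

```python
import random

BASES = "ACGT"

def gibbs_motif_sampler(seqs, w, n_iter=200, pseudo=1.0, seed=0):
    """Return one sampled site start position per sequence."""
    rng = random.Random(seed)
    bg = 0.25  # assumed uniform background frequency for every base
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_iter):
        for k, s in enumerate(seqs):
            # 1. Take out sequence k: build the matrix from the other sites.
            counts = [[pseudo] * 4 for _ in range(w)]
            for j, t in enumerate(seqs):
                if j == k:
                    continue
                for i in range(w):
                    counts[i][BASES.index(t[pos[j] + i])] += 1
            freqs = [[c / sum(row) for c in row] for row in counts]
            # 2. Score each possible segment of sequence k
            #    (motif likelihood divided by background likelihood).
            weights = []
            for start in range(len(s) - w + 1):
                ratio = 1.0
                for i in range(w):
                    ratio *= freqs[i][BASES.index(s[start + i])] / bg
                weights.append(ratio)
            # 3. Sample a new segment with probability proportional to its score.
            pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

# Toy input: four made-up sequences, each carrying the word TGACGT.
seqs = ["AACCTGACGTGGA", "TGACGTCCAATAG", "GGATTGACGTCAA", "CATAGTGACGTTC"]
positions = gibbs_motif_sampler(seqs, 6, seed=1)
print(positions)
```

Because the new position is sampled rather than set to the best-scoring segment, the result varies with the random seed, and a low-scoring segment is occasionally chosen.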

Advantages of Gibbs sampling

• Stochastic sampling permits the algorithm to escape from local optima, making it more robust than deterministic updating as in EM.

• Fast.

Transcription level changes in glucose vs galactose

(Roth 1998)

MDscan (Liu et al. 2002)

• Basic idea
– True targets are likely to be more differentially expressed than other genes.

• Procedure:
– Rank genes according to p-values, gene expression levels, etc.
– Search for the TF motif in the highest-ranking targets first (high signal / background ratio)
– Refine candidate motifs with all targets

Similarity defined by m-match

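For a given w-mer and any other w-mer, count the positions at which the two agree; the two are m-matches if at least m of the w positions match.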

TGTAACGT 8-mer

TGTAACGT matched 8

AGTAACGT matched 7

TGCAACAT matched 6

TGACACGG matched 5

AATAACAG matched 4

m-matches for TGTAACGT

Pick a reasonable m to call two w-mers similar
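Counting matched positions is a position-by-position comparison. The sketch below reproduces the slide's example for the 8-mer TGTAACGT; the function names are hypothetical, and "at least m matched positions" is taken as the definition of an m-match.

```python
def matched_positions(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same length")
    return sum(x == y for x, y in zip(a, b))

def m_matches(seed, candidates, m):
    """Candidates that agree with the seed at m or more positions."""
    return [c for c in candidates if matched_positions(seed, c) >= m]

seed = "TGTAACGT"
others = ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]
for word in others:
    print(word, "matched", matched_positions(seed, word))
print(m_matches(seed, others, m=6))
```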

MDscan Algorithm: Finding candidate motifs

[Figure: Seed1 and Seed2, each shown with its m-matches]

MDscan Algorithm: Scoring candidate motifs

• Maximum a posteriori (MAP) score function, which rewards:
– Abundant motif signal
– Conserved positions
– Specificity (unlikely in the genome background)

• Prefer conserved motifs with many sites that are not often seen in the genome background

• Keep the best 30-50 candidate motifs
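The score function itself did not survive in the text, so the sketch below uses an assumed form that has the three properties listed: score = (log x_m / w) × [Σ_i Σ_j p_ij log p_ij − (1/x_m) Σ_sites log p0(site)], where x_m is the number of sites, p_ij the frequency matrix, and p0 the background probability of a site. Treat this as an assumption to be checked against Liu et al. 2002, not as the published definition; the background here is 0th-order and the function name is hypothetical.

```python
import math

BASES = "ACGT"

def map_score(sites, background):
    """Assumed MAP-style score: abundance x (conservation + specificity)."""
    x_m = len(sites)   # number of aligned sites (abundance)
    w = len(sites[0])  # motif width
    # Conservation: sum of p * log(p) over the matrix (0 when fully conserved).
    conservation = 0.0
    for i in range(w):
        for b in BASES:
            p = sum(s[i] == b for s in sites) / x_m
            if p > 0:
                conservation += p * math.log(p)
    # Specificity: sites that are unlikely under the background raise the score.
    log_bg = sum(math.log(background[b]) for s in sites for b in s)
    return (math.log(x_m) / w) * (conservation - log_bg / x_m)

bg = {b: 0.25 for b in BASES}
conserved = ["ACGT", "ACGT", "ACGT", "ACGT"]   # made-up sites
degenerate = ["ACGT", "ACGA", "ACGC", "ACGG"]  # last position varies
print(map_score(conserved, bg), map_score(degenerate, bg))
```

Under this form a single site scores zero (log 1 = 0), so a candidate needs several sites before conservation and specificity count for anything.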

MDscan Algorithm: Update motifs with remaining seqs

[Figure: Seed1 with its m-matches]

MDscan Algorithm: Refine the motifs

MDscan Algorithm

• Check high signal/background ratio sequences first; these are more likely to yield the correct motif

• Algorithm summary:
– Seed with w-mers in the top sequences, find m-matches to make a matrix
– Keep good motifs to be updated by the remaining sequences
– Refine motifs by removing bad sites

• Can check motifs of any width very fast
– Only considers existing w-mers, finite dataset
– Seeding in top sequences: O(n^2)
– Updating motifs with all sequences: O(n)
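The seeding step in the summary can be sketched as an all-against-all comparison of the w-mers found in the top sequences, which is where the quadratic cost comes from. This is a simplified illustration with made-up sequences; `seed_candidates` is a hypothetical name, and the later update and refinement steps are not shown.

```python
BASES = "ACGT"

def seed_candidates(top_seqs, w, m):
    """For each w-mer in the top sequences, collect its m-matches and a count matrix."""
    words = [s[i:i + w] for s in top_seqs for i in range(len(s) - w + 1)]
    candidates = {}
    for seed in set(words):
        # Compare the seed against every existing w-mer (quadratic overall).
        group = [x for x in words
                 if sum(a == b for a, b in zip(seed, x)) >= m]
        matrix = [[sum(x[i] == b for x in group) for b in BASES]
                  for i in range(w)]
        candidates[seed] = (group, matrix)
    return candidates

# Made-up top-ranked sequences, each carrying a variant of TGTAACGT.
top = ["CCTGTAACGTAA", "GATGTAACATCC", "TTAGTAACGTGG"]
cands = seed_candidates(top, w=8, m=6)
group, matrix = cands["TGTAACGT"]
print(group)
print(matrix[0])  # counts of A, C, G, T at the first motif position
```

Only w-mers that actually occur in the data are used as seeds, so the number of candidates is bounded by the size of the dataset rather than by 4^w.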

Word enumeration

YMF (Sinha and Tompa 2002)

• Search ALL possible w-mers. For each w-mer, calculate a z-score measuring whether it is over-represented in the selected sequences vs the background.

• Rank the words by z-score.

• Select the top ones.

Advantage:

• Global optimum

Drawback:

• Computational time grows exponentially with w, so it can only be used to search for short motifs (6- to 10-mers).
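A bare-bones version of word enumeration is sketched below. It is not YMF's statistic: it assumes a 0th-order background and a simple binomial approximation for the variance, ignoring the overlap corrections a careful z-score needs. The function name and toy sequences are hypothetical.

```python
import itertools
import math
from collections import Counter

BASES = "ACGT"

def word_zscores(seqs, w, background):
    """z-score of the observed count of every possible w-mer vs. expectation."""
    observed = Counter(s[i:i + w] for s in seqs for i in range(len(s) - w + 1))
    n_windows = sum(observed.values())
    scores = {}
    # Enumerate all 4**w words: this loop is the exponential cost.
    for letters in itertools.product(BASES, repeat=w):
        word = "".join(letters)
        p = 1.0
        for b in word:
            p *= background[b]
        expected = n_windows * p
        sd = math.sqrt(n_windows * p * (1.0 - p))
        scores[word] = (observed[word] - expected) / sd
    return scores

seqs = ["ACGACGACG", "TTACGTT"]  # made-up input sequences
bg = {b: 0.25 for b in BASES}
scores = word_zscores(seqs, 3, bg)
top_word = max(scores, key=scores.get)
print(top_word, round(scores[top_word], 2))
```

With w = 3 there are only 64 words to score; at w = 10 there are over a million, which is the practical limit the drawback refers to.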
