Detection and analysis of transcriptional control sequences. Wyeth Wasserman, October VanBUG Seminar. Centre for Molecular Medicine and Therapeutics, Children’s and Women’s Hospital, University of British Columbia.

Detection and analysis of transcriptional control sequences

Page 1: Detection and analysis of transcriptional control sequences

Detection and analysis of transcriptional control sequences

Wyeth Wasserman

October VanBUG Seminar

Centre for Molecular Medicine and Therapeutics

Children’s and Women’s Hospital

University of British Columbia

Page 2: Detection and analysis of transcriptional control sequences

CMMT

Transcription Simplified

[Figure: schematic of a promoter with labels URE, TATA, URF and Pol-II]

Page 3: Detection and analysis of transcriptional control sequences

Overview of Transcription in Gene Regulation

• At the most basic level, transcriptional regulation is defined by binding of TFs to DNA

• Complexity is increased by TF interactions, chromatin structure and protein modifications

• How can we advance our understanding of regulation by computational analysis?

Page 4: Detection and analysis of transcriptional control sequences

A short history lesson…

Page 5: Detection and analysis of transcriptional control sequences

Representing Binding Sites for a TF (HNF1)

• A set of sites represented as a consensus
    » VDRTWRWWSHDWVWH

• A matrix describing a set of sites

A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

• A single HNF1 site
    » AAGTTAATGATTAAC
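As a rough illustration of how a consensus string stands for a set of sites, the sketch below checks the single HNF1 site from this slide against the consensus using the standard IUPAC nucleotide ambiguity codes. The function name `matches_consensus` is made up for this example and is not from the talk.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_consensus(site, consensus):
    """Return True if every base of `site` is allowed by the consensus code."""
    if len(site) != len(consensus):
        return False
    return all(base in IUPAC[code] for base, code in zip(site, consensus))

# Site and consensus copied from the slide.
hnf1_site = "AAGTTAATGATTAAC"
hnf1_consensus = "VDRTWRWWSHDWVWH"
print(matches_consensus(hnf1_site, hnf1_consensus))
```

A consensus only answers yes or no for a candidate site; the count matrix keeps how often each base was seen at each position, which is what the profile-based methods on the following slides build on.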

Page 6: Detection and analysis of transcriptional control sequences

PFMs to PWMs

One would like to add the following features to the model:
1. Correcting for the base frequencies in DNA
2. Weighting for the confidence (depth) in the pattern
3. Convert to log-scale probability for easy arithmetic

f matrix

A  5  0  1  0  0
C  0  2  2  4  0
G  0  3  1  0  4
T  0  0  1  1  1

w matrix

A  1.6 -1.7 -0.2 -1.7 -1.7
C -1.7  0.5  0.5  1.3 -1.7
G -1.7  1.0 -0.2 -1.7  1.3
T -1.7 -1.7 -0.2 -0.2 -0.2

[Formula converting f(b,i) to w(b,i): a Log of a ratio involving f(b,i), p(b) and N; the exact layout is not recoverable from the slide text]

TGCTG = 0.9
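The conversion from the f matrix to the w matrix can be sketched as follows. The slide's formula did not survive extraction, so this example assumes a pseudocount of sqrt(N) times the background frequency, a uniform background p(b) = 0.25, and log base 2. Under those assumptions the values agree with the slide's w matrix to one decimal place, but the exact formula used in the talk may differ. The names `pfm_to_pwm` and `score` are illustrative.

```python
import math

# Count matrix (f matrix) from the slide: one row per base, one column per position.
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}

def pfm_to_pwm(pfm, background=0.25):
    """Convert counts to log2-odds weights.

    Assumption (not confirmed by the slide text): the pseudocount for each
    base is sqrt(N) * background, where N is the number of sites in a column.
    """
    width = len(pfm["A"])
    pwm = {base: [] for base in pfm}
    for i in range(width):
        n = sum(pfm[base][i] for base in pfm)
        pseudo = math.sqrt(n) * background
        for base in pfm:
            # Corrected probability of this base at this position.
            prob = (pfm[base][i] + pseudo) / (n + math.sqrt(n))
            # Log-odds against the background base frequency.
            pwm[base].append(math.log2(prob / background))
    return pwm

def score(pwm, site):
    """Sum the weights of the observed base at each position."""
    return sum(pwm[base][i] for i, base in enumerate(site))

pwm = pfm_to_pwm(pfm)
print(round(score(pwm, "TGCTG"), 1))
```

Working in log space is what makes the site score a simple sum over positions, which is the "easy arithmetic" the third point refers to.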

Page 7: Detection and analysis of transcriptional control sequences

Performance of Profiles

• 95% of predicted sites bound in vitro (Tronche 1997)

• MyoD binding sites predicted about once every 600 bp (Fickett 1995)

• The Futility Theorem
  – Nearly 100% of predicted transcription factor binding sites have no function in vivo

Page 8: Detection and analysis of transcriptional control sequences

A 1 kbp promoter screened with collection of TF profiles

Page 9: Detection and analysis of transcriptional control sequences

Phylogenetic Footprinting

70,000,000 years of evolution reveals most regulatory regions.

Page 10: Detection and analysis of transcriptional control sequences

Phylogenetic Footprinting to Identify Functional Segments

[Figure: % identity (y-axis) against 200 bp window start position in the human sequence (x-axis)]

Actin gene compared between human and mouse with DPB.

Page 11: Detection and analysis of transcriptional control sequences

Regulatory sites are usually conserved between orthologous genes

HUMAN ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA
      -|-|||||||||-|---|--|||-------|-|---|
MOUSE GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA

HUMAN ACATCAGCATACACGCAACTACACAGACTACGACTA
      ---|||||-||||---|-|----||-||-||||---
MOUSE CGTTCAGCTTACAGCTAGCATAGCATACGACGATAC
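The match line and the windowed percent identity used in phylogenetic footprinting can be sketched as below, using the first human/mouse pair from this slide. The function names are made up for the example, "." is taken to be the gap character, and a 10-column window stands in for the 200 bp window of the plot because the example sequences are short.

```python
def match_line(seq_a, seq_b, gap="."):
    """Mark identical aligned bases with '|' and everything else with '-'."""
    return "".join(
        "|" if a == b and a != gap else "-" for a, b in zip(seq_a, seq_b)
    )

def window_identity(seq_a, seq_b, window, gap="."):
    """Percent identity in each window of aligned columns, sliding by one."""
    marks = match_line(seq_a, seq_b, gap)
    return [
        100.0 * marks[start:start + window].count("|") / window
        for start in range(len(marks) - window + 1)
    ]

# First aligned pair from the slide.
human = "ACGATACGCATCACAGACT.ACAGACTACGGCTAGCA"
mouse = "GCAATACGCATCGCGATCAGACATCAGCACG.TGTGA"
print(match_line(human, mouse))
print(window_identity(human, mouse, window=10)[:5])
```

Windows whose identity stays above a chosen cutoff would be the candidate conserved segments; the cutoff itself is a tuning decision that this sketch leaves open.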

Page 12: Detection and analysis of transcriptional control sequences

The 1kbp promoter screen with footprinting

Page 13: Detection and analysis of transcriptional control sequences

Choosing the “right” species... (BONUS: What’s the ultimate sin in bioinformatics?)

[Figure: three pairwise comparisons, HUMAN vs. COW, HUMAN vs. MOUSE and HUMAN vs. CHICKEN]

Page 14: Detection and analysis of transcriptional control sequences

ConSite (www.phylofoot.org)

Page 15: Detection and analysis of transcriptional control sequences

Performance: Human vs. Mouse

• Testing set: 40 experimentally defined sites in 15 well studied genes

• 85-95% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

Page 16: Detection and analysis of transcriptional control sequences

de novo Discovery of TF Binding Sites

Page 17: Detection and analysis of transcriptional control sequences

Unraveling Transcriptional Control Mechanisms

Given a set of “co-regulated” genes, define motifs over-represented in the regulatory regions

Page 18: Detection and analysis of transcriptional control sequences

Pattern Detection Methods

• Exhaustive – e.g. “Moby Dick” (Bussemaker, Li & Siggia)

– Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

• Monte Carlo/Gibbs Sampling – e.g. AnnSpec (Workman & Stormo)

– Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics
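The exhaustive idea of comparing oligomer counts between a "+" and a "-" promoter collection can be sketched as below. This is a deliberately simple frequency-ratio ranking, not the Moby Dick algorithm itself. The function names, the pseudocount and the toy sequences are all invented for the illustration.

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enrichment(positive, negative, k, pseudo=1.0):
    """Rank k-mers by the ratio of their frequencies in '+' versus '-' sets."""
    pos, neg = kmer_counts(positive, k), kmer_counts(negative, k)
    pos_total = sum(pos.values())
    neg_total = sum(neg.values())
    ratios = {}
    for kmer, count in pos.items():
        # The pseudocount keeps the ratio finite for k-mers absent from '-'.
        pos_freq = (count + pseudo) / (pos_total + pseudo)
        neg_freq = (neg[kmer] + pseudo) / (neg_total + pseudo)
        ratios[kmer] = pos_freq / neg_freq
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Toy sequences invented for this example; "TGACG" is planted in the "+" set.
positive = ["AATGACGTT", "CCTGACGAA", "GTTGACGCC"]
negative = ["AACCGGTTA", "CCAATTGGA", "GGTTAACCA"]
print(enrichment(positive, negative, k=5)[0][0])
```

A real analysis would also need a significance estimate for each ratio, since short oligomers occur by chance in long promoter collections.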

Page 19: Detection and analysis of transcriptional control sequences

Yeast Regulatory Sequence Analysis (YRSA) system

Page 20: Detection and analysis of transcriptional control sequences

Yeast tests of YRSA System

PDR3-regulated genes from array study

Classic cell-cycle array data re-clustered by Getz et al

DNA-damage response partially mediated by MCB

Page 21: Detection and analysis of transcriptional control sequences

THE PROBLEM:

Pattern Detection in Long Sequences

[Figure: MEF2 similarity score (y-axis, 10 to 18) against sequence length (x-axis, 0 to 600), for the MEF2 set and a random set]

Page 22: Detection and analysis of transcriptional control sequences

Four Approaches to Extend Sensitivity

• Phylogenetic Footprinting
  – Human-Mouse eliminates ~75% of sequence

• Better background models
  – e.g. AnnSpec

• Better definition of co-regulation
  – Microarrays occasionally produce noise

• Use biochemical knowledge about TFs
  – TFBS patterns are NOT random

Page 23: Detection and analysis of transcriptional control sequences

Some characteristics have been explored…

• Segmentation: informative positions separated by variable positions (proteins bind as dimers)

• Positional Variance: subset of positions contain most of the info

• Palindromes are common in the patterns
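Two of these characteristics are easy to measure on a profile or a site. The sketch below computes the information content of one matrix column in bits, and tests whether a site is its own reverse complement. It assumes that "palindrome" here means a reverse-complement palindrome, the usual sense for DNA, and it applies no small-sample correction to the information content. Both function names are illustrative.

```python
import math

def information_content(column):
    """Bits of information in one matrix column of base counts (A, C, G, T)."""
    total = sum(column)
    bits = 2.0  # maximum for a four-letter alphabet
    for count in column:
        if count > 0:
            p = count / total
            bits += p * math.log2(p)
    return bits

def is_palindromic(site):
    """True if the site equals its own reverse complement."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return site == "".join(complement[base] for base in reversed(site))

# An invariant column carries 2 bits; a uniform column carries none.
print(information_content([0, 0, 0, 21]))
print(information_content([5, 5, 5, 5]))
print(is_palindromic("GAATTC"))
```

Columns with high information content are the "informative positions" of the segmentation point, so plotting this value along a profile shows how the information is distributed.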

Page 24: Detection and analysis of transcriptional control sequences

Our Hypothesis

• Point 1: Structurally-related DNA binding domains interact with similar target sequences

• Exceptions exist (e.g. Zn-fingers)

• Point 2: There are a finite number of binding domains used in human TFs

• Approximately 20-25

• Idea: We could use the shared binding properties for each family to focus pattern detection methods

• Constrain the range of patterns sought

Page 25: Detection and analysis of transcriptional control sequences

Comparison of profiles requires alignment and a scoring function

• Scoring function based on sum of squared differences

• Align frequency matrices with modified Needleman-Wunsch algorithm

• Calculate empirical p-values based on simulated set of matrices
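A much simplified version of these three steps might look like the sketch below. It substitutes an ungapped sliding alignment for the modified Needleman-Wunsch algorithm named on the slide, and the "2 minus sum of squared differences" column score is an assumed normalisation chosen so that identical frequency columns score 2 and fully different ones score 0. All function names are hypothetical.

```python
def column_score(col_a, col_b):
    """Similarity of two frequency columns: 2 minus the sum of squared differences."""
    return 2.0 - sum((a - b) ** 2 for a, b in zip(col_a, col_b))

def best_ungapped_alignment(profile_a, profile_b):
    """Slide profile_b along profile_a and return (best score, offset)."""
    best = (float("-inf"), 0)
    for offset in range(-(len(profile_b) - 1), len(profile_a)):
        total = 0.0
        for j, col_b in enumerate(profile_b):
            i = offset + j
            if 0 <= i < len(profile_a):
                total += column_score(profile_a[i], col_b)
        if total > best[0]:
            best = (total, offset)
    return best

def empirical_p_value(observed, simulated_scores):
    """Fraction of simulated scores at least as large as the observed one."""
    hits = sum(1 for s in simulated_scores if s >= observed)
    return hits / len(simulated_scores)

# Toy profile: each column is a list of A, C, G, T frequencies.
profile = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
print(best_ungapped_alignment(profile, profile))
```

Because the raw score grows with the length of the overlap, comparing it against scores from simulated matrices of matching widths is what makes results comparable across profiles.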

[Figure: frequency distribution of scores (x-axis: Score, y-axis: Frequency)]

Page 26: Detection and analysis of transcriptional control sequences

Prediction of TF Class

TF Database (JASPAR)

COMPARE

Match to bHLH

Jackknife Test 87% correct

Independent Test Set 93% correct

Page 27: Detection and analysis of transcriptional control sequences

Page 28: Detection and analysis of transcriptional control sequences

FBPs enhance sensitivity of pattern detection

Page 29: Detection and analysis of transcriptional control sequences
Page 30: Detection and analysis of transcriptional control sequences

APPLICATION:

Cancer Protection Response

• Detoxification-related enzymes are induced by compounds present in broccoli

• Arrays, SSH and hard work have defined a set of responsive genes

• A known element mediates the response (Antioxidant Responsive Element)

• Controversy over the type of mediating leucine zipper TF

• NF-E2/Maf or Jun/Fos

Page 31: Detection and analysis of transcriptional control sequences

Gibbs Sampling

Application (2)

Problem: Given a set of co-regulated genes, determine the common TFBS. Classify the mediating TF. We expect a leucine zipper-type TF.

Gibbs with FBP Prior

Classify New TF Motif

Maf (p<0.02)

Jun (p<0.98)

Page 32: Detection and analysis of transcriptional control sequences

Regulatory Modules

TFs do NOT act in isolation

Page 33: Detection and analysis of transcriptional control sequences

Layers of Complexity in Metazoan Transcription

Chromatin picture used with permission of Zymogenetics.

Page 34: Detection and analysis of transcriptional control sequences

Liver Differentiation (data mostly from studies of hepatocytes)

CEBP HNF3 HNF1 HNF4

Stem Early Fetal Mature

Page 35: Detection and analysis of transcriptional control sequences

Liver regulatory modules

Page 36: Detection and analysis of transcriptional control sequences

Models for Liver TFs… (Data that takes 2 months to produce and 10 seconds to present) (Or, what to do with an astrophysicist new to bioinformatics)

HNF1

C/EBP

HNF3

HNF4

Page 37: Detection and analysis of transcriptional control sequences

Training predictive models for modules

• Limited by small size of positive training set

• We elected to use logistic regression analysis for the first models

• Your favorite statistical approach would probably do equally well – data limited

Page 38: Detection and analysis of transcriptional control sequences

Logistic Regression Analysis

“logit”

Optimize vector to maximize the distance between output values for positive and negative training data.

Output value is:

p(x) = e^{logit} / (1 + e^{logit})
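The output function above can be written directly in code. In the sketch below the logit is assumed to be an intercept plus a weighted sum of per-TF site scores, which is the standard form for logistic regression. The weights and scores in the usage line are placeholders rather than fitted values from the talk, and `module_score` is a hypothetical name.

```python
import math

def logistic(logit):
    """Map a logit (any real number) to a value between 0 and 1."""
    return math.exp(logit) / (1.0 + math.exp(logit))

def module_score(site_scores, weights, intercept=0.0):
    """Logit = intercept + weighted sum of per-TF site scores (assumed form)."""
    logit = intercept + sum(w * x for w, x in zip(weights, site_scores))
    return logistic(logit)

# Placeholder inputs: two TF site scores and two made-up weights.
print(module_score([1.0, 2.0], [0.5, -0.25]))
```

Training then amounts to choosing the weight vector so that windows from known modules score near 1 and background windows score near 0.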

Page 39: Detection and analysis of transcriptional control sequences

UDPGT1 (Gilbert’s Syndrome)

[Figure: liver module model score (y-axis) against “window” position in sequence (x-axis), for wildtype and mutant]

Page 40: Detection and analysis of transcriptional control sequences

PERFORMANCE

• Liver (Genome Research, 2001)
  – At 1 hit per 35 kbp, identifies 60% of modules
  – Limited to genes expressed late in liver development

• Skeletal Muscle (JMB, 1998)
  – Set to 1 prediction per 35 000 bp
  – Identifies 66% of test set correctly

LRA Models do not account for multiple sites for the same TF*

* Side-track: Newer Methods

Page 41: Detection and analysis of transcriptional control sequences

Combining Phylogenetic Footprinting with a Module Model

Page 42: Detection and analysis of transcriptional control sequences

Genome Scan

• Screened the available mouse genomic sequences (~300 MB) for modules and discarded hits for which sequence was not conserved with human (BLAST)

• Removed regions for which corresponding human sequence did not score as module

• Of ~100 predicted modules
  • 20 annotated genes: 5 from training, 3 additional modules, 5 liver specific, 3 unknown and 4 not liver

Page 43: Detection and analysis of transcriptional control sequences

de novo Discovery of Regulatory Modules

Page 44: Detection and analysis of transcriptional control sequences

Focus on regulatory modules for pattern detection

Cluster Genes by Expression

Identify and Model Contributing TFs

6 0 0 0 7 0 0
2 8 4 7 1 0 2
0 0 4 0 0 8 0
0 0 0 1 0 0 6

Predictive Models

Page 45: Detection and analysis of transcriptional control sequences

Finding binding sites in sets of co-regulated human genes

• Sequence “space” is too large
  – Narrow with Phylogenetic Footprinting

• Identify patterns in conserved blocks via Gibbs sampling

• Assess quality of patterns based on biological knowledge
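The Gibbs sampling step can be illustrated with a bare-bones site sampler: one motif occurrence per sequence, fixed motif width, uniform background. This is only a sketch of the general technique, not the sampler used in the talk. It has no familial binding profile prior, no phylogenetic filtering and no convergence check, and the toy sequences are invented.

```python
import random

def gibbs_sample_sites(seqs, width, iterations=100, seed=0):
    """Return one motif start position per sequence after Gibbs sampling."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for held_out, target in enumerate(seqs):
            # Build a count profile from every sequence except the held-out one,
            # starting each cell at a pseudocount of 1.
            counts = [{base: 1.0 for base in "ACGT"} for _ in range(width)]
            for j, seq in enumerate(seqs):
                if j != held_out:
                    for k in range(width):
                        counts[k][seq[starts[j] + k]] += 1.0
            column_total = len(seqs) - 1 + 4.0
            # Weight each candidate start by its likelihood ratio against a
            # uniform background, then sample a new start for the target.
            weights = []
            for start in range(len(target) - width + 1):
                ratio = 1.0
                for k in range(width):
                    ratio *= (counts[k][target[start + k]] / column_total) / 0.25
                weights.append(ratio)
            starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy conserved blocks invented for this example (A, C, G, T only).
blocks = ["AATGACGTT", "CCTGACGAA", "GTTGACGCC"]
print(gibbs_sample_sites(blocks, width=5, iterations=20, seed=1))
```

Because the sampler is stochastic, it is normally run from several random starts and the resulting patterns are compared, which is where the biological quality assessment in the last point comes in.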

Page 46: Detection and analysis of transcriptional control sequences

Phylogenetic Footprinting to Identify Conserved Regions

Page 47: Detection and analysis of transcriptional control sequences

Skeletal Muscle Genes

• One of the most extensively studied tissues for transcriptional regulation
  – 45 genes partially analyzed
  – 26 genes with orthologous genomic sequence from human and rodent

• Five primary classes of transcription factors
  – Principal: Myf (myoD), Mef2, SRF
  – Secondary: Sp1 (G/C rich patches), Tef (subset of skeletal muscle types)

Page 48: Detection and analysis of transcriptional control sequences

Regulatory regions directing muscle-specific transcription

MyoD/Myf SRF

Mef2 Tef

Page 49: Detection and analysis of transcriptional control sequences

de novo Discovery of Skeletal Muscle Transcription Factor Binding Sites

Mef2-Like SRF-Like Myf-Like

Page 50: Detection and analysis of transcriptional control sequences

We will soon be able to define modules for many contexts…

Page 51: Detection and analysis of transcriptional control sequences

A gene-centric data integration project...

Page 52: Detection and analysis of transcriptional control sequences

COMING SOON:

The Integrated Module Sampler

Gene1 Gene2 Gene3 Gene4 Gene5

Calls to ensEMBL

Calls to GeneLynx

Calls to BlastZ (Switch to Lagan?)

Module Sampler

Page 53: Detection and analysis of transcriptional control sequences

YOU SHOULD HAVE BEEN THERE… THIS SLIDE EXCLUDED FROM THE POSTED FILE

Page 54: Detection and analysis of transcriptional control sequences

Conclusions

• Evolution drives understanding in biology
  – Phylogenetic Footprinting

• Biochemistry inspires Bioinformatics
  – Regulatory Modules
  – Familial Binding Profiles

• Analysis of regulatory sequences is improving
  – Given sets of orthologous genes, one can predict regulatory regions
  – Given sets of co-regulated genes, it is possible to infer the binding profiles for critical transcription factors

• Much more work is needed…

Page 55: Detection and analysis of transcriptional control sequences

THANKS!

Wasserman Group – CMMT: Danielle Kemmer, Several Newcomers

Wasserman Group - Sweden: Albin Sandelin, Raf Podowski (CA), Wynand Alkema

Collaborating Students: Malin Andersson (Odeberg), Öjvind Johansson (Lagergren), Hui Gao (Dahlman-Wright), Emily Hodges (Höög)

Collaborators: Chip Lawrence (Wadsworth), Boris Lenhard (K.I.), Jens Lagergren (SBC), Christer Höög (K.I.), Brenda Gallie (OCI), Jacob Odeberg (KTH), Niclas Jareborg (AZ), William Hayes (AZ)

Group Alumni: Per Engström, Elena Herzog, Annette Höglund, William Krivan, Boris Lenhard, Luis Mendoza

Support: Merck-Frosst, C&W, Pharmacia, EU–Marie Curie, CGDN, KI-Funder

Page 56: Detection and analysis of transcriptional control sequences

URLs...

• Group: www.cmmt.ubc.ca

• ConSite/DPB: www.phylofoot.org

• GeneLynx: www.genelynx.org