TESS-II


Describing and Finding Gene Regulatory Sequences with Grammars

Jonathan Schug and Christian J. Stoeckert, Jr.

Center for Bioinformatics at the University of Pennsylvania.

jschug,[email protected]

Background:

Example 1: Using Bounds and Annotation

Example 2: Using Collection Productions

Integration with GUS:

Conclusion:

Problem: Too many binding sites

• The typical search with TESS yields about 1 binding site start per base and a good match every few bases!

• Which ones are active?

Combining Signals and Sites

• Want to consider:

– Pairs or triples (or more) of binding sites,

– Sites in particular contexts, e.g., CpG islands, introns, regions of homology,

– Physico-chemical properties of DNA sequence.

Why Grammars?

• Grammars provide a means of describing complex reusable structures and combining them in novel ways.

• Analysis of some promoters suggests a modular design.

• Grammars are a way of moving beyond dimers to structured descriptions of promoters, enhancers, and entire genes.

Grammars

• Alphabet – a finite set of letter symbols.

• String – a finite sequence of letters from an alphabet.

• Language – a set of strings over an alphabet.

• Grammar – a formal description of a language.

Using the alphabet of DNA, we want to find a grammar for the language of promoters of genes expressed in liver.

Example Grammar

• A simple example for a small portion of English.

• A context-free grammar has four components:

– Alphabet = {a,b,c,…,z,space}

– Symbols = {S,Vp,Av,V,Np,D,Aj,N}

– Start symbol = S

– Productions:

S -> Np Vp;
Np -> D Aj N;
Vp -> Av V Np | Av V;
D -> a | the | his | her;
Aj -> happy | silver | new | ;
N -> boy | girl | bicycle | guitar;
Av -> slowly | quietly | ;
V -> rides | strums;
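Because this toy grammar has no recursion, its language is finite and can be enumerated directly. The following is a minimal Python sketch of that idea, not part of TESS-II; the dictionary encoding and the `expand` helper are illustrative assumptions, and the empty alternative is written as `""`.

```python
from itertools import product

# Productions of the toy English grammar; "" encodes the empty alternative.
GRAMMAR = {
    "S":  [["Np", "Vp"]],
    "Np": [["D", "Aj", "N"]],
    "Vp": [["Av", "V", "Np"], ["Av", "V"]],
    "D":  [["a"], ["the"], ["his"], ["her"]],
    "Aj": [["happy"], ["silver"], ["new"], [""]],
    "N":  [["boy"], ["girl"], ["bicycle"], ["guitar"]],
    "Av": [["slowly"], ["quietly"], [""]],
    "V":  [["rides"], ["strums"]],
}

def expand(symbol):
    """Yield every word sequence (as a tuple) derivable from `symbol`."""
    if symbol not in GRAMMAR:
        # Terminal: a single word, or nothing for the empty alternative.
        yield (symbol,) if symbol else ()
        return
    for alternative in GRAMMAR[symbol]:
        # Expand each right-hand-side symbol, then combine all choices.
        parts = [list(expand(s)) for s in alternative]
        for combo in product(*parts):
            yield tuple(word for part in combo for word in part)

# The language described by the grammar, as a set of sentences.
LANGUAGE = {" ".join(words) for words in expand("S")}
```

Membership in the language is then a set lookup, for example `"the happy boy slowly rides a silver bicycle" in LANGUAGE`. Enumeration only works for small, non-recursive grammars; a real parser works from the input string instead.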

Derivation Trees

The happy boy slowly rides a silver bicycle

S
  Np
    D (The)  Aj (happy)  N (boy)
  Vp
    Av (slowly)  V (rides)
    Np
      D (a)  Aj (silver)  N (bicycle)

• Derivation trees record the productions used in generating or parsing a string.

• They yield a structured description of the string.
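To make the idea of recording productions concrete, here is a small backtracking parser for the toy English grammar that returns a derivation tree. It is a sketch for illustration only and says nothing about how the TESS-II parser is implemented; the tree encoding `(symbol, children)` and the function names are assumptions.

```python
# Toy English grammar; "" encodes the empty alternative.
GRAMMAR = {
    "S":  [["Np", "Vp"]],
    "Np": [["D", "Aj", "N"]],
    "Vp": [["Av", "V", "Np"], ["Av", "V"]],
    "D":  [["a"], ["the"], ["his"], ["her"]],
    "Aj": [["happy"], ["silver"], ["new"], [""]],
    "N":  [["boy"], ["girl"], ["bicycle"], ["guitar"]],
    "Av": [["slowly"], ["quietly"], [""]],
    "V":  [["rides"], ["strums"]],
}

def parse(symbol, words, pos):
    """Yield (tree, next_pos) for each way `symbol` derives words[pos:next_pos]."""
    if symbol not in GRAMMAR:
        if symbol == "":
            yield "", pos                      # empty alternative consumes nothing
        elif pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1              # terminal word matched
        return
    for alternative in GRAMMAR[symbol]:
        for children, end in parse_sequence(alternative, words, pos):
            yield (symbol, children), end      # record which production was used

def parse_sequence(symbols, words, pos):
    """Yield (list_of_trees, next_pos) for a sequence of symbols."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in parse(symbols[0], words, pos):
        for rest, end in parse_sequence(symbols[1:], words, mid):
            yield [tree] + rest, end

def derive(sentence):
    """Return the first derivation tree covering the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None
```

Calling `derive("The happy boy slowly rides a silver bicycle")` returns a nested tuple whose top-level children are the `Np` and `Vp` subtrees, which is the structured description the tree conveys.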

Parser Data Flow

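[Figure: parser data flow. The parser takes grammar rules, a main stream read from GUS, DAS, or a flat file (FASTA), and other streams derived from GUS, DAS, or processing of the main stream, and outputs matches (XML). The numbers 1 to 5 in the figure refer to the steps below.]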

1. Parser loads grammar from flat file.

2. Main sequence is extracted from a file, GUS database, or DAS server via a plugin.

3. Sequence is handed to secondary plugins to populate their streams.

4. Grammar rules are applied.

5. Matching instances are output in XML.

Data Streams

• Main character stream: e.g. DNA or AA

acgtagtccgcgcgagcgttagcgagataggcagaatatagca

• Annotation stream: Genes, Repeats, Homology, etc.

e.g. Gene, Transcript, 5'UTR, Exon, Intron, 3'UTR

• Real value stream: e.g. CG-content

0.31 0.33 0.41 0.64 0.60 0.58 0.51 0.44 0.38

Patterns can be specified in terms of characters, real values, or structured annotation.
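As an illustration of a real value stream, the sketch below derives a CG-content stream from a character stream with a sliding window and then selects windows above a threshold. The window size, step, and function names are assumptions made for this example; the poster does not say how TESS-II computes its CG-content stream.

```python
def cg_content_stream(sequence, window=10, step=5):
    """Return (start, fraction) pairs: the C+G fraction of each window."""
    sequence = sequence.lower()
    values = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window]
        cg = chunk.count("c") + chunk.count("g")
        values.append((start, cg / window))
    return values

def regions_above(stream, threshold):
    """Return the window starts whose value is greater than `threshold`."""
    return [start for start, value in stream if value > threshold]

# Example: windows of 4 bases, advancing 2 bases at a time.
stream = cg_content_stream("ccggaattcc", window=4, step=2)
rich = regions_above(stream, 0.6)
```

Selecting windows with `regions_above(stream, 0.6)` plays the same role as a numeric comparison on a real value stream in a pattern.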

Annotation Objects

• Attributes for stream annotation objects.

• All can be used to select annotation in grammar.

Syntax: StreamName::NameValue[name rel value,…]. For example, Gene::Intron[index=1] selects the first intron.
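The selection syntax can be read as a filter over annotation objects. The following Python sketch shows one way such a filter could work; the `Annotation` class, its fields, and the three supported relations are hypothetical and are not taken from the TESS-II implementation.

```python
import operator
from dataclasses import dataclass, field

# Relations allowed in a constraint (an assumed subset).
RELATIONS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

@dataclass
class Annotation:
    stream: str          # e.g. "Gene"
    name: str            # e.g. "Intron"
    start: int
    end: int
    attrs: dict = field(default_factory=dict)

def select(annotations, stream, name, constraints=()):
    """Return annotations matching stream::name[attr rel value, ...]."""
    hits = []
    for a in annotations:
        if a.stream != stream or a.name != name:
            continue
        if all(attr in a.attrs and RELATIONS[rel](a.attrs[attr], value)
               for attr, rel, value in constraints):
            hits.append(a)
    return hits

annotations = [
    Annotation("Gene", "Exon", 0, 120, {"index": 1}),
    Annotation("Gene", "Intron", 120, 900, {"index": 1}),
    Annotation("Gene", "Intron", 1010, 1500, {"index": 2}),
]
# Equivalent in spirit to Gene::Intron[index=1].
first_intron = select(annotations, "Gene", "Intron", [("index", "=", 1)])
```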

Bounding the Size of Matches

A -(P.R)-> B C D; /* bounded subparse */

P ---> Q R S; /* defines context */

[Figure: the example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc annotated with the parse of A (B C D) and the parse of P (Q R S).]

Expansion of a production is bounded by some interval, either another production or some annotation.

What Genes Are Regulated by CREB?

Collaboration with M. Mackiewicz and A. Pack, Center for Sleep and Respiratory Neurobiology, UPenn.

• CREB (cyclic AMP response element binding protein).

• Binds as a dimer to the CRE site, with full consensus TGACGTCA or half site CGTCA.

• Member of a family including CREB, CREM, and ATF-1.

• CREB and CREM have several isoforms, which may be (conditional) activators, inhibitors, or inducible repressors.

• Inducible by a wide variety of signals including cAMP, Ca, growth factors, hypoxia, survival signals, and UV light.

• Mutations affect a variety of cell types, with effects including circadian rhythm and learning defects.

• Thought to bind close to the TSS because it interacts with the transcription complex via CBP.
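A plain string search is enough to locate the consensus sites mentioned above. The sketch below scans both strands for a motif; it is an illustration in Python rather than the TESS method, and the function names are assumptions. The full consensus TGACGTCA is its own reverse complement, so it is reported once on the "+" strand only.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (lowercase)."""
    return seq.lower().translate(COMPLEMENT)[::-1]

def find_sites(sequence, motif):
    """Return sorted (start, strand) pairs for every occurrence of `motif`."""
    sequence, motif = sequence.lower(), motif.lower()
    patterns = {"+": motif}
    if reverse_complement(motif) != motif:
        # Non-palindromic motifs must also be searched on the other strand.
        patterns["-"] = reverse_complement(motif)
    hits = []
    for strand, pattern in patterns.items():
        start = sequence.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = sequence.find(pattern, start + 1)   # allow overlaps
    return sorted(hits)

full_sites = find_sites("aatgacgtcaaa", "TGACGTCA")
half_sites = find_sites("aatgacgtcaaa", "CGTCA")
```

In this example the single full site contains two half sites, one on each strand, which is consistent with CREB binding the full site as a dimer.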

Activation of CREB

(Nature Reviews: Molecular Cell Biology August 2001)

CREB Family Members

TESS Weight Matrix for CREB

Information content of weight matrix.

Accuracy curves are used to pick thresholds with estimated sensitivity and specificity.
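For readers unfamiliar with the information content of a weight matrix, the sketch below computes the standard per-column Shannon measure in bits, assuming uniform background base frequencies. Whether TESS uses exactly this formula, a different background, or a small-sample correction is not stated here, so treat it as a generic illustration; the count-matrix layout is also an assumption.

```python
import math

def column_information(counts):
    """Information content (bits) of one column of base counts.

    Computed as 2 - H, where H is the Shannon entropy of the observed
    base frequencies and 2 bits is the maximum for a 4-letter alphabet.
    """
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy

def matrix_information(columns):
    """Return the per-column information content of a count matrix."""
    return [column_information(col) for col in columns]

# Hypothetical counts for a 3-column matrix (not the TESS CREB matrix).
example = [
    {"a": 0, "c": 0, "g": 0, "t": 20},    # fully conserved
    {"a": 10, "c": 0, "g": 10, "t": 0},   # two equally likely bases
    {"a": 5, "c": 5, "g": 5, "t": 5},     # uninformative
]
info = matrix_information(example)
```

Fully conserved positions carry 2 bits, and positions with no base preference carry 0 bits.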

How are CRE sites distributed?

• Preliminary question to help identify constraints on active sites.

• Use grammar to query RefSeq transcripts aligned to UCSC GoldenPath release 2 of mouse genome and look for consensus sites or very good weight matrix hits.

CREB Grammar

/* rules for various qualities of binding sites in an upstream region */
ConsensusCrebs -[Genes::RefGeneGroup]-> tgacgtca;
VeryGoodCrebs -[Genes::RefGeneGroup]-> TF::CREB[Score<2.1];
GoodCrebs -[Genes::RefGeneGroup]-> TF::CREB[Score<3.591];
MostCrebs -[Genes::RefGeneGroup]-> TF::CREB[Score<5.891];

/* Put RefSeq alignments from DAS server in Genes stream. */
Genes <--- DAS --Types='refGene' --Anchor='upstream' --UpstreamFlank=-2500 --DownstreamFlank=500;

/* Put weight matrix predictions on the fly into TF stream. */
TF <--- WMS --LdMax=6 --File='Data/TESS/WMS/all.wm';

Run parser with this command:

FlatPat --Grammar Creb.fp \
  --DasServer http://genome.cse.ucsc.edu/cgi-bin/das/hg11 \
  > creb.xml

Distribution of CREB Sites


• Is not explained by first- or second-order Markov models or by gaps in the sequence.

• Confirms and extends initial knowledge.

• Similar results obtained in human.

• Suggests positional cutoff and confidence in predicted sites.

• Suggests total number of regulated genes at 1000-1500 (based on human data).

Why Are Genes Expressed in Muscle?

• At least five factors are known to be relevant: Myf (MyoD), MEF2, Sp1, SRF, and TEF.

• Data from Wasserman and Fickett (1998) indicate that not all factors are required simultaneously.

• Consider patterns of subsets of sites.

• First check the overall distribution of sites in RefSeq upstream regions.

• Examined the predictive ability of two collections of factors:

– SRF, Sp1, and Myf

– TEF and Sp1

TEF and Sp1

• TEF is not restricted to muscle

• Binds to GGAATC consensus.

• Used weight matrix built with Wasserman and Fickett data.

• Often occurs near Sp1 in the training set.

MuscleF_Set -{50,GoldenPath::RefGeneGroup}-> TF::TEF,TF::SP1;

Distribution of Muscle Factor Sites

SRF, Sp1, and MyoD

Have been shown to interact in the human cardiac alpha-actin gene, centered at about -50 bp and spanning 66 bp (Biesiada et al., Mol. Cell. Biol. 19(4), 1999).

/* expanded length, no spacing constraints, same orientation */

Hca_List -[150,GoldenPath::RefGeneGroup]-> TF::SRF[Sense = 1], TF::SP1[Sense = 1], TF::MYF[Sense = 1];

/* expanded length, no spacing constraints, orientation unconstrained*/

Hca_List_Unoriented -[150,GoldenPath::RefGeneGroup]-> TF::SRF, TF::SP1, TF::MYF;

/* expanded length, order and orientation not constrained */

Hca_Set -{150,GoldenPath::RefGeneGroup}-> TF::SRF, TF::SP1, TF::MYF;

[Figure: SRF, Sp1, and MyoD sites.]

Production Terms

• Literal match

– lowercase letters or quoted text

– e.g. gata or "gata"

• Gap

– A period with optional length bounds

– e.g. ., .#5, .#{20,30}

• Annotation

– Stream::Name[attr rel value, …]

– e.g. BindingSites::Sp1[score<2.1,sense=+1]

• Numeric Comparison

– Stream:: rel value

– e.g. CG:: > 0.6

• Position

– @position or @@position for relative or absolute position

– e.g. @1 for start of bounding interval.
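Literal and gap terms map naturally onto regular expressions. The sketch below translates a sequence of such terms into a Python `re` pattern. It covers only literals and gaps, it is not how the TESS-II parser works internally, and it assumes that a bare `.` means a gap of any length, which is not spelled out above.

```python
import re

def term_to_regex(term):
    """Translate one literal or gap term into a regex fragment."""
    if term == ".":
        return ".*?"                      # assumption: unbounded gap
    fixed = re.fullmatch(r"\.#(\d+)", term)
    if fixed:
        return ".{%s}" % fixed.group(1)   # gap of exactly n characters
    bounded = re.fullmatch(r"\.#\{(\d+),(\d+)\}", term)
    if bounded:
        return ".{%s,%s}" % bounded.groups()
    return re.escape(term.strip('"'))     # literal, optionally quoted

def compile_terms(terms):
    """Compile a list of literal and gap terms into one regex."""
    return re.compile("".join(term_to_regex(t) for t in terms))

# A gata literal, a gap of 2 to 3 bases, then the CRE consensus.
pattern = compile_terms(["gata", ".#{2,3}", "tgacgtca"])
match = pattern.search("ccgataaatgacgtcacc")
```

Annotation, numeric comparison, and position terms need information from other streams and cannot be expressed as a regular expression over the character stream alone.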

List Productions

• Gaps between terms are implied.

• Numeric or term bounds keep the match from expanding.

A -()-> B, C, D;
P -[]-> Q, R, S;

[Figure: A spanning B C D, and P spanning Q R S.]
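One way to picture a list production is as a search for site hits that occur in the stated order, with arbitrary gaps, inside a bounded span. The Python sketch below does this over a list of `(name, start, end)` hits. The hit format, the `max_span` interpretation of a numeric bound, and the rule that terms may not overlap are assumptions made for illustration.

```python
def match_list(hits, names, max_span):
    """Return every chain of hits matching `names` in order within `max_span`.

    hits: iterable of (name, start, end) tuples with end exclusive.
    """
    hits = list(hits)
    results = []

    def extend(chain, remaining):
        if not remaining:
            results.append(tuple(chain))
            return
        for hit in hits:
            name, start, end = hit
            if name != remaining[0]:
                continue
            if chain and start < chain[-1][2]:
                continue                      # must begin after the previous term
            first_start = chain[0][1] if chain else start
            if end - first_start > max_span:
                continue                      # numeric bound stops the expansion
            extend(chain + [hit], remaining[1:])

    extend([], list(names))
    return results

hits = [("SP1", 5, 9), ("SRF", 10, 20), ("SP1", 30, 38), ("MYF", 50, 56)]
ordered = match_list(hits, ["SRF", "SP1", "MYF"], 150)
```

Here the Sp1 hit at position 5 is ignored because it lies before the SRF hit, so only one chain is found.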

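Set Productions

• Gaps between terms are implied.

• Terms must appear at least once, but order is not specified.

• Numeric or term bounds keep the match from expanding.

P -[]-> Q, R, S;
P -{}-> Q, R, S;

[Figure: P matching Q R S in order (list production), and P matching R S Q Q in any order (set production).]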
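A set production can be sketched as a window test: starting from each candidate hit, collect the hits that fit within the bound and check that every required term is present, in any order. The code below is a simplified Python illustration with an assumed `(name, start, end)` hit format and an assumed numeric bound `max_span`; it is not the TESS-II algorithm.

```python
def match_set(hits, names, max_span):
    """Return (start, end) windows in which every name occurs at least once."""
    required = set(names)
    ordered = sorted(hits, key=lambda h: h[1])
    matches = []
    for i, (name, start, _end) in enumerate(ordered):
        if name not in required:
            continue
        # Hits to the right of this one that still fit inside the bound.
        window = [h for h in ordered[i:]
                  if h[0] in required and h[2] - start <= max_span]
        if required <= {h[0] for h in window}:
            matches.append((start, max(h[2] for h in window)))
    return matches

hits = [("SP1", 5, 9), ("TEF", 20, 26), ("SP1", 200, 204)]
found = match_set(hits, ["TEF", "SP1"], 50)
```

The terms are listed as TEF then Sp1, but the match is found with Sp1 first, which is the difference from a list production.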

Bag Productions

• Gaps between terms are implied.

• Terms must appear at least as many times as specified, but order is not specified.

• Numeric or term bounds keep the match from expanding.

P -<>-> Q:2, R:2, S:1;

[Figure: P matching Q R S Q R.]
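A bag production differs from a set production only in requiring a minimum count per term, so a multiset comparison is a natural way to sketch it. The Python example below uses `collections.Counter`; as with the other sketches, the hit format and the `max_span` bound are assumptions for illustration.

```python
from collections import Counter

def match_bag(hits, required, max_span):
    """Return (start, end) windows containing each name at least `required[name]` times."""
    ordered = sorted(hits, key=lambda h: h[1])
    matches = []
    for i, (name, start, _end) in enumerate(ordered):
        if name not in required:
            continue
        window = [h for h in ordered[i:]
                  if h[0] in required and h[2] - start <= max_span]
        counts = Counter(h[0] for h in window)
        if all(counts[n] >= k for n, k in required.items()):
            matches.append((start, max(h[2] for h in window)))
    return matches

# Mirrors P -<>-> Q:2, R:2, S:1;
required = {"Q": 2, "R": 2, "S": 1}
hits = [("Q", 0, 5), ("R", 10, 15), ("Q", 20, 25), ("S", 30, 35), ("R", 40, 45)]
found = match_bag(hits, required, 100)
```

Removing either R hit from `hits` leaves the count for R below 2, and no match is reported.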

GUS.TESS Schema

Schema diagram elements, grouped by table:

• TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar

• TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity

• TESS.Moiety: Moiety, MoietyMultimer, MoietyHeterodimer, MoietyComplex

• TESS.FootprintInstance

• DoTS.NaFeature: BindingSite, Promoter, . . .

• DoTS.NaSequence

• TESS.TrainingSet

• TESS.ParameterGroup

• TESS.Note

Sites, weight matrices, grammars, training data, and parses will be stored in GUS30. Will initialize with TRANSFAC and COMPEL.

Integration with GUS

• GUS contains schema and data for genomic sequence and RNA expression.

• Goal is to store models and instances of known and predicted regulatory regions for specific tissues.

Future Work

• Development and evaluation of more patterns.

• Experimental validation of predictions.

• Expansion of parser to recursive productions.

• Inclusion of comparative species analysis.

Related Posters

• 146A. The Genomics Unified Schema (GUS).

• 114A. Web-Based Biological Discovery using an Integrated Database.

• 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?
