TESS-II Describing and Finding Gene Regulatory Sequences with Grammars. Jonathan Schug and Christian J. Stoeckert, Jr. Center for Bioinformatics at the University of Pennsylvania . jschug,[email protected]


Page 1: TESS-II

TESS-II

Describing and Finding Gene Regulatory Sequences with Grammars.

Jonathan Schug and Christian J. Stoeckert, Jr.

Center for Bioinformatics at the University of Pennsylvania.

jschug,[email protected]

Page 2: TESS-II

Background:

Page 3: TESS-II

Example 1: Using Bounds and Annotation

Page 4: TESS-II

Example 2 : Using Collection Productions

Page 5: TESS-II

Integration with GUS:

Page 6: TESS-II

Conclusion:

Page 7: TESS-II

Problem: Too many binding sites

• The typical search with TESS yields about 1 binding site start per base and a good match every few bases!

• Which ones are active?

Page 8: TESS-II

Combining Signals and Sites

• Want to consider:

  – Pairs or triples (or more) of binding sites,

  – Sites in particular contexts, e.g., CpG islands, introns, regions of homology,

  – Physico-chemical properties of DNA sequence

Page 9: TESS-II

Why Grammars?

• Grammars provide a means of describing complex reusable structures and combining them in novel ways.

• Analysis of some promoters suggests a modular design.

• Grammars are a way of moving beyond dimers to structured descriptions of promoters, enhancers, and entire genes.

Page 10: TESS-II

Grammars

• Alphabet – a finite set of letter symbols.

• String – a finite sequence of letters from an alphabet.

• Language – a set of strings over an alphabet.

• Grammar – a formal description of a language.

Using the alphabet of DNA, we want to find a grammar for the language of promoters of genes expressed in liver.

Page 11: TESS-II

Example Grammar

• A simple example for a small portion of English.

• A context-free grammar has four components:

  – Alphabet = {a,b,c,…,z,space}

  – Symbols = {S,Vp,Av,V,Np,D,Aj,N}

  – Start symbol = S

  – Productions

    S -> Np Vp;
    Np -> D Aj N;
    Vp -> Av V Np | Av V;
    D -> a | the | his | her;
    Aj -> happy | silver | new | ;
    N -> boy | girl | bicycle | guitar;
    Av -> slowly | quietly | ;
    V -> rides | strums;

Page 12: TESS-II

Derivation Trees

The happy boy slowly rides a silver bicycle

S
  Np
    D Aj N
  Vp
    Av V Np
      D Aj N

• Record the productions used in generating or parsing a string.

• Yield a structured description of the string.
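The idea of a derivation tree can be sketched in a few lines of Python. This is only an illustration of the toy English grammar from the previous slide, not the TESS-II parser: the dictionary layout, the function names (`parse`, `expand`, `derive`), and the backtracking strategy are all assumptions made for the example. An empty list stands for the empty alternative in `Aj` and `Av`.

```python
# Productions of the toy English grammar; [] is the empty alternative.
GRAMMAR = {
    "S":  [["Np", "Vp"]],
    "Np": [["D", "Aj", "N"]],
    "Vp": [["Av", "V", "Np"], ["Av", "V"]],
    "D":  [["a"], ["the"], ["his"], ["her"]],
    "Aj": [["happy"], ["silver"], ["new"], []],
    "N":  [["boy"], ["girl"], ["bicycle"], ["guitar"]],
    "Av": [["slowly"], ["quietly"], []],
    "V":  [["rides"], ["strums"]],
}


def parse(symbol, words, pos):
    """Yield (tree, next_pos) for every way `symbol` can match words[pos:]."""
    if symbol not in GRAMMAR:
        # Terminal: must equal the word at the current position.
        if pos < len(words) and words[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in expand(rhs, words, pos):
            # The tree node records which production was used.
            yield (symbol, children), end


def expand(rhs, words, pos):
    """Yield (list of subtrees, next_pos) for a right-hand side."""
    if not rhs:
        yield [], pos
        return
    for tree, mid in parse(rhs[0], words, pos):
        for rest, end in expand(rhs[1:], words, mid):
            yield [tree] + rest, end


def derive(sentence):
    """Return the first derivation tree covering the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None


if __name__ == "__main__":
    print(derive("The happy boy slowly rides a silver bicycle"))
```

The returned nested tuples mirror the tree on the slide: the root `S` has the children `Np` and `Vp`, and the second `Np` hangs under `Vp`.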

Page 13: TESS-II

Parser Data Flow

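[Figure: parser data flow. Inputs to the parser: grammar rules; a main stream read from GUS, DAS, or a flat file (FASTA); other streams supplied by GUS, DAS, or processing of the main stream. Output: matches (XML). Steps are numbered 1 to 5.]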

Page 14: TESS-II

1. Parser loads grammar from flat file.

2. Main sequence is extracted from a file, GUS database, or DAS server via a plugin.

3. Sequence is handed to secondary plugins to populate their streams.

4. Grammar rules are applied.

5. Matching instances are output in XML.

Page 15: TESS-II

Data Streams

Main character stream: e.g. DNA or AA

acgtagtccgcgcgagcgttagcgagataggcagaatatagca

Annotation stream: Genes, Repeats, Homology, etc.

Gene, Transcript, 5’UTR, Exon, Intron, 3’UTR

Real value stream: e.g. CG-content

0.31 0.33 0.41 0.64 0.60 0.58 0.51 0.44 0.38

Patterns can be specified in terms of characters, real values, or structured annotation.
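A real value stream such as CG-content can be pictured as one number per window along the main character stream. The sketch below is an assumption about how such values might be produced; the slide does not state the window size or step, so `window` and `step` are hypothetical parameters, as are the function names.

```python
def cg_content_stream(seq, window=10, step=5):
    """Return (start, fraction of C or G) for each window along seq."""
    seq = seq.lower()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        cg = sum(base in "cg" for base in chunk)
        values.append((start, cg / window))
    return values


def windows_above(values, threshold):
    """Keep the windows whose value exceeds the threshold (cf. CG:: > 0.6)."""
    return [(start, value) for start, value in values if value > threshold]


if __name__ == "__main__":
    dna = "acgtagtccgcgcgagcgttagcgagataggcagaatatagca"
    stream = cg_content_stream(dna)
    print(stream)
    print(windows_above(stream, 0.6))
```

A grammar term that compares a stream against a number then reduces to filtering these windows.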

Page 16: TESS-II

Annotation Objects

• Attributes for stream annotation objects.

• All can be used to select annotation in grammar.

StreamName::NameValue[name rel value,…]

For example, Gene::Intron[index=1] selects the first intron.
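Attribute-based selection of this kind can be sketched as a filter over annotation objects. The `Annotation` class, the `select` function, and the set of supported relations are hypothetical and serve only to illustrate the selector form shown above; they are not the TESS-II data model.

```python
import operator
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A hypothetical annotation object living in a named stream."""
    stream: str
    name: str
    start: int
    end: int
    attrs: dict = field(default_factory=dict)


# Relations allowed inside the brackets (an assumed subset).
RELATIONS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def select(annotations, stream, name, constraints=()):
    """Return annotations of stream::name meeting every (attr, rel, value)."""
    hits = []
    for ann in annotations:
        if ann.stream != stream or ann.name != name:
            continue
        if all(
            attr in ann.attrs and RELATIONS[rel](ann.attrs[attr], value)
            for attr, rel, value in constraints
        ):
            hits.append(ann)
    return hits


if __name__ == "__main__":
    annotations = [
        Annotation("Gene", "Exon", 0, 120, {"index": 1}),
        Annotation("Gene", "Intron", 121, 900, {"index": 1}),
        Annotation("Gene", "Intron", 1050, 1700, {"index": 2}),
    ]
    # Analogue of Gene::Intron[index=1]
    print(select(annotations, "Gene", "Intron", [("index", "=", 1)]))
```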

Page 17: TESS-II

Bounding the Size of Matches

A -(P.R)-> B C D; /* bounded subparse */

P ---> Q R S; /* defines context */

[Figure: example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc, showing A (expanded as B C D) parsed inside the context P (expanded as Q R S).]

Expansion of a production is bounded by some interval, either another production or some annotation.

Page 18: TESS-II

What Genes Are Regulated by CREB?

Collaboration with M. Mackiewicz and A. Pack, Center for Sleep and Respiratory Neurobiology, UPenn.

• CREB (cyclic AMP response element binding protein)

• Binds as a dimer to the CRE site with full consensus TGACGTCA or half site CGTCA.

• Member of a family including CREB, CREM, and ATF-1.

• CREB and CREM have several isoforms which may be (conditional) activators, inhibitors, or inducible repressors.

• Inducible by a wide variety of signals including cAMP, Ca, growth factors, hypoxia, survival signals, and UV light.

• Mutations affect a variety of cell types, with phenotypes including circadian rhythm and learning defects.

• Thought to bind close to the TSS because it interacts with the transcription complex via CBP.
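Locating the full and half CRE consensus sites named above amounts to a string search on both strands. The following is a minimal sketch under that reading; the function names and the 0-based coordinate convention are assumptions, and it does exact matching only, with no weight matrix.

```python
import re

FULL_SITE = "TGACGTCA"  # full CRE consensus (its own reverse complement)
HALF_SITE = "CGTCA"     # CRE half site

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(seq, motif):
    """Return sorted (start, strand) pairs for exact motif hits, 0-based."""
    seq = seq.upper()
    patterns = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:
        # A palindromic motif would otherwise be reported twice.
        patterns.append(("-", rc))
    hits = []
    for strand, pattern in patterns:
        # The lookahead lets overlapping occurrences be reported.
        for match in re.finditer("(?=" + pattern + ")", seq):
            hits.append((match.start(), strand))
    return sorted(hits)


if __name__ == "__main__":
    upstream = "aaaTGACGTCAaaa"
    print(find_sites(upstream, FULL_SITE))
    print(find_sites(upstream, HALF_SITE))
```

Because the full site is palindromic, it contains one half site on each strand, which is why a single full site produces two half-site hits.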

Page 19: TESS-II

Activation of CREB

(Nature Reviews: Molecular Cell Biology August 2001)

CREB Family Members

Page 20: TESS-II

TESS Weight Matrix for CREB

Information content of weight matrix.

Accuracy curves are used to pick thresholds with estimated sensitivity and specificity.
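For readers unfamiliar with the term, the information content of a weight matrix column is commonly computed as 2 bits minus the Shannon entropy of the base frequencies in that column, assuming a uniform background. The sketch below uses that common definition; it is not confirmed to be the formula TESS uses, and the counts in the usage example are made up for illustration, not the CREB matrix.

```python
import math


def column_information(counts, pseudocount=0.0):
    """Information content in bits of one column, uniform background assumed."""
    total = sum(counts.get(base, 0) for base in "ACGT") + 4 * pseudocount
    info = 2.0  # maximum for a four-letter alphabet
    for base in "ACGT":
        p = (counts.get(base, 0) + pseudocount) / total
        if p > 0:
            info += p * math.log2(p)
    return info


def matrix_information(columns, pseudocount=0.0):
    """Per-column information content for a list of count dictionaries."""
    return [column_information(col, pseudocount) for col in columns]


if __name__ == "__main__":
    # Hypothetical counts for a three-column matrix.
    example = [
        {"A": 10},
        {"A": 5, "C": 5},
        {"A": 3, "C": 3, "G": 3, "T": 3},
    ]
    print(matrix_information(example))
```

Columns that always show the same base carry 2 bits, and columns with no preference carry none, so a plot of these values highlights the positions that matter for binding.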

Page 21: TESS-II

How are CRE sites distributed?

• Preliminary question to help identify constraints on active sites.

• Use grammar to query RefSeq transcripts aligned to UCSC GoldenPath release 2 of mouse genome and look for consensus sites or very good weight matrix hits.

Page 22: TESS-II

CREB Grammar

/* Rules for various qualities of binding sites in an upstream region. */
ConsensusCrebs -[Genes::RefGeneGroup]-> tgacgtca;
VeryGoodCrebs  -[Genes::RefGeneGroup]-> TF::CREB[Score<2.1];
GoodCrebs      -[Genes::RefGeneGroup]-> TF::CREB[Score<3.591];
MostCrebs      -[Genes::RefGeneGroup]-> TF::CREB[Score<5.891];

/* Put RefSeq alignments from DAS server in Genes stream. */
Genes <--- DAS --Types='refGene' --Anchor='upstream' --UpstreamFlank=-2500 --DownstreamFlank=500;

/* Put weight matrix predictions on the fly into TF stream. */
TF <--- WMS --LdMax=6 --File='Data/TESS/WMS/all.wm';

Run parser with this command:

FlatPat --Grammar Creb.fp \
    --DasServer http://genome.cse.ucsc.edu/cgi-bin/das/hg11 \
    > creb.xml

Page 23: TESS-II

Distribution of CREB Sites

Page 24: TESS-II

Distribution of CREB Sites

• Is not explained by first- or second-order Markov models or by gaps in the sequence.

• Confirms and extends initial knowledge.

• Similar results obtained in human.

• Suggests positional cutoff and confidence in predicted sites.

• Suggests total number of regulated genes at 1000-1500 (based on human data).

Page 25: TESS-II

Why Are Genes Expressed in Muscle?

• At least five factors are known to be relevant, Myf(MyoD), MEF2, Sp1, SRF, and TEF.

• Data from Wasserman and Fickett (1998) indicates that not all factors are required simultaneously.

• Consider patterns of subsets of sites.

• First check overall distribution of sites in RefSeq upstream regions.

• Examined predictive ability of two collections of factors:

  – SRF, Sp1, and Myf

  – TEF and Sp1

Page 26: TESS-II

TEF and Sp1

• TEF is not restricted to muscle

• Binds to GGAATC consensus.

• Used weight matrix built with Wasserman and Fickett data.

• Often occurs near Sp1 in the training set.

MuscleF_Set -{50,GoldenPath::RefGeneGroup}-> TF::TEF,TF::SP1;

Page 27: TESS-II

Distribution of Muscle Factor Sites

Page 28: TESS-II

SRF, Sp1, and MyoD

Have been shown to interact in the human cardiac alpha-actin gene, centered at about -50 bp and spanning 66 bp (Biesiada et al., MolCelBio 19(4) 1999).

/* expanded length, no spacing constraints, same orientation */

Hca_List -[150,GoldenPath::RefGeneGroup]-> TF::SRF[Sense = 1], TF::SP1[Sense = 1], TF::MYF[Sense = 1];

/* expanded length, no spacing constraints, orientation unconstrained*/

Hca_List_Unoriented -[150,GoldenPath::RefGeneGroup]-> TF::SRF, TF::SP1, TF::MYF;

/* expanded length, order and orientation not constrained */

Hca_Set -{150,GoldenPath::RefGeneGroup}-> TF::SRF, TF::SP1, TF::MYF;

SRF Sp1 MyoD

Page 29: TESS-II

Production Terms

• Literal match

  – lowercase letters or quoted text

  – e.g. gata or "gata"

• Gap

  – A period with optional length bounds

  – e.g. ., .#5, .#{20,30}

• Annotation

  – Stream::Name[attr rel value, …]

  – e.g. BindingSites::Sp1[score<2.1,sense=+1]

• Numeric Comparison

  – Stream:: rel value

  – e.g. CG:: > 0.6

• Position

  – @position or @@position for relative or absolute position

  – e.g. @1 for start of bounding interval.

Page 30: TESS-II

List Productions

• Gaps between terms are implied.

• Numeric or term bounds keep the match from expanding.

A -()-> B, C, D;
P -[]-> Q, R, S;

[Figure: example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc, with A spanning B, C, D and P spanning Q, R, S, each in the listed order.]

Page 31: TESS-II

Set Productions

• Gaps between terms are implied.

• Terms must appear at least once but order is not specified.

• Numeric or term bounds keep the match from expanding.

P -[]-> Q, R, S;
P -{}-> Q, R, S;

[Figure: on the example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc, P matches sites in the order Q, R, S, and P also matches sites in the order R, S, Q, Q.]

Page 32: TESS-II

Bag Productions

• Gaps between terms are implied.

• Terms must appear at least as many times as specified but order is not specified.

• Numeric or term bounds keep the match from expanding.

P -<>-> Q:2, R:2, S:1;

[Figure: example sequence acgtgcatgactagcatcagcatagcatcagcatcagcatgatcgagatc, with P spanning sites Q, R, S and a further Q and R.]
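The difference between list, set, and bag productions can be summarized with three small predicates. This is a sketch of the matching rules as the slides describe them, assuming the sites have already been restricted to the bounding interval; the `(name, start)` site representation and the function names are hypothetical, and overlap between sites is ignored.

```python
from collections import Counter


def matches_list(sites, terms):
    """List: terms occur in the given order, with gaps allowed between them."""
    ordered = iter(name for name, _ in sorted(sites, key=lambda s: s[1]))
    # Each `in` consumes the iterator, so this is a subsequence test.
    return all(term in ordered for term in terms)


def matches_set(sites, terms):
    """Set: every term occurs at least once, in any order."""
    present = {name for name, _ in sites}
    return all(term in present for term in terms)


def matches_bag(sites, required):
    """Bag: every term occurs at least the required number of times."""
    counts = Counter(name for name, _ in sites)
    return all(counts[term] >= n for term, n in required.items())


if __name__ == "__main__":
    in_order = [("Q", 5), ("R", 12), ("S", 30)]
    shuffled = [("R", 3), ("S", 9), ("Q", 20), ("Q", 25)]
    print(matches_list(in_order, ["Q", "R", "S"]))
    print(matches_list(shuffled, ["Q", "R", "S"]))
    print(matches_set(shuffled, ["Q", "R", "S"]))
    print(matches_bag(shuffled, {"Q": 2, "R": 2, "S": 1}))
```

Each predicate is strictly more permissive about order than the one before it, while the bag adds multiplicity requirements that the set does not have.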

Page 33: TESS-II

GUS.TESS Schema

[Figure: GUS.TESS schema diagram. Labels shown, grouped by name prefix:]

TESS.Model: ModelString, ModelConsensusString, ModelPositionalWeightMatrix, ModelGrammar

TESS.Activity: ActivityProteinDnaBinding, ActivityTissueSpecificity

TESS.Moiety: Moiety, MoietyMultimer, MoietyHeterodimer, MoietyComplex

TESS.FootprintInstance

DoTS.NaFeature: BindingSite, Promoter, . . .

DoTS.NaSequence

TESS.TrainingSet

TESS.ParameterGroup

TESS.Note

Sites, weight matrices, grammars, training data, and parses will be stored in GUS30. Will initialize with TRANSFAC and COMPEL.

Page 34: TESS-II

Integration with GUS

• GUS contains schema and data for genomic sequence and RNA expression.

• Goal is to store models and instances of known and predicted regulatory regions for specific tissues.

Page 35: TESS-II

Future Work

• Development and evaluation of more patterns.

• Experimental validation of predictions.

• Expansion of parser to recursive productions.

• Inclusion of comparative species analysis.

Page 36: TESS-II

Related Posters

• 146A. The Genomics Unified Schema (GUS).

• 114A. Web-Based Biological Discovery using an Integrated Database.

• 148A. Integrating Eukaryotic Genomes by Orthologous Groups: What is Unique about Apicomplexan Parasites?