DESIGNING TRAINING REGULATORY DATASETS Enrique Blanco Xavier Messeguer Roderic Guigó

OUR APPROACH

Enrique Blanco [http://genome.imim.es] - RegCreative Jamboree 2006

1. SEQUENCE AND FUNCTION

SIMILAR SEQUENCE

SIMILAR FUNCTION

Transthyretin NP_000362 (human) - NP_036813 (rat):

human  MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
rat    MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV
       *** **:*******:*.***** *:********************::***

human  HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
rat    KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK
       :**:::**.:*********:**********:*:*.**:*:**:*******

human  ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
rat    ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN
       *********:*********** *:******************:**::
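The degree of similarity in the transthyretin alignment can be quantified as percent identity. The sketch below is only an illustration and is not part of the original slides: the function name `percent_identity` is hypothetical, gapped columns are skipped by assumption, and only the first 50 aligned columns are used.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned columns holding the same residue in both rows.

    Columns where either row has a gap ('-') are skipped. This is one common
    convention (an assumption here); others divide by the full alignment length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# First 50 aligned columns of the human/rat transthyretin alignment (no gaps).
HUMAN = "MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV"
RAT = "MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV"

if __name__ == "__main__":
    print(f"{percent_identity(HUMAN, RAT):.1f}% identity")
```

Because the alignment has no gaps in these columns, the value corresponds to the proportion of `*` symbols in the conservation line.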

2. FUNCTION AND SEQUENCE

SIMILAR SEQUENCE

SIMILAR FUNCTION

The ThiL gene (S. typhimurium), encoding thiamin phosphate kinase, can be displaced by (is functionally equivalent to) THI80 (S. cerevisiae), which encodes thiamin pyrophosphokinase.

Comparison of the known structure of THI80 with the structure of ThiL reveals different folds. Thus, two different folds might catalyze the same reaction.

Systematic discovery of analogous enzymes in Thiamin biosynthesis. Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel, Bork. Nature Biotechnology 21, 790 - 795 (2003).

ThiL   MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
THI80  --MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI
          .* .:      :: :. :::.. :. .:    *  *:.** * .:.:

ThiL   VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
THI80  DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD
          . :: .*   .    .     . *     *   * :   **:

ThiL   LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
THI80  LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI
       *:::*:.::   .  .: :*   *  .  ::  :.: .    .  ::

ThiL   AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
THI80  SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD
       :: .:     *   :..      .: : * .*: *    *       ::

ThiL   PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
THI80  QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC
        * : :      .  :: :.*.    :** .::* .*   . * :   ..

ThiL   DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
THI80  IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF
        .    :  *:*     : * .:::    ..:  * :. :   :*

ThiL   IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
THI80  IDTKDDIILNVEIFVDKLIDFL-----------
       *.  .  * .:::. * :   :

3. FUNCTION AND SEQUENCE (TFBSs)

HNF1-binding sites (human)

------------AGTTAATCATTGGCC---------
-------------GTTAATTATTGGCAAATGTCCC-
-------GTATGGGTTACTTATTCTCTCTTTGTTGA
------------GGTTAAGACTCTAAT---------
-------AGTCTAGTTAATAATCTACAATT------
---------TGAGATTAATA----------------
---------AATGATTAAAA----------------
-------------GTCAAACATTAAC----------
----------CCGATTAACCATTAACCCCCACCCC-
-------------GTTAATCAGAAAA----------
GGATGTATGTAGAATTACATAAGAA-----------
-------------CTTACTCAATAAC----------
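The variability among the aligned HNF1 sites can be summarised column by column. The following sketch is an illustration only, not the authors' procedure: `column_counts` and `consensus` are hypothetical helpers, the `min_sites` threshold is an arbitrary assumption, and each site is stored with the number of leading gap characters read from the alignment (total width 36).

```python
from collections import Counter

ALIGNMENT_WIDTH = 36  # width of the alignment on the slide

# (leading gaps, site) pairs read from the aligned human HNF1 sites.
HNF1_SITES = [
    (12, "AGTTAATCATTGGCC"),
    (13, "GTTAATTATTGGCAAATGTCCC"),
    (7, "GTATGGGTTACTTATTCTCTCTTTGTTGA"),
    (12, "GGTTAAGACTCTAAT"),
    (7, "AGTCTAGTTAATAATCTACAATT"),
    (9, "TGAGATTAATA"),
    (9, "AATGATTAAAA"),
    (13, "GTCAAACATTAAC"),
    (10, "CCGATTAACCATTAACCCCCACCCC"),
    (13, "GTTAATCAGAAAA"),
    (0, "GGATGTATGTAGAATTACATAAGAA"),
    (13, "CTTACTCAATAAC"),
]


def padded_rows(sites, width=ALIGNMENT_WIDTH):
    """Rebuild the gapped alignment rows from (offset, site) pairs."""
    return [("-" * offset + site).ljust(width, "-") for offset, site in sites]


def column_counts(rows):
    """One Counter of nucleotides per alignment column, ignoring gaps."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned rows must have the same width")
    return [Counter(base for base in column if base != "-")
            for column in zip(*rows)]


def consensus(counts, min_sites=6):
    """Most frequent base per column; '.' where fewer than min_sites rows
    contribute a nucleotide. Ties go to the base seen first."""
    return "".join(
        c.most_common(1)[0][0] if sum(c.values()) >= min_sites else "."
        for c in counts
    )


if __name__ == "__main__":
    counts = column_counts(padded_rows(HNF1_SITES))
    print(consensus(counts))
```

Columns where nearly all sites agree mark the conserved core of the motif, while the flanks show how little the remaining positions are constrained.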

SIMILAR SEQUENCE

SIMILAR FUNCTION

4. TF-MAPS: A NEW ALPHABET

5. TF-MAPS: A NEW FORM OF ALIGNMENT

MAP 1

MAP 2

We can align the TF-MAPS in this new alphabet:
• Mapping score
• Gaps
• Positional conservation
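One way to combine the three ingredients listed above is a local dynamic-programming alignment over the sites of two maps. The sketch below is a simplified illustration under stated assumptions, not the published algorithm: the `Site` type, the function `align_tf_maps`, the additive scoring form, and the parameters `alpha` (weight of the mapping scores), `lam` (gap penalty) and `mu` (penalty for differences in spacing) are all hypothetical choices. Both maps are assumed to be sorted by position.

```python
from typing import List, NamedTuple, Optional


class Site(NamedTuple):
    factor: str   # TF label: one symbol of the TF-map alphabet
    pos: int      # position of the predicted site in the promoter
    score: float  # mapping score of the predicted site


def align_tf_maps(map_a: List[Site], map_b: List[Site],
                  alpha: float = 1.0, lam: float = 0.1,
                  mu: float = 0.1) -> float:
    """Best local alignment score between two position-sorted TF-maps.

    Only sites with the same factor label can be matched. Each match earns
    alpha * (sum of both mapping scores); extending a previous match costs
    lam per skipped site and mu per nucleotide of difference in spacing.
    """
    best: List[List[Optional[float]]] = [[None] * len(map_b) for _ in map_a]
    top = 0.0
    for i, a in enumerate(map_a):
        for j, b in enumerate(map_b):
            if a.factor != b.factor:
                continue
            extend = 0.0  # 0.0 means: start a new local alignment here
            for k in range(i):
                for m in range(j):
                    prev = best[k][m]
                    if prev is None:
                        continue
                    gaps = (i - k - 1) + (j - m - 1)
                    shift = abs((a.pos - map_a[k].pos) - (b.pos - map_b[m].pos))
                    extend = max(extend, prev - lam * gaps - mu * shift)
            best[i][j] = alpha * (a.score + b.score) + extend
            top = max(top, best[i][j])
    return top


if __name__ == "__main__":
    map_1 = [Site("HNF1", 10, 5.0), Site("CEBP", 30, 4.0)]
    map_2 = [Site("HNF1", 12, 5.0), Site("CEBP", 32, 4.0)]
    print(align_tf_maps(map_1, map_2))
```

The nested loops make this sketch quartic in the number of sites, which is acceptable for short promoter maps but would need pruning for longer regions.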

6. TF-MAP ALIGNMENT in PROMOTER CHARACTERIZATION

TTR gene: ENSG00000118271

Pairwise TF-map alignments between TTR and the 83 genes of COREG(TTR) in cisRED

A.G. Robertson et al. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34:D68–D73, 2006.

TTR PROMOTER RECONSTRUCTION

7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY

The HRCZ set (36 genes)

SEQUENCE ALIGNMENT vs. TF-MAP ALIGNMENT

8. RESULTS

Are TF-map alignments a simple reflection of sequence conservation? NO

                  CLUSTALW               TF-MAP ALIGNMENT
Genomic region    TOP 1   Avg. Score     TOP 1   Avg. Score
Coding            27      3706.72        6       17.15
5’UTR             4       2671.78        2       10.48
PROMOTER          4       2005.67        18      25.41
3’UTR             1       1994.22        7       15.85
Intronic          0       1267.89        2       8.34
Downstream        0       1174.28        0       6.85
5’Intergenic      0       1052.92        0       5.42
3’Intergenic      0       974.69         1       4.14

DESIGN OF THE DATASET

9. PAIRWISE TF-MAP ALIGNMENT TRAINING

TRAINING: To systematically estimate the parameters that are globally optimal, in terms of real TFBS detection, on a set of well-annotated promoter pairs
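A systematic parameter estimation of this kind can be organised as an exhaustive search over a grid of candidate values. The sketch below is a generic illustration and an assumption about how such a search might look, not the procedure used by the authors: `grid_search` is a hypothetical helper, and `evaluate` stands for any user-supplied function that aligns the training promoter pairs with the given parameters and returns an accuracy value (higher is better).

```python
from itertools import product
from typing import Callable, Dict, Optional, Sequence, Tuple


def grid_search(evaluate: Callable[..., float],
                grid: Dict[str, Sequence[float]]
                ) -> Tuple[Optional[Dict[str, float]], float]:
    """Try every combination of parameter values and keep the best one.

    `grid` maps a parameter name to its candidate values; `evaluate` is
    called with one keyword argument per parameter.
    """
    names = list(grid)
    best_params: Optional[Dict[str, float]] = None
    best_value = float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        value = evaluate(**params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


if __name__ == "__main__":
    # Toy objective standing in for "accuracy on the training set".
    def toy_accuracy(alpha: float, lam: float) -> float:
        return -((alpha - 0.5) ** 2) - (lam - 0.1) ** 2

    print(grid_search(toy_accuracy, {"alpha": [0.0, 0.5, 1.0],
                                     "lam": [0.1, 0.2]}))
```

An exhaustive grid is only optimal with respect to the candidate values it contains, so the resolution of the grid bounds the quality of the estimate.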

Predictions obtained with the database TRANSFAC: V. Matys et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34: D108 - D110 (2006)

Plots with the program gff2ps: J. F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics, 16(8):743–744 (2000)

10. ACCURACY TESTS

[Figure: REAL pair of TFBS vs. TF-MAP ALIGNMENT; rows labelled H and M]

Levels:
• Nucleotide
• Site

Measures:
• Sensitivity [0,1]
• Specificity (PPV) [0,1]
• Correlation Coefficient [-1,1]
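At the nucleotide level, the three measures can be computed from the positions covered by the real and the predicted sites. The sketch below is a minimal illustration under assumptions: the helper names `covered` and `nucleotide_accuracy` are hypothetical, sites are given as inclusive `(start, end)` coordinates within a promoter of known length, and the correlation coefficient is taken to be the standard Matthews form.

```python
import math
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def covered(sites: Iterable[Interval]) -> Set[int]:
    """Set of nucleotide positions covered by at least one site."""
    return {p for start, end in sites for p in range(start, end + 1)}


def nucleotide_accuracy(real: Iterable[Interval],
                        predicted: Iterable[Interval],
                        length: int) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity as PPV, correlation coefficient)."""
    real_pos, pred_pos = covered(real), covered(predicted)
    tp = len(real_pos & pred_pos)   # real nucleotides that were predicted
    fp = len(pred_pos - real_pos)   # predicted but not real
    fn = len(real_pos - pred_pos)   # real but missed
    tn = length - tp - fp - fn      # neither real nor predicted
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    print(nucleotide_accuracy(real=[(2, 5)], predicted=[(4, 7)], length=10))
```

The site-level measures follow the same pattern, counting whole sites instead of nucleotides once a rule for matching a predicted site to a real one has been fixed.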

Coverage

A set of experimentally annotated promoters:
• The promoter sequences (mapping)
• Coordinates of the real TFBSs (alignment)
• TFBSs present in both promoters (alignment)

Human/Mouse orthologous genes

11. SOURCES OF INFORMATION

• General Regulatory Repositories

• Publications:
  - The datasets of other programs
  - Individual experimental works

FORMATS / QUALITY / AVAILABILITY / STABILITY

12. ABS: ANNOTATED BINDING SITES

13. MY OWN EXPERIENCE (1)

MANUAL DATA CURATION:

* FINDING THE PROMOTER SEQUENCES IN THE GENOME:
1. The original promoter entry does not exist (GenBank)
2. The gene has another name
3. The gene has not been annotated yet (RefSeq)
4. The promoter sequence does not match the current TSS (RefSeq)
5. The promoter sequence is not a promoter sequence (RefSeq)

* FINDING THE MOTIFS IN THE PROMOTER SEQUENCES:
1. The binding motif is not in the original promoter sequence
2. The motif is not at the coordinates where it was expected
3. The motif has changed slightly (a few nucleotides)
4. There are several motifs that could correspond to the real one
5. The relative position among the motifs of the same gene is wrong
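Several of the motif problems listed above (shifted coordinates, a few changed nucleotides, multiple candidate matches) can be screened for with an approximate search. The sketch below is only an illustration of that idea, not the curation procedure actually used: `relocate_motif` is a hypothetical helper, the mismatch tolerance is an arbitrary assumption, and the example sequence is made up.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def relocate_motif(promoter: str, motif: str, expected_start: int,
                   max_mismatches: int = 2) -> List[Tuple[int, int]]:
    """List (start, mismatches) for every window resembling the motif.

    Candidates are ordered by number of mismatches, then by distance from
    the annotated start, so the most plausible location comes first. More
    than one candidate signals an ambiguous annotation.
    """
    hits = []
    for start in range(len(promoter) - len(motif) + 1):
        window = promoter[start:start + len(motif)]
        mismatches = hamming(window, motif)
        if mismatches <= max_mismatches:
            hits.append((mismatches, abs(start - expected_start), start))
    return [(start, mismatches) for mismatches, _, start in sorted(hits)]


if __name__ == "__main__":
    # Made-up promoter fragment; the annotation places the motif at 5.
    promoter = "ACGTGTTAATCATTGGTTAACC"
    print(relocate_motif(promoter, "GTTAATCATT", expected_start=5))
```

An empty result corresponds to the case where the binding motif is not in the promoter sequence at all and the entry needs manual inspection.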

14. MY OWN EXPERIENCE (2)

* TF-MAPS AND ANNOTATIONS:
1. The mapping function is not defined for a given TF
2. The TFBS is not predicted by the mapping function in one of the orthologs

* MATCHING THE ALIGNMENTS AND THE ANNOTATIONS:
1. There are several mapping definitions that recognize the same motif

NEW CHALLENGES: DESIGN OF FUTURE DATASETS

15. NON-COLLINEAR CONSERVATION

16. SITES IN OTHER SPECIES

COLLAGENASE-3 GENE (MMP13) promoters kindly provided by Dr. López-Otín (Universidad de Oviedo)

17. ENCODE ChIP data

[Figure: TRANSFAC V$E2F1_Q3; 10,000 bp; mouse and human]

CONCLUSION

RESEARCH ON GENE REGULATION: DUAL PERSONALITY?

COMPUTER SCIENTIST (EXPERT 1)

BIOINFORMATICIAN (EXPERT 2)

EXPERIMENTALIST? (EXPERT 3)

RESEARCHER
