DESIGNING TRAINING REGULATORY DATASETS
Enrique Blanco, Xavier Messeguer, Roderic Guigó
Enrique Blanco [http://genome.imim.es] - RegCreative Jamboree 2006
1. SEQUENCE AND FUNCTION
SIMILAR SEQUENCE
SIMILAR FUNCTION
Transthyretin NP_000362 (human) - NP_036813 (rat):
NP_000362  MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
NP_036813  MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV
           *** **:*******:*.***** *:********************::***

NP_000362  HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
NP_036813  KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK
           :**:::**.:*********:**********:*:*.**:*:**:*******

NP_000362  ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
NP_036813  ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN
           *********:*********** *:******************:**::
2. FUNCTION AND SEQUENCE
SIMILAR SEQUENCE
SIMILAR FUNCTION
ThiL gene (S. typhimurium) encoding thiamin phosphate kinase can be displaced (functionally equivalent) by THI80 (S. cerevisiae), encoding thiamin pyrophosphokinase.
Comparison of the known structure of THI80 with the structure of ThiL reveals different folds. Thus, two different folds might catalyze the same reaction.
Systematic discovery of analogous enzymes in Thiamin biosynthesis. Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel, Bork. Nature Biotechnology 21, 790 - 795 (2003).
ThiL   MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
THI80  --MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI
          .* .:      :: :. :::.. :. .:    *  *:.** * .:.:

ThiL   VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
THI80  DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD
          . :: .*   .    .     . *     *   * :   **:

ThiL   LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
THI80  LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI
       *:::*:.::   .  .: :*   *  .  ::  :.: .    .  ::

ThiL   AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
THI80  SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD
       :: .:     *   :..      .: : * .*: *    *       ::

ThiL   PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
THI80  QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC
        * : :      .  :: :.*.    :** .::* .*   . * :   ..

ThiL   DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
THI80  IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF
        .    :  *:*     : * .:::    ..:  * :. :   :*

ThiL   IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
THI80  IDTKDDIILNVEIFVDKLIDFL-----------
       *.  .  * .:::. * :   :
3. FUNCTION AND SEQUENCE (TFBSs)
HNF1- binding sites (human)
------------AGTTAATCATTGGCC---------
-------------GTTAATTATTGGCAAATGTCCC-
-------GTATGGGTTACTTATTCTCTCTTTGTTGA
------------GGTTAAGACTCTAAT---------
-------AGTCTAGTTAATAATCTACAATT------
---------TGAGATTAATA----------------
---------AATGATTAAAA----------------
-------------GTCAAACATTAAC----------
----------CCGATTAACCATTAACCCCCACCCC-
-------------GTTAATCAGAAAA----------
GGATGTATGTAGAATTACATAAGAA-----------
-------------CTTACTCAATAAC----------
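One way to see how weakly these sites agree is to count bases column by column. The sketch below is only an illustration, not the method used in the talk: the function names `column_counts` and `consensus` are hypothetical, and it assumes the sites are already gap-padded to equal length with `-`, as on this slide.

```python
from collections import Counter

BASES = "ACGT"

def column_counts(sites):
    """Count A/C/G/T in every column of equal-length, gap-padded sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    counts = []
    for col in range(width):
        seen = Counter(site[col] for site in sites if site[col] in BASES)
        counts.append({base: seen.get(base, 0) for base in BASES})
    return counts

def consensus(counts, min_rows=1):
    """Most frequent base per column; '-' where fewer than min_rows sites cover it."""
    letters = []
    for column in counts:
        if sum(column.values()) < min_rows:
            letters.append("-")
        else:
            # Ties are broken by the order of BASES (an arbitrary choice).
            letters.append(max(BASES, key=lambda base: column[base]))
    return "".join(letters)

# Three of the HNF1 sites listed above, used only as sample input.
sample = [
    "------------AGTTAATCATTGGCC---------",
    "-------------GTTAATTATTGGCAAATGTCCC-",
    "-------GTATGGGTTACTTATTCTCTCTTTGTTGA",
]
print(consensus(column_counts(sample), min_rows=2))
```

Raising `min_rows` restricts the consensus to columns covered by several sites, which is where a short, degenerate motif like this one is best supported.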
SIMILAR SEQUENCE
SIMILAR FUNCTION
5. TF-MAPS: A NEW FORM OF ALIGNMENT
MAP 1
MAP 2
We can align the TF-MAPS in this new alphabet:
• Mapping score
• Gaps
• Positional conservation
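The three ingredients above can be combined in a small dynamic-programming sketch. It is an assumption-laden illustration rather than the authors' implementation: a TF-map is taken to be a position-sorted list of `(tf_label, position, score)` tuples, and the weights `alpha` (mapping score), `lam` (gaps, counted as skipped sites) and `mu` (positional conservation, the difference in spacing between consecutive matches) are hypothetical parameters with arbitrary default values.

```python
def align_tf_maps(map_a, map_b, alpha=1.0, lam=0.1, mu=0.01):
    """Best-scoring chain of same-TF matches between two TF-maps.

    Each map is a list of (tf_label, position, score) tuples sorted by
    position. Returns (score, [(index_in_a, index_in_b), ...]).
    """
    best = {}  # (i, j) -> best score of a chain ending at match (i, j)
    back = {}  # (i, j) -> previous match in that chain, or None
    for i, (tf_a, pos_a, score_a) in enumerate(map_a):
        for j, (tf_b, pos_b, score_b) in enumerate(map_b):
            if tf_a != tf_b:
                continue  # only sites of the same factor can be aligned
            match = alpha * (score_a + score_b)
            best[i, j], back[i, j] = match, None
            for (p, q), previous in list(best.items()):
                if p >= i or q >= j:
                    continue  # chains must advance in both maps
                gaps = (i - p - 1) + (j - q - 1)  # sites skipped in both maps
                shift = abs((pos_a - map_a[p][1]) - (pos_b - map_b[q][1]))
                candidate = previous + match - lam * gaps - mu * shift
                if candidate > best[i, j]:
                    best[i, j], back[i, j] = candidate, (p, q)
    if not best:
        return 0.0, []
    cell = max(best, key=best.get)
    score, pairs = best[cell], []
    while cell is not None:  # trace the chain back to its first match
        pairs.append(cell)
        cell = back[cell]
    return score, pairs[::-1]

# Made-up maps for illustration only.
map_1 = [("HNF1", 10, 5.0), ("CEBP", 40, 3.0), ("HNF4", 80, 4.0)]
map_2 = [("HNF1", 12, 4.0), ("SP1", 30, 9.0), ("HNF4", 85, 4.0)]
print(align_tf_maps(map_1, map_2))
```

This naive version inspects every earlier match for every cell, so it is quadratic in the number of matches; it is meant to show the scoring terms, not to be efficient.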
6. TF-MAP ALIGNMENT in PROMOTER CHARACTERIZATION
TTR gene: ENSG00000118271
Pairwise TF-map alignments between TTR and 83 COREG(TTR) in CISRED
A.G. Robertson et al. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34:D68–D73, 2006.
TTR PROMOTER RECONSTRUCTION
7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY
The HRCZ-set (36 genes)
SEQUENCE ALIGNMENT vs TF-MAP ALIGNMENT
8. RESULTS
Are TF-map alignments a simple reflection of sequence conservation? NO

                  CLUSTALW               TF-MAP ALIGNMENT
Genomic region    TOP 1   Avg. Score     TOP 1   Avg. Score
Coding            27      3706.72        6       17.15
5’UTR             4       2671.78        2       10.48
PROMOTER          4       2005.67        18      25.41
3’UTR             1       1994.22        7       15.85
Intronic          0       1267.89        2       8.34
Downstream        0       1174.28        0       6.85
5’Intergenic      0       1052.92        0       5.42
3’Intergenic      0       974.69         1       4.14
9. PAIRWISE TF-MAP ALIGNMENT TRAINING
TRAINING: To systematically estimate the parameters that are globally optimal in terms of real TFBS detection, using a set of well-annotated promoter pairs
Predictions obtained with the database TRANSFAC: V. Matys et al. TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34: D108 - D110 (2006)
Plots with the program gff2ps: J. F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics, 16(8):743–744 (2000)
10. ACCURACY TESTS
[Figure: a REAL pair of TFBSs in human (H) and mouse (M) compared with the TF-MAP ALIGNMENT]
Levels:
• Nucleotide
• Site
Measures:
• Sensitivity [0,1]
• Specificity (PPV) [0,1]
• Correlation Coefficient [-1,1]
Coverage
A set of experimentally annotated promoters:
• The promoter sequences (mapping)
• Coordinates of the real TFBSs (alignment)
• TFBSs present in both promoters (alignment)
Human/Mouse orthologous genes
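The three measures can be computed at the nucleotide level roughly as follows. This is a minimal sketch that assumes the usual gene-prediction definitions (sensitivity = TP/(TP+FN), specificity as PPV = TP/(TP+FP), and a Matthews-style correlation coefficient); the function name `nucleotide_accuracy` and the inclusive 0-based interval convention are hypothetical choices, not taken from the slides.

```python
import math

def nucleotide_accuracy(real, predicted, length):
    """Sensitivity, specificity (PPV) and correlation coefficient per nucleotide.

    real, predicted: iterables of (start, end) intervals, 0-based, inclusive.
    length: promoter length in nucleotides.
    """
    def covered(intervals):
        positions = set()
        for start, end in intervals:
            positions.update(range(start, end + 1))
        return positions

    real_pos, pred_pos = covered(real), covered(predicted)
    tp = len(real_pos & pred_pos)   # annotated and predicted
    fp = len(pred_pos - real_pos)   # predicted only
    fn = len(real_pos - pred_pos)   # annotated only
    tn = length - tp - fp - fn      # neither

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sensitivity, specificity, correlation

# Made-up coordinates: one real site and one half-overlapping prediction.
print(nucleotide_accuracy(real=[(10, 19)], predicted=[(15, 24)], length=100))
```

At the site level the same formulas apply once each predicted site is counted as a whole true or false positive, which requires choosing an overlap criterion.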
11. SOURCES OF INFORMATION
• General Regulatory Repositories
• Publications:
- The datasets of other programs
- Individual experimental works
FORMATS / QUALITY / AVAILABILITY / STABILITY
13. MY OWN EXPERIENCE (1)
MANUAL DATA CURATION:
* FINDING THE PROMOTER SEQUENCES IN THE GENOME:
1. The original promoter entry does not exist (GenBank)
2. The gene has another name
3. The gene has not been annotated yet (RefSeq)
4. The promoter sequence does not match the current TSS (RefSeq)
5. The promoter sequence is not a promoter sequence (RefSeq)
* FINDING THE MOTIFS IN THE PROMOTER SEQUENCES:
1. The binding motif is not in the original promoter sequence
2. The motif is not in the coordinates that it was expected to be
3. The motif has changed slightly (a few nucleotides)
4. There are several motifs that could correspond to the real one
5. The relative position among the motifs of the same gene is wrong
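Several of the motif problems above (wrong coordinates, a few changed nucleotides, more than one candidate) can be checked by rescanning the promoter for the annotated motif while tolerating mismatches. The helper below is a hypothetical sketch of that check, not a tool described in the talk; `find_motif` and `max_mismatches` are illustrative names, the sequences are made up, and only the given strand is scanned.

```python
def find_motif(promoter, motif, max_mismatches=2):
    """List (start, mismatches, window) for every near-match of motif.

    Positions are 0-based. Hits are sorted by number of mismatches and
    then by position, so the most likely location comes first.
    """
    promoter, motif = promoter.upper(), motif.upper()
    hits = []
    for start in range(len(promoter) - len(motif) + 1):
        window = promoter[start:start + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches, window))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))

# Made-up promoter fragment and annotated motif, for illustration only.
promoter = "CCGTTAATCATTGGAGTTACTTATT"
for hit in find_motif(promoter, "GTTAATTATT", max_mismatches=1):
    print(hit)
```

When more than one hit is returned with the same number of mismatches, the annotation is ambiguous (case 4 above) and the original publication still has to be consulted.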
14. MY OWN EXPERIENCE (2)
* TF-MAPS AND ANNOTATIONS:
1. The mapping function is not defined for a given TF
2. The TFBS is not predicted by the mapping function in one of the orthologs
* MATCHING THE ALIGNMENTS AND THE ANNOTATIONS:
1. There are several mapping definitions that recognize the same motif
16. SITES IN OTHER SPECIES
COLLAGENASE-3 GENE (MMP13) promoters kindly provided by Dr. López-Otín (Universidad de Oviedo)
17. ENCODE ChIP data
[Figure labels: TRANSFAC: V$E2F1_Q3; 10,000 bps; mouse; human]
RESEARCH ON GENE REGULATION: DUAL PERSONALITY?
COMPUTER SCIENTIST EXPERT 1
BIOINFORMATICIAN EXPERT 2
EXPERIMENTALIST? EXPERT 3
RESEARCHER