

Identification of Compositionally Similar Cis-element Clusters in Coordinately Regulated Genes

Anil G Jegga, Ashima Gupta, Andrew T Pinski, James W Carman, Bruce J Aronow

Cincinnati Children’s Hospital Medical Center, Cincinnati, OH-45229

Abstract: A single efficient method for deciphering the transcriptional control elements of higher eukaryotic genomes remains elusive. We have explored extending comparative genomics approaches to this problem using known TF binding sites. Starting with an earlier method for identifying conserved cis-elements contained within evolutionarily conserved genomic regions (http://trafac.chmcc.org), we extended the query to identify compositionally similar cis-regulatory element clusters that occur in groups of co-expressed genes within each of their ortholog-pair evolutionarily conserved cis-regulatory regions (“peak analyzer”). We have tested series of co-regulated ortholog pairs of promoters and genes, using known regulatory regions as training sets and co-expressed genes identified from microarray expression profiles as test sets, in the central nervous system, liver, olfactory, and immuno-hematologic systems. Our results suggest that this combinatorial approach is broadly sensitive for identifying known and potential regulatory regions containing conserved cis-elements for known compartment-specific trans-acting factors. However, sensitive detection of some known regulatory regions leads to an abundance of apparent false positives. We believe this approach can be substantially refined by improved use of compositional similarity algorithms and weighted detection of preferred architecture models.

Method:

The workflow figure is summarized below; the steps and example values are recovered from the diagram labels.

1. GeneChip experiments identify a set of coordinately expressed genes.
2. ESTs/cDNAs for these genes are used in a BlastN/Blat search for genomic sequence retrieval (Seq 1: human/mouse genomic; Seq 2: mouse/human genomic).
3. The two genomic sequences of each human-mouse (Hs-Mm) ortholog pair are aligned with BlastZ, and TF binding sites are located in each sequence.
4. TraFaC reports, for each aligned block, the number of TF binding sites common to both sequences (“hits”).
5. Peak Analyzer compares the hits across all ortholog pairs (Gene 1: Hs-Mm, Gene 2: Hs-Mm, Gene 3: Hs-Mm, … Gene n: Hs-Mm).

Example genomic sequences:

>Seq 1 Human/Mouse Genomic
AGAGAAAATTGCTAGAGCTCAGGAGTTTGAGACCAGCCTGGGCAATAGAGTAAGACTTTGTCTCTATCAAAAATTTAAAAATTAACTGGGCTTGGCGGTGTGCACCTGTGGTCCAGCTACTCAGGAGGCTGAGGTGGGAGGATTGCTTGAGCCCAAGA
>Seq 2 Mouse/Human Genomic
GACTGAGGGCTTGTGAAACAGCAAGAACCTGTCTCAAAAAACAGTGGGCAGGGAGGGGATTAATGAATAGGCAGCTACGTTCTGGGACTGGAGGGACTCGAGGTGGCTAGAAAGCAAGAGGTACTGGGAGACAAGGCTGCAGACATTTCTTTTTTTACTAGAGTC

Example local alignment:

Similarity Score: 3074
Match Percentage: 51%
Number of Matches: 96
Number of Mismatches: 39
Total Length of Gaps: 52
Begins at (8281,8874) and ends at (8416,9059)

Seq 1      <-->  Seq 2      Sim%  Nt
8281-8300  <-->  8874-8893  70%   (20 nt)
8301-8310  <-->  8902-8911  90%   (10 nt)
8311-8324  <-->  8923-8936  57%   (14 nt)
8325-8376  <-->  8947-8998  62%   (52 nt)
8378-8386  <-->  8999-9007  67%   (9 nt)
8387-8416  <-->  9030-9059  90%   (30 nt)

Example TF binding sites (Seq 1 coordinates):

V$ETSF/ETS1_B     8333-8347
V$STAT/STAT1_01   8335-8355
V$ETSF/PU1_B      8335-8350
V$ETSF/GABP_B     8336-8347
V$ETSF/NRF2_01    8338-8347
V$CLOX/CDPCR3_01  8363-8377

Example TF binding sites (Seq 2 coordinates):

V$ETSF/ETS1_B     8880-8894
V$STAT/STAT1_01   8881-8901
V$ETSF/PU1_B      8882-8897
V$ETSF/NRF2_01    8892-8902
V$CLOX/CDPCR3_01  8908-8922
V$GATA/GATA_C     8916-8928

Example TraFaC output (hits per aligned block):

Seq 1      <-->  Seq 2      Sim%  Nt       Hits
8301-8310  <-->  8902-8911  90%   (10 nt)  3
8311-8324  <-->  8923-8936  57%   (14 nt)  2
8325-8376  <-->  8947-8998  62%   (52 nt)  3
8378-8386  <-->  8999-9007  67%   (9 nt)   0
8387-8416  <-->  9030-9059  90%   (30 nt)  4

The figure repeats the same illustrative hit table for the remaining sequence pairs (Seq 3 <--> Seq 4, Seq 5 <--> Seq 6, Seq 7 <--> Seq 8, Seq 9 <--> Seq 10) before they are passed to Peak Analyzer.
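The example alignment statistics and TF binding site lists shown in the Method figure can be cross-checked with a few lines of Python. This is a minimal sketch, not the TraFaC implementation. The formula for the match percentage (matches divided by the sum of matches, mismatches, and gap length) is an assumption that happens to reproduce the reported 51%, and counting a site as shared whenever the same matrix name occurs in both sequences is a simplification that ignores position.

```python
# Cross-check of the example data from the Method figure.
# Assumption: match % = matches / (matches + mismatches + gap length).
matches, mismatches, gap_length = 96, 39, 52
match_pct = round(100 * matches / (matches + mismatches + gap_length))

# TF binding sites listed in the figure: matrix name -> (start, end).
seq1_sites = {
    "V$ETSF/ETS1_B": (8333, 8347),
    "V$STAT/STAT1_01": (8335, 8355),
    "V$ETSF/PU1_B": (8335, 8350),
    "V$ETSF/GABP_B": (8336, 8347),
    "V$ETSF/NRF2_01": (8338, 8347),
    "V$CLOX/CDPCR3_01": (8363, 8377),
}
seq2_sites = {
    "V$ETSF/ETS1_B": (8880, 8894),
    "V$STAT/STAT1_01": (8881, 8901),
    "V$ETSF/PU1_B": (8882, 8897),
    "V$ETSF/NRF2_01": (8892, 8902),
    "V$CLOX/CDPCR3_01": (8908, 8922),
    "V$GATA/GATA_C": (8916, 8928),
}

# Simplification: a site is "shared" if its matrix name occurs in both
# sequences; positions within the alignment are not compared here.
shared = sorted(set(seq1_sites) & set(seq2_sites))
only_seq1 = sorted(set(seq1_sites) - set(seq2_sites))
only_seq2 = sorted(set(seq2_sites) - set(seq1_sites))

print("match percentage:", match_pct)
print("shared:", shared)
print("Seq 1 only:", only_seq1)
print("Seq 2 only:", only_seq2)
```

Under these assumptions, five matrices are common to both sequences, while V$ETSF/GABP_B (Seq 1 only) and V$GATA/GATA_C (Seq 2 only) would be dropped because they are not conserved in the ortholog.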

Coordinately Expressed Genes in Olfactory Mucosa: Three genes with high levels of expression in olfactory mucosa shared several clusters of cis-elements. Each of these clusters was also conserved between human and mouse. The window size ranged from 200 to 300 base pairs. Two of the genes depicted here (XM_134943 and XM_143313) encode hypothetical proteins, while the third is TPD52 (tumor protein D52) (Genter et al., 2003).

Coordinately Expressed Genes in Cerebellum: Conserved cis-element clusters (200-300 base pair window) between human and mouse homologs, shared by four genes highly expressed in cerebellum: ATP2A2 (Ca++-ATPase), HPCAL1 (hippocalcin-like 1), CACNA1A (P/Q type Ca channel alpha 1A), and PLA2G7 (phospholipase A2 group VII) (Zhang et al., 2003).
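One way to picture a cis-element cluster within a 200 to 300 base pair window is a sweep over sorted site positions. The sketch below is illustrative only: the function name `find_clusters`, the `min_sites` threshold, and the site positions are hypothetical, and the poster does not specify how Peak Analyzer delimits its windows.

```python
from typing import List, Tuple

Site = Tuple[int, str]  # (start position, TF matrix name)


def find_clusters(sites: List[Site], window: int = 300,
                  min_sites: int = 3) -> List[List[Site]]:
    """Group sites into non-overlapping clusters whose start positions
    span at most `window` bp and that contain at least `min_sites` sites."""
    ordered = sorted(sites)
    clusters = []
    i = 0
    while i < len(ordered):
        j = i
        # Extend while the next site starts within `window` bp of site i.
        while j + 1 < len(ordered) and ordered[j + 1][0] - ordered[i][0] <= window:
            j += 1
        if j - i + 1 >= min_sites:
            clusters.append(ordered[i:j + 1])
            i = j + 1  # resume after the cluster just reported
        else:
            i += 1
    return clusters


# Hypothetical conserved sites for one gene (positions are invented).
sites = [
    (100, "ETS1"), (180, "STAT1"), (260, "PU1"),
    (1500, "GATA"),
    (4000, "ETS1"), (4100, "GATA"), (4250, "STAT1"),
]
clusters = find_clusters(sites)
for cluster in clusters:
    names = [name for _, name in cluster]
    print(cluster[0][0], cluster[-1][0], names)
```

With these invented positions the sweep reports two clusters (100 to 260 and 4000 to 4250) and skips the isolated site at 1500.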

Skeletal Muscle Genes - Regulogram depiction of shared cis-elements: Horizontal bars with colored segments (exons) are the human and mouse genomic sequences. The differently colored quadrilaterals are regions of alignment. Within each of these blocks, the percent sequence similarity and the number of TF-binding sites are plotted as two separate line graphs.

TraFaC images of the experimentally validated regulatory regions of skeletal muscle genes (represented as blue circles on the regulograms): The two gray vertical bars are the two genes being compared. The TF-binding sites occurring in both genes are highlighted as colored bars drawn across the two genes.

DES: Upstream Enhancer Region
MYL1: Intronic Enhancer Region
CKM: Upstream Enhancer Region
ENO3: Intronic Enhancer Region

Peak Analyzer:

After the initial genomic sequence alignment of orthologous skeletal muscle genes (DES (desmin), MYL1 (myosin light polypeptide 1), CKM (creatine kinase, muscle), and ENO3 (enolase 3 beta, muscle)), the “peaks” or “hits” (cis-elements common to an orthologous gene pair and occurring in conserved genomic regions) were compared to identify shared cis-regulatory modules. The identified cis-clusters included the experimentally validated regulatory regions in each of these genes and comprised multiple muscle regulatory cis-elements (Wasserman and Fickett, 1998). The horizontal lines are the genomic sequences of the base species (human in this case). Yellow vertical bars are the exons. The differently colored boxes represent the different cis-clusters.
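Comparing peaks across genes can be pictured as a set comparison of cluster compositions. The example below is an illustration under stated assumptions, not the Peak Analyzer algorithm: the Jaccard index stands in for whatever compositional similarity measure the tool actually uses, and the gene labels, element names, and cluster compositions are invented placeholders.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical cluster compositions (sets of cis-element names),
# one conserved cluster per gene. All values are placeholders.
clusters = {
    "gene_A": {"MEF2", "MYF", "SRF", "SP1"},
    "gene_B": {"MEF2", "MYF", "SRF"},
    "gene_C": {"MEF2", "SRF", "TEF"},
}

# Pairwise compositional similarity between the clusters.
similarity = {
    (g1, g2): jaccard(clusters[g1], clusters[g2])
    for g1, g2 in combinations(sorted(clusters), 2)
}

# Elements present in every cluster: a candidate shared module.
shared_module = set.intersection(*clusters.values())

for pair, score in similarity.items():
    print(pair, round(score, 2))
print("shared module:", sorted(shared_module))
```

A set-based measure ignores the order and spacing of elements within a cluster; a weighted or architecture-aware score would be needed to capture those.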

Limitations:

1. Cis-elements that are not conserved across the orthologous genes cannot be identified, even when they occur in regions of sequence similarity across the species.

2. Cis-elements that occur in non-aligned genomic regions across the two species cannot be identified by this approach.

References:

http://trafac.chmcc.org

Support:

HHMI and NIEHS U01 ES11038 Mouse Centers Genomics Consortium

Conclusions:

1. The combinatorial approach of identifying coordinately regulated genes that share compositional similarity of cis-elements within their orthologous non-coding genomic regions offers a powerful filter that can aid in the identification of potentially functional cis-clusters.

2. Peak Analyzer appears capable of identifying known and novel regulatory modules within a cluster of coordinately regulated genes.

3. These novel cis-element modules may be usable as probes for genome-wide annotation of potential regulatory regions.