
In silico identification of novel cis-regulatory elements in Mesorhizobium loti

Feroz Khan (M.Tech. Biotech)1, Shipra Agrawal (Ph.D)2 and B. N. Mishra (Ph.D.)3§

1,3 Department of Biotechnology, Institute of Engineering & Technology, Lucknow
2 Jawahar Lal Nehru Center of Advanced Research, Bangalore, India
E-mail: [email protected], [email protected] & [email protected]
§ Corresponding author

Dr. B.N. Mishra
Assistant Professor & Head
Department of Biotechnology
Institute of Engineering & Technology, Sitapur Road, Lucknow-226012 (U.P.), India
Phone: +91 522 2363220, 2733148 Ext. 206 (O), +91 522 2731636 (Fax)
Email: [email protected]

ABSTRACT

A computational approach was designed to detect over-represented hexanucleotides located within the -400 bp upstream sequences of four sets of genes belonging to related functional categories, viz. Nitrogen fixation, Symbiosis, Nitrogen metabolism and Glutamate family, in Mesorhizobium loti, a symbiont of the model legume Lotus japonicus. The upstream sequences of these genes were analyzed for known transcription factor (TF) binding sites, and the identified sites were then verified statistically using over-represented hexanucleotide frequencies and compared with experimental data. Finally, eight families of known TF/binding sites were recognized across the sets, together with high-affinity novel motif patterns. The genome-wide occurrence of the detected patterns was verified and included several nif genes, nod genes, nitrogen metabolism related genes and amino acid biosynthetic genes. These findings in the genome of M. loti may lead to a more detailed analysis of the regulatory network involved in the symbiotic interaction with the host plant L. japonicus.

Keywords: Regulatory binding motifs, TF binding sites in M. loti, Hexanucleotide motifs, Nitrogen fixing bacteria.

INTRODUCTION

Most computational analyses have focused on coding sequences or on proteins, and fewer on non-coding sequences [van Helden, J, 2003]. Non-coding regions are of interest since they govern the regulation of gene expression. Regulatory profiles of known and unknown genes are already being determined experimentally at a genomic scale by DNA microarray technology [Schena et al., 1995; Schena et al., 1996; Schena M, 1996; Goffeau et al., 1997; Lashkari et al., 1997; DeRisi et al., 1997]. In addition, several programs have been developed to isolate unknown patterns shared by sets of functionally related DNA sequences [Waterman et al., 1984; Galas et al., 1985; Mengeritsky G and Smith TF, 1987; Stormo GD & Hartzell GW, 1989; Hertz et al., 1990; Lawrence C & Reilly A, 1990; Cardon LR and Stormo GD 1992; Lawrence et al., 1993; Neuwald et al., 1995; Hertz G and Stormo G, 1995; Wolfertstetter et al., 1996; Bulyk et al., 2004]. Each of these programs was inspired by a particular type of signal and is generally highly efficient at detecting elements of that type.

In the present work, a simple analysis method was designed to detect over-represented novel hexanucleotides in the upstream sequences of four sets of genes belonging to related functional categories, viz. Nitrogen fixation, Symbiosis, Nitrogen metabolism and Glutamate family, in M. loti [Kaneko et al., 2000]. The upstream regions of these genes were analyzed for known transcription factor (TF) binding sites using the prokaryotic transcription factor database (ooTFD), and the identified motifs were further verified by statistical oligonucleotide analysis.

MATERIALS AND METHODS

Retrieval of genes

Sequences of the M. loti gene sets [Kaneko et al., 2000] belonging to the functional categories Symbiosis, Nitrogen metabolism, Glutamic acid family and Nitrogen fixation were retrieved from the Rhizobase server (http://www.kazusa.or.jp/rhizobase/). These genes were selected because they relate to different nitrogen metabolic pathways (Tables 1, 2, 3 and 4).

Retrieval of upstream sequences without overlapping ORF upstream

Because bacterial genes are organized into operons, it is preferable to avoid overlap with upstream open reading frames (ORFs), so that the retrieved upstream regions do not include too much coding sequence. For example, in Escherichia coli, 25% of the genes have an upstream neighbour closer than 50 bp, which may suggest that they belong to the same operon [van Helden J, 2003; Siegele et al., 1989; Bulyk et al., 2004].
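The clipping rule described above can be illustrated with a short Python sketch. It is not the retrieval code used in this study: the function name `upstream_window`, the 1-based inclusive coordinate convention and the `neighbour` argument are assumptions made for illustration, and a linear (non-circular) coordinate axis is assumed.

```python
def upstream_window(start, end, strand, neighbour=None, max_len=400):
    """Return (lo, hi), 1-based inclusive, of the upstream region, or None if empty.

    start, end : gene coordinates on the forward axis (start <= end)
    strand     : '+' or '-'
    neighbour  : (n_start, n_end) of the nearest ORF on the upstream side, or None
    max_len    : maximal upstream length (400 bp in this study)
    """
    if strand == "+":
        # Upstream of a plus-strand gene lies to the left of its start.
        lo, hi = max(1, start - max_len), start - 1
        if neighbour is not None:
            lo = max(lo, neighbour[1] + 1)   # stop at the end of the upstream ORF
    elif strand == "-":
        # Upstream of a minus-strand gene lies to the right of its end.
        lo, hi = end + 1, end + max_len
        if neighbour is not None:
            hi = min(hi, neighbour[0] - 1)   # stop at the start of the upstream ORF
    else:
        raise ValueError("strand must be '+' or '-'")
    # An overlapping or adjacent neighbour leaves no non-coding upstream region.
    return (lo, hi) if lo <= hi else None
```

For a plus-strand gene starting at position 1000 whose upstream neighbour ends at position 960, the sketch returns the 39 bp window (961, 999) instead of the full 400 bp.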

Identification of known regulatory factor/ binding site

The upstream sequences of all four gene sets were analyzed for known TF-binding sites with the ‘Tfsitescan’ search tool at the object-oriented transcription factors database (ooTFD) server [Ghosh D, 2000] (Tables 5, 6, 7 and 8).

Clustering of genes on the basis of identified TF-binding site(s) family

All studied sets of genes were categorized into regulatory families on the basis of the identified known transcription factor/binding site(s), together with the corresponding known sites, their length and the known transcription factor that binds them, if any (Table 9). The details of each known transcription factor/binding site, with references, are summarized in Table 10.
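As a rough illustration of the two steps above (scanning for a known consensus, then grouping genes by the site found), the following Python sketch matches IUPAC consensus strings such as GGAKGA or TTHACA against upstream sequences. It is not the Tfsitescan algorithm, whose scoring is not described here; the function names and the single-strand exact matching are assumptions made for illustration.

```python
import re
from collections import defaultdict

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")

def scan(sequence, consensus):
    """Return 0-based start positions of all matches on the given strand."""
    return [m.start() for m in consensus_to_regex(consensus).finditer(sequence.upper())]

def group_by_site(upstreams, sites):
    """upstreams: {gene: sequence}; sites: {site_name: consensus}.
    Returns {site_name: [genes whose upstream sequence contains the site]}."""
    families = defaultdict(list)
    for gene, seq in upstreams.items():
        for name, consensus in sites.items():
            if scan(seq, consensus):
                families[name].append(gene)
    return dict(families)
```

Exact consensus matching is stricter than a database search that tolerates mismatches, so a sketch like this would be expected to report fewer sites than a dedicated tool.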

Hexanucleotide analysis

In M. loti, potential upstream hexanucleotide patterns were detected in each set, along with genome-wide pattern identification. The modules for oligonucleotide analysis can be accessed through the Regulatory Sequence Analysis Tools (RSAT) server [van Helden et al., 2000; van Helden J, 2003] at the web interface http://rsat.ulb.ac.be/rsat/. In the analysis, several statistical parameters were used to validate true positive predictions. According to van Helden J. (1998), these statistical parameters are defined as:

Expected occurrences (exp_occ): the number of occurrences expected for the considered oligonucleotide within the set of sequences. The calculation of this value depends on the probabilistic model.

Occurrence probability (occ_pro): the probability to have N or more occurrences, given the expected number of occurrences (where N is the observed number of occurrences).

Expected matching sequences (exp_ms): the expected number of sequences with at least one occurrence.

Matching sequence probability (ms_pro): the probability to have L or more sequences with at least one occurrence of the oligonucleotide, given the probabilistic model (where L is the observed number of matching sequences).

Significance index (sig): this is a conversion of the occurrence probability, taking into account the number of possible oligonucleotides (which varies with oligo size) and applying a logarithmic transformation. The highest sig values correspond to the most over-represented oligonucleotides. A sig value higher than zero indicates over-representation.
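To make these definitions concrete, the Python sketch below counts the occurrences of one oligonucleotide and converts the count into an occurrence probability and a significance index. It assumes the binomial model and the relation sig = -log10(NPO × P) given in the Probabilities section of this paper, with NPO taken as 4^k; the function names are hypothetical, and this is not the RSAT implementation (it scans a single strand and takes the pattern probability p as an input).

```python
import math

def count_occurrences(sequences, oligo):
    """Return (occurrences, matching_sequences); overlapping matches are counted."""
    k = len(oligo)
    occ = ms = 0
    for seq in sequences:
        seq = seq.upper()
        n = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == oligo)
        occ += n
        ms += 1 if n else 0
    return occ, ms

def log_binom_pmf(i, T, p):
    """Log of the binomial probability of exactly i successes in T trials (0 < p < 1)."""
    return (math.lgamma(T + 1) - math.lgamma(i + 1) - math.lgamma(T - i + 1)
            + i * math.log(p) + (T - i) * math.log1p(-p))

def occurrence_probability(p, T, obs):
    """occ_pro: probability of obs or more occurrences among T matching positions."""
    tail = sum(math.exp(log_binom_pmf(i, T, p)) for i in range(obs, T + 1))
    return min(1.0, tail)

def significance(p, T, obs, k=6):
    """sig: -log10 of the E-value, assuming NPO = 4**k possible oligomers of length k."""
    e_value = 4 ** k * occurrence_probability(p, T, obs)
    return math.inf if e_value == 0 else -math.log10(e_value)
```

Working in log space avoids overflow of the binomial coefficient when T reaches tens of thousands of positions, which is the order of magnitude for a few dozen 400 bp upstream regions.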

PROBABILITIES

Various calibration models were used to estimate the probability of each oligonucleotide. From these, the expected number of occurrences was calculated and compared with the observed number of occurrences. The significance of the observed number of occurrences was calculated with the binomial formula (van Helden J., 1998):

Expected occurrences

$$\mathrm{Exp\_occ} = p \cdot T = p \sum_{j=1}^{S} (L_j + 1 - k)$$

Where,
p = probability of the pattern
S = number of sequences in the sequence set
L_j = length of the jth regulatory region
k = length of the oligomer
T = the number of possible matching positions.

Probability of sequence matching

The probability of finding at least one occurrence of the pattern within a single sequence is:

$$q = 1 - (1 - p)^{T}$$

with the same abbreviations as above.

Expected number of matching sequences

In this counting mode, only the first occurrence in each sequence is taken into consideration. We thus have to calculate a probability of first occurrence.

$$\mathrm{Exp\_ms} = n \left(1 - (1 - p)^{T}\right)$$

with the same abbreviations as above.

Correction for autocorrelation (from Mireille Regnier)

$$\mathrm{Exp\_ms\_corrected} = n \left(1 - (1 - p/a)^{T}\right)$$

Where, a = the coefficient of autocorrelation.
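The expected-value formulas of this section translate directly into code. The Python sketch below is only an illustration: the function names are hypothetical, n is assumed to denote the number of sequences (the text does not define it), and T is passed in exactly as it appears in each formula.

```python
def matching_positions(lengths, k):
    """T: sum over sequences of (L_j + 1 - k).
    Sequences shorter than k contribute 0 (a guard that is not part of the formula)."""
    return sum(max(0, L + 1 - k) for L in lengths)

def expected_occurrences(p, lengths, k):
    """Exp_occ = p * T."""
    return p * matching_positions(lengths, k)

def prob_sequence_match(p, T):
    """q = 1 - (1 - p)^T: probability of at least one occurrence."""
    return 1 - (1 - p) ** T

def expected_matching_sequences(p, T, n, a=1.0):
    """Exp_ms = n * (1 - (1 - p/a)^T); a = 1 gives the uncorrected value."""
    return n * (1 - (1 - p / a) ** T)
```

For example, three upstream regions of 400 bp and k = 6 give T = 3 × 395 = 1185 positions; with an equiprobable model (p = 1/4096, an assumption rather than the calibration used in this study) the expected number of occurrences is 1185/4096, about 0.29.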

Probability of the observed number of occurrences (binomial)

The probability of observing exactly ‘obs’ occurrences in the whole family of sequences is calculated with the binomial:

$$P(\mathrm{obs}) = \mathrm{bin}(p, T, \mathrm{obs}) = \frac{T!}{\mathrm{obs}!\,(T-\mathrm{obs})!}\; p^{\mathrm{obs}} (1-p)^{T-\mathrm{obs}}$$

Where,
obs = the observed number of occurrences,
p = the expected frequency of the pattern,
T = the number of possible matching positions.

The probability of observing ‘obs’ or more occurrences in the whole family of sequences is calculated as the sum of binomials:

$$P(\geq \mathrm{obs}) = \sum_{i=\mathrm{obs}}^{T} P(i) = 1 - \sum_{i=0}^{\mathrm{obs}-1} P(i)$$

E-value

The probability of occurrence by itself is not fully informative, because the threshold must be adapted to the number of patterns considered. The E-value represents the expected number of patterns that would be returned at random for a given P-value (probability).

$$\text{E-value} = \mathrm{NPO} \cdot P(\geq \mathrm{obs})$$

Where, NPO = the number of possible oligomers of the chosen length.

Significance index or coefficient (“sig”)

The significance index is simply a negative logarithm conversion of the E-value (in base 10). It is calculated as follows:

$$\mathrm{Sig\_occ} = -\log_{10}(\text{E-value})$$

This index is very convenient to interpret: the highest values correspond to the most exceptional patterns.

RESULTS

Bacterial metabolism has been widely studied and provides many examples of known regulons [Fisher et al., 1988; Hovey AK & Frank DW, 1995; Householder et al., 1999; Bulyk et al., 2004]. In many cases the TF involved in the common response is known, as well as its binding site [Ow DW et al., 1983]. These families of co-regulated genes provide ideal datasets for calibrating the analysis method, which could then be extended to families whose regulatory elements are unknown. In the present study, several transcription factor families were predicted on the basis of DNA motif conservation. Genes were grouped according to related nitrogen metabolic pathways, without a priori consideration of the content of their upstream regions. They were then analyzed for potential known TFs and their binding sites (Table 9), and the results were further verified through comparison with the hexanucleotides detected in each gene set (Tables 11, 12 and 13). Finally, noise was reduced by retaining only the genes that had both known TF binding sites and hexanucleotide patterns (Table 14). The results suggest that the genes finally identified, carrying known binding sites and novel hexanucleotide patterns, are likely to have an important role in the regulation of nitrogen metabolic pathways, and hence in the regulation of the nitrogen-fixing symbiosis between nitrogen-fixing bacteria and host leguminous plants. The results are described below.

Clusters of overlapping hexanucleotides reveal wider regulatory sequences

For each data set (Tables 1, 2, 3 and 4) we extracted the -400 bp upstream sequences relative to the transcription start site and performed hexanucleotide analyses as described in the methodology. To avoid false positive predictions, hexanucleotide patterns were retained only if they met standard statistical thresholds, i.e. significance coefficient (briefly “sig”) > 0 and number of occurrences (briefly “occ”) ≥ 4. With these cut-offs, very few upstream sequences showed hexanucleotide patterns: 37 out of a total of 167 genes in all four sets (Tables 11 and 12). The statistical values corresponding to each hexanucleotide pattern are summarized in Table 13. Finally, upstream regions with both hexanucleotide patterns and known TF-binding sites were retained: 20 of the 37 genes, which showed 8 known TF-binding sites (Table 14). In most families of pattern clusters, the hexanucleotides with the higher significance coefficient and number of occurrences were assumed to be the novel regulatory binding sites. Highly significant patterns generally appeared in clusters with a few additional overlapping hexanucleotide patterns that had weaker significance coefficients. For example, in the MalT family the hexanucleotide GGCAGA (sig=0.32) can be grouped with another strongly overlapping sequence, CCCCAC (sig=0.62). When the two patterns are combined, the most significant hexanucleotide pattern corresponds to the 7 bp conserved sequence CGGCAGA, which closely matched the 6 bp known TF binding consensus sequence of Malt_Cs. Similarly, in the ExsA family the most salient hexanucleotide was ATAAAA (sig=2.70), which can be grouped with the strongly overlapping sequence AAACGT (sig=1.12). Combining the two patterns gives the 9 bp conserved sequence ATAAAACGT, which closely matched the 8 bp known TF binding consensus sequence of the ExsA_Cs transcription factor (TNAAAANA). In most families, viz. MalT, PhoP, ExsA and MalT_malPp, the overlapping clusters reflect the fact that the recognition domain of the transcription factor is wider than 6 nucleotides, with a conserved hexanucleotide core. The maximum significance coefficient indicates the most conserved hexanucleotide core, which usually corresponds to the bases interacting directly with the transcription factor. The decrease in significance for the lateral overlaps reflects the fact that these positions are less crucial for TF binding.
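The two operations described in this subsection, thresholding on sig and occ and then assembling overlapping hexanucleotides into a wider pattern, can be sketched in Python as follows. The function names, the dictionary layout of a pattern record and the minimum overlap of 3 nucleotides are assumptions made for illustration; the text does not state how the overlaps were actually assembled.

```python
def retain(patterns, min_sig=0.0, min_occ=4):
    """Keep patterns with sig > min_sig and occ >= min_occ.
    Each pattern is assumed to be a dict with keys 'oligo', 'sig' and 'occ'."""
    return [p for p in patterns if p["sig"] > min_sig and p["occ"] >= min_occ]

def merge_overlapping(left, right, min_overlap=3):
    """Merge two oligonucleotides when a suffix of `left` equals a prefix of `right`.
    Returns the merged string, or None if no overlap of at least min_overlap exists."""
    longest = min(len(left), len(right)) - 1
    for n in range(longest, min_overlap - 1, -1):   # prefer the longest overlap
        if left[-n:] == right[:n]:
            return left + right[n:]
    return None
```

With the ExsA example from the text, `merge_overlapping("ATAAAA", "AAACGT")` returns `"ATAAAACGT"`, the 9 bp sequence reported above.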

Transcription Factor Families covering clusters of variable patterns

On the basis of pattern clusters, three TF families were categorized, viz. MalT, PhoP and ExsA (Table 14). Each family and its related genes are described below:

1. MalT family

In this study, two independent clusters of binding sites (hexanucleotide patterns) for the Malt_Cs transcription factor [Raibaud et al., 1985] were detected. Cluster I showed the higher-affinity pattern GGCAGA (sig=0.32 & occ=8), corresponding to the known binding site (GGAKGA) for the Malt_Cs TF, while cluster II, with pattern ATAAAA (sig=2.70 & occ=6), defines a low-affinity consensus for the same TF.

2. PhoP family

The PhoP-PhoR two-component regulatory system is reported to control the phosphate deficiency response in Bacillus subtilis. A number of pho regulon genes that require PhoP for activation or repression have been identified [Eder et al., 1999].

A similar situation was observed here in the PhoP family, where two independent clusters of binding sites for the PhoP TF were detected. Cluster I showed the higher-affinity pattern CGATCG (sig=1.39 & occ=6), corresponding to the known binding consensus (TTHACA) for the PhoP TF, while cluster II, with pattern ATAAAA (sig=2.70 & occ=6), showed low affinity for the same TF.

3. ExsA family

ExsA has been implicated as a central regulator of exoenzyme production by Pseudomonas aeruginosa [Hovey AK & Frank DW, 1995]. In this study we identified hexanucleotide patterns for the same TF. In the ExsA family, cluster I showed the high-affinity pattern ATAAAA (sig=2.70 & occ=6), corresponding to the known binding consensus (TNAAAANA) for the ExsA_Cs transcription factor, while cluster II defines the low-affinity consensus GGGATA (sig=0.42 & occ=4) for the same TF.

Putative unknown regulatory sites

Besides the known regulatory sites, a few additional unknown hexanucleotides were observed within families (Table 14). Based on the results for the known regulatory sites, one can infer that an ideal unknown site should appear as a cluster of overlapping hexanucleotides with a high significance coefficient. Several unknown regulatory patterns fitting these criteria were extracted from the hexanucleotide analysis and were considered good candidates for new regulatory sites on the basis of conservation. The putative families of unknown hexanucleotides are as follows:

A. Families with cluster of unknown hexanucleotide patterns

We analysed four putative regulatory families, viz. MalT, PhoP, ExsA & MalT_malPp. A similar cluster of two hexanucleotides, ATAAAA & AAACGT, appeared in the MalT, PhoP and MalT_malPp regulatory families; within it, the pattern ATAAAA was considered highly significant because of its higher significance coefficient and number of occurrences (sig=2.70 & occ=6). Similarly, a cluster of two hexanucleotides, GGGATA & GGCAGA, appeared in the ExsA family; within it, the pattern GGGATA was considered highly significant because of its higher significance and occurrence values (sig=0.42 & occ=6). Here, varying the oligonucleotide size would be expected to reveal the same pattern with flanking nucleotides.

B. Families with single unknown hexanucleotide pattern

In total, four putative regulatory families, viz. MomR/oxyR, CAP/CRP, Nitrogen regulatory and Lambda, were detected with a single unknown hexanucleotide pattern. These are described as follows:

1. MomR / oxyR family

The MomR protein is reported to be identical to OxyR, a regulatory protein that responds to oxidative stress [Bolker M & Kahmann R, 1989]. In the MomR/oxyR family, a single hexanucleotide pattern, AGCTTG, appeared in the upstream sequence of the symbiosis-related gene Mlr6175, which encodes chitooligosaccharide deacetylase (nodulation protein NodB). The pattern had low significance and occurrence values (sig=0.19 & occ=7) and thus showed low affinity to the known binding consensus sequence, ATGCATCRW, of the MomR/oxyR_Cs transcription factor.

2. CAP/CRP family

Cyclic AMP (cAMP) and its receptor protein (CRP) have a dual role in the regulation of the two promoters that control the galactose (gal) operon of E. coli [Taniguchi et al., 1979]. In the CAP/CRP family, a single hexanucleotide pattern, AATTCG, was detected in the upstream sequence of the nitrogen fixation related gene Mll4698, which encodes a two-component system histidine protein kinase (FixL like). The pattern had a significance coefficient of 0.28 and 7 occurrences, and thus showed low affinity to the known binding consensus ACACTTT of the known TF (CAP/CRP-lac).

3. Nitrogen regulatory site family

In the Nitrogen regulatory family, a single hexanucleotide pattern, GGCAGA, was detected upstream of the Glutamate family gene Mlr1730, which encodes histidine ammonia-lyase, with a significance coefficient of 0.32 and 8 occurrences. This defines low affinity to the known binding consensus TTTTGCA [Ow DW et al., 1983].

4. Lambda site family

In the Lambda family, a single hexanucleotide pattern, ATTACC, was detected upstream of genes in the symbiosis functional category, e.g. gene Mll9683, which encodes the protein AtsE, with a significance coefficient of 0.15 and 4 occurrences. It defines low affinity to the known binding consensus GGYGTRYG and is thus expected to be an unknown regulatory site. For the lambda protein, it is reported that transcription anti-termination by the bacteriophage lambda N protein is stimulated in vitro by the E. coli NusG protein [Zhou et al., 2002].

Known regulatory sites not detected through hexanucleotide analysis

Hexanucleotide analysis enabled us to detect 16 known regulatory sites out of 12 classified regulatory families in M. loti. Four known sites escaped detection through hexanucleotide analysis: (i) the binding site for Nod-factor in the Nod family, (ii) the TATA-box for RNA polymerase sigma factor in the TATA-box family, (iii) the binding site for NR(I) factor in the NR(I) family and (iv) the binding site for NarL/NarP-Cs factor in the NarL/NarP family. In contrast to all other families, not a single hexanucleotide had a positive significance coefficient in these families.

DISCUSSION

It is well established that the nitrogen-fixing symbiosis between rhizobia and legumes is important for sustainable agricultural practices and contributes significantly to the global nitrogen cycle. M. loti, a rhizobial bacterium, forms a symbiosis with the model legume L. japonicus. The genome of M. loti is completely sequenced [Kaneko et al., 2000] and the genome sequencing project of L. japonicus is under way. With the completion of the L. japonicus genome sequence, the molecular analysis of the symbiotic relationship between the two organisms will become much easier. In particular, emphasis needs to be given to the functional as well as the regulatory genomics of M. loti. The genome sequence data of M. loti have made available annotated genes classified under specific metabolic processes [Kaneko et al., 2000]. From the point of view of nitrogen fixation, we initially took four sets of genes involved in symbiosis, nodulation, nitrogen fixation and glutamic acid metabolism. It is pertinent to mention that M. loti, with a genome size of 7.03 Mb, carries a 500 Kb transposable symbiotic island comprising clusters of genes responsible for symbiosis as well as nodulation in L. japonicus [Kaneko et al., 2000]. The genes involved in these processes have been annotated and are being confirmed experimentally, whereas the DNA motifs involved in regulating the expression of these functional genes have not been well studied. In the present work, a simple computational method was optimized to identify upstream motifs relevant to gene regulation.

A total of 57 genes were grouped in the first, symbiosis set, and their upstream sequences were searched for known regulatory TF binding sites with a TF-binding site detection tool (Tfsitescan at the ooTFD server). The reliability of each prediction was assessed by the significance coefficient (sig) measured for the individual TF binding site. A higher ‘sig’ value and a higher number of occurrences indicate a significant prediction. Finally, we identified TF binding sites for only 21 genes. All of them showed a single occurrence of a TF binding site, except two genes that showed two occurrences: the known PhoP-consensus site, responsible for regulating expression of the nodulation protein NoeK or phosphomannomutase (mll7567), and the known MalT-CS site, responsible for regulating expression of the glutamine fructose-6-phosphate transaminase nodulation protein NodM (mlr6386). True positive predictions were statistically supported by the higher range of expectation values (expec), e.g. from 1.20e-02 to 7.26e-02.

Similarly, in the second set (nitrogen metabolism), a total of 23 genes were analyzed for known regulatory TF binding sites. Only 8 genes showed known binding sites, and all but one had a single site occurrence in their upstream sequences. The known PhoP-consensus site, responsible for regulating expression of the nitrogen regulatory protein P-II, GlnK (mll4247), had two occurrences. In this set the expectation values ranged from 1.06e-02 to 5.14e-02.

In the third set (glutamate family), a total of 35 genes were analyzed for known regulatory TF binding sites, of which only 10 genes showed known binding sites. Of these 10, two genes, glutamate synthetase-I (mll0313) and glutamate synthetase beta-subunit (mll1646), showed two occurrences of known binding sites, MalT-CS and ExsACS respectively. The expectation values ranged from 1.45e-01 to 5.46e-02.

Finally, in the fourth set (nitrogen fixation), a total of 54 genes were analyzed for known TF binding sites using the TFD database. The TF binding sites identified for 21 genes mostly showed binding patterns with a single occurrence, but gene mll5857, the Nif-specific regulatory protein NifA, showed 2 occurrences, one for each of its two types of binding pattern, ExsACS_(1) & ExsACS_(2). Here the expectation values of the predicted occurrences ranged from 1.08e-01 to 7.56e-02. All the known binding sites identified in the above four functional categories were further verified and confirmed by oligonucleotide analysis. In addition, genes identified with multiple patterns of known binding sites showed a significant number of occurrences, and these sites are thus the most likely binding sites.

Moreover, the hexanucleotide analysis of the symbiosis gene set (Table 12) showed four additional significant hexanucleotide patterns, each predicted to regulate a group of genes (* marks a putative significant gene with a known TF/site in its upstream sequence): CCCCCA (two genes: mll1107 & mll4979), CCCCAC (a single gene: mll4979*), ATTACC (three genes: mll9683*, mlr2437 & mlr5801) and AGCTTG (four genes: mlr6175*, mlr6341, mlr6622 & mlr7575). Similarly, in the nitrogen metabolism gene set only one new unknown hexanucleotide pattern, GAGCAC, was detected; it is predicted to be a potential binding site regulating three genes, mll1423, mll4247 & mlr1320. The nitrogen fixation gene set showed three new unknown hexanucleotide patterns, which were statistically verified as high-affinity binding sites controlling the expression of the following genes: AATTCG (two genes: mll3694* & mll4698*), CAGGGA (four genes: mlr3659*, mlr5871, mlr5906 & mlr5907) and CGATCG (two genes: msl6623* & msr6418). The glutamate family gene set showed four new unknown binding sites, predicted to regulate the expression of the following genes: ATAAAA (four genes: mll3030*, mll1646*, mll0343* & mll0601), AAACGT (three genes: mll3030*, mll1560 & mll1646*), GGGATA (three genes: mll3040, mll3074 & mll7254*) and GGCAGA (five genes: mll9226, mlr0039*, mlr0339, mlr1730*, mlr3506 & mlr6209*). Finally, eight consensus oligonucleotide patterns were identified and further verified by matching with known binding sites previously reported for known transcription factors or TF binding sites (* means a known site with unknown TF): ACACTTT (CAP/CRP), GGAKGA (MalT), TTHACA (PhoP), TNAAAANA (ExsA), TCCTCC (MalT_MalPp site*), ATGCATCRW (MomR/oxyR), TTTTGCA (Nitrogen_regulatory site*) and GGYGTRYG (Lambda site*) (Table 14).

CONCLUSIONS

The present study covered a wide range of annotated genes participating in different nitrogen-related metabolic pathways, which were analyzed statistically for the coherence of their co-regulation and for the potency of TF binding to the related cis-elements. The predicted TF binding sites were validated by their significant statistical values and further verified by matching with known TF binding sites. Finally, eight families of known TF binding sites were recognized, together with high-affinity new hexanucleotide patterns for the studied gene sets. Such findings in the genome of M. loti may lead to a more detailed analysis of the regulatory network involved in the symbiosis between rhizobia and the host plant L. japonicus.

ACKNOWLEDGEMENTS

We acknowledge the Council of Scientific & Industrial Research (CSIR), New Delhi for financial support as a SRF (Biotechnology) and also All India Council for Technical Education (AICTE), New Delhi for financial support as M.Tech. Biotechnology Teaching Programme at Department of Biotechnology (A Centre of Excellence in Biotechnology), Institute of Engineering and Technology, Lucknow (U.P.), India.

REFERENCES

1. Bolker M and Kahmann R (1989). The Escherichia coli regulatory protein OxyR discriminates between methylated and unmethylated states of the phage Mu mom promoter. EMBO J, 8(8):2403-10.

2. Bulyk ML, McGuire AM, Masuda N and Church GM (2004). A motif co-occurrence approach for genome-wide prediction of transcription-factor-binding sites in Escherichia coli. Genome Research, 14 (2):201-208.

3. Cardon LR and Stormo GD (1992). Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. Journal of Molecular Biology, 223:159-170.

4. DeRisi JL, Iyer VR and Brown PO (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680-686.

5. Eder S, Liu W and Hulett FM (1999). Mutational analysis of the phoD promoter in Bacillus subtilis: implications for PhoP binding and promoter activation of Pho regulon promoters. Journal of Bacteriology, 181(7):2017-25.

6. Fisher RF, Egelhoff TT, Mulligan JT and Long SR (1988). Specific binding of proteins from Rhizobium meliloti cell free extracts containing nodD to DNA sequences upstream of inducible nodulation genes. Genes Dev., 2(3):282-93.

7. Galas DJ, Eggert M and Waterman MS (1985). Rigorous pattern-recognition methods for DNA sequences: Analysis of promoter sequences from E. coli. Journal of Molecular Biology, 186(1):117-128.

8. Ghosh D (2000). Object-oriented transcription factors database (ooTFD). Nucleic Acids Research, 1; 28(1): 308-10.

9. Goffeau A, Park J, Paulsen IT, Jonniaux JL, Dinh T, Mordant P, and Saier MH Jr. (1997). Multidrug-resistant transport proteins in yeast: complete inventory and phylogenetic characterization of yeast open reading frames within the major facilitator superfamily. Yeast, 13:43-54.

10. Hertz G and Stormo G (1995). Identification of Consensus Patterns in Un- aligned DNA and Protein Sequences: A Large-Deviation Statistical Basis for Penalizing Gaps. Proceedings of the 3rd International Conference on Bioinformatics and Genome Research, p201-216.

11. Hertz GZ, Hartzell GW and Stormo GD (1990). Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Computer Applications in the Biosciences, 6:81-92.

12. Householder TC, Belli WA, Lissenden S, Cole JA and Clark VL (1999). Cis- and trans-acting elements involved in regulation of aniA, the gene encoding the major anaerobically induced outer membrane protein in Neisseria gonorrhoeae. Journal of Bacteriology, 181(2):541-51.

13. Hovey AK and Frank DW (1995). Analyses of the DNA binding and transcriptional activation properties of ExsA, the transcriptional activator of the Pseudomonas aeruginosa exoenzyme S regulon. Journal of Bacteriology, 177(15):4427-36.

14. Kaneko T. et al. (2000). Complete genome structure of the nitrogen-fixing symbiotic bacterium Mesorhizobium loti. DNA Research, 7:331-338

15. Lashkari DA, DeRisi JL, McCusker JH, Namath AF, Gentile C, Hwang SY, Brown PO and Davis RW (1997). Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proceedings of National Academy of Sciences, USA (PNAS), 94:13057-13062.

16. Lawrence C and Reilly A (1990). An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7 (1):41-51.

17. Lawrence C, Altschul S, Boguski M, Liu J, Neuwald A and Wootton J (1993). Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262 (5131):208-14.

18. Mengeritsky G and Smith TF (1987). Recognition of characteristic patterns in sets of functionally equivalent DNA sequences. Bioinformatics, Vol 3:223-227.

19. Neuwald AF, Liu JS and Lawrence CE (1995). Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Science, 4: 1618-1632.

20. Ow DW, Sundaresan V, Rothstein DM, Brown SE and Ausubel FM (1983). Promoters regulated by the glnG (ntrC) and nifA gene products share a heptameric consensus sequence in the -15 region. Proceedings of National Academy of Sciences, USA, 80(9):2524-8.

21. Raibaud O, Gutierrez C and Schwartz M (1985). Essential and nonessential sequences in malPp, a positively controlled promoter in Escherichia coli. Journal of Bacteriology, 161(3):1201-8.

22. Reitzer LJ and Magasanik B (1986). Transcription of glnA in Escherichia coli is stimulated by activator bound to sites far from the promoter. Cell, 45(6):785-92.

23. Schena M (1996). Genome analysis with gene expression microarrays. BioEssays, 18:427-431.

24. Schena M, Shalon D, Davis RW and Brown PO (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467-470.

25. Schena M, Shalon D, Heller R, Chai A, Brown PO and Davis RW (1996). Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proceedings of National Academy of Sciences, USA, 93:10614-10619.

26. Siegele DA, Hu JC, Walter WA and Gross CA (1989). Altered promoter recognition by mutant forms of the sigma 70 subunit of Escherichia coli RNA polymerase. Journal of Molecular Biology, 206(4):591-603.


27. Stormo GD and Hartzell GW (1989). Identifying protein-binding sites from unaligned DNA fragments. Proceedings of National Academy of Sciences, USA, 86:1183-1187.

28. Taniguchi T, O’Neill M and de Crombrugghe B (1979). Interaction site of Escherichia coli cyclic AMP receptor protein on DNA of galactose operon promoters. Proceedings of National Academy of Sciences, USA, 76 (10): 5090-4.

29. van Helden J (2003). Prediction of transcriptional regulation by analysis of the non-coding genome. Current Genomics, 4(3):217-224.

30. van Helden J (2003). Regulatory sequence analysis tools. Nucleic Acids Research, 31(13):3593-6.

31. van Helden J, Andre B and Collado-Vides J (1998). Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. Journal of Molecular Biology, 281(5):827-42.

32. van Helden J, Andre B and Collado-Vides J (2000). A web site for the computational analysis of yeast regulatory sequences. Yeast, 16(2):177-187.

33. Waterman MS, Smith TF and Katcher HL (1984). Algorithms for restriction map comparisons. Nucleic Acids Research, Vol. 12, Issue 1:237-242.

34. Wolfertstetter F, Frech K, Herrmann G and Werner T (1996). Identification of functional elements in unaligned nucleic acid sequences by a novel tuple search algorithm. Computer Applications in the Biosciences, 12(1):71-80.

35. Zhou Y, Filter JJ, Court DL, Gottesman ME and Friedman DL (2002). Requirement for NusG for transcription antitermination in vivo by the lambda N protein. Journal of Bacteriology, 184(12):3416-8.


Annexure-Tables

Table 1. Set of genes belonging to symbiosis functional category in M. loti.


1. mll1026: rhizobiocin secretion protein; RspE
2. mll1027: rhizobiocin secretion protein; RspD
3. mll1107: outer membrane protein, NodT candidate
4. mll1143: nodulation protein N
5. mll1266: nodulation protein; NoeC
6. mll2768: acetyltransferase, nodulation protein; NodL
7. mll3788: weak similarity to NodH
8. mll4296: ferric leghemoglobin reductase-2 precursor, dihydrolipoamide dehydrogenase
9. mll4680: glycosyltransferase, contains similarity to NolL
10. mll4979: similar to MocC (rhizopine catabolism), also similar to myo-inositol catabolism; IolE
11. mll5320: virulence factor MviN-like protein
12. mll5661: nodulation protein; NoeI
13. mll5922: GDP-D-mannose dehydratase; nodulation protein; NoeL
14. mll6337: nodulation protein; NolX
15. mll6338: nodulation protein; NolW
16. mll6943: aquaporin, nodulin-like intrinsic protein
17. mll7567: nodulation protein NoeK, phosphomannomutase
18. mll9170: virulence factor SrfC homolog
19. mll9171: virulence factor SrfB homolog
20. mll9589: AtsE
21. mll9683: AtsE
22. mlr0024: HesB-like protein
23. mlr1006: rhizopine catabolism protein; ModC
24. mlr2192: O-acetyltransferase, NodL candidate
25. mlr2437: rhizopine catabolism protein; MocC
26. mlr3097: nodulation protein NodN-Rhizobium leguminosarum
27. mlr3249: glycosyltransferase; RedB [Sinorhizobium meliloti] megaplasmid 2
28. mlr4951: NodF
29. mlr4953: NodE
30. mlr5801: phosphomannomutase; NoeK
31. mlr5802: phosphomannose isomerase/GDP-mannose pyrophosphorylase; NoeJ
32. mlr5821: nodulation protein; NodF
33. mlr5822: nodulation protein; NodE
34. mlr5848: nodulation protein; NodZ
35. mlr5849: GDP-mannose 4,6-dehydratase; nodulation protein; NoeL
36. mlr6161: methyltransferase, nodulation protein; NodS
37. mlr6163: N-acetylglucosaminyltransferase, nodulation protein; NodC
38. mlr6164: nodulation ATP-binding protein; NodI
39. mlr6166: nodulation protein; NodJ
40. mlr6171: nodulation protein; NolO
41. mlr6175: chitooligosaccharide deacetylase, nodulation protein; NodB
42. mlr6339: nodulation protein; NolT
43. mlr6341: nodulation protein; NolV
44. mlr6386: glutamine-fructose-6-phosphate transaminase nodulation protein; NodM
45. mlr6622: similar to nodulin 21
46. mlr7028: opine oxidase subunit A
47. mlr7400: bacteroid development protein; BacA
48. mlr7575: nodulation protein NodP, sulfate adenylate transferase, subunit 2
49. mlr7576: nodulation protein NodQ, sulfate adenylate transferase, subunit 1
50. mlr7780: similar to Rhizopine catabolism protein mocD
51. mlr7850: nodulation protein nodG, 3-oxoacyl-(acyl carrier protein) reductase
52. mlr8749: GDP-L-fucose synthetase; nodulation protein; NolK
53. mlr8755: acyltransferase, nodulation protein; NodA
54. mlr8757: acetyltransferase, nodulation protein; NolL
55. mlr8764: nodulation protein; NolU
56. mlr9393: nodulation protein; NoeC
57. msr3202: integration host factor beta chain

Table 2. Set of genes belonging to nitrogen metabolism functional category in M. loti.

1. mll0345: nitrogen regulatory protein P-II
2. mll1423: nitrile hydratase beta subunit
3. mll1425: nitrile hydratase alpha subunit
4. mll1732: naphthalene dioxygenase ferredoxin
5. mll3450: similar to nitrilase, nitrilase 1 like protein
6. mll4247: nitrogen regulatory protein P-II; GlnK
7. mll5100: Ferredoxin [2Fe-2S] I
8. mll6776: ornithine cyclodeaminase
9. mlr1320: nitrate/nitrite regulatory protein
10. mlr1729: ornithine cyclodeaminase
11. mlr2282: ornithine cyclodeaminase
12. mlr2862: nitrite reductase large subunit
13. mlr2863: nitrite reductase small subunit
14. mlr3204: ornithine cyclodeaminase; Ocd2
15. mlr3855: ferredoxin II
16. mlr4999: putative Rieske-like ferredoxin; MocE
17. mlr5869: ferredoxin 2[4Fe-4S] III; FdxB
18. mlr5930: ferredoxin 2[4Fe-4S] III; FdxB
19. mlr7139: ornithine cyclodeaminase
20. mlr7628: putative Rieske-like ferredoxin; MocE
21. msl0793: ferredoxin
22. msl8750: ferredoxin 2[4Fe-4S]; FdxN
23. msr9193: probable ferredoxin


Table 3. Set of genes belonging to glutamate functional category in M. loti.

1. mll0151: histidine ammonia-lyase
2. mll0343: glutamine synthetase I
3. mll0601: proline iminopeptidase
4. mll1160: proline dehydrogenase
5. mll1557: UDP-N-acetylmuramoylalanine-D-glutamate ligase
6. mll1560: UDP-N-acetylmuramoylalanyl-D-glutamate-2,6-diaminopimelate ligase
7. mll1631: N-carbamoyl-beta-alanine amidohydrolase
8. mll1646: glutamate synthase beta subunit
9. mll3029: glutamate synthase, small subunit
10. mll3030: glutamate synthase, large subunit
11. mll3040: N-formylglutamate amidohydrolase
12. mll3074: glutamine synthetase
13. mll3461: glutamate N-acetyltransferase/amino-acid acetyltransferase
14. mll4011: glutamate 5-kinase
15. mll4187: glutamine synthetase
16. mll5148: probable glutamine synthetase
17. mll6498: pyrroline-5-carboxylate reductase
18. mll6521: glutamine synthetase
19. mll7124: histidine ammonia-lyase
20. mll7254: glutamine synthetase
21. mll7307: glutamine synthetase III
22. mll7308: glutamate synthase large subunit
23. mll9226: argininosuccinate lyase
24. mlr0039: glutamate racemase
25. mlr0339: glutamine synthetase II
26. mlr1730: histidine ammonia-lyase
27. mlr3506: argininosuccinate lyase
28. mlr4366: argininosuccinate synthase
29. mlr4826: acetylglutamate kinase (EC 2.7.2.8)
30. mlr6209: histidine decarboxylase
31. mlr6210: glutamine synthetase III
32. mlr6298: gamma-glutamyl kinase
33. mlr7698: glutamate-ammonia-ligase; adenylyltransferase
34. mlr8321: histidine ammonia-lyase; HutH
35. mlr8322: N-formylglutamate amidohydrolase; HutG


Table 4. Set of genes belonging to nitrogen fixation functional category in M. loti.

1. mll1670: NtrR
2. mll1671: NtrP
3. mll3694: transcriptional regulator, similar to FixK-Bradyrhizobium japonicum
4. mll4698: two-component system histidine protein kinase (FixL like)
5. mll5421: aminotransferase; NifS
6. mll5837: Nif-specific regulatory protein; NifA
7. mll5855: nitrogen fixation protein; NifB
8. mll5857: nif-specific regulatory protein; NifA
9. mll5860: nitrogen fixation protein; FixC
10. mll5861: nitrogen fixation protein; FixB
11. mll5862: nitrogen fixation protein; FixA
12. mll5864: nitrogenase stabilizer; NifW
13. mll5865: nitrogenase cofactor synthesis protein; NifS
14. mll5941: nitrogen fixation protein; NifU
15. mll6578: nitrogen fixation regulation protein; FixK
16. mll6606: two-component, nitrogen fixation regulatory protein; FixJ
17. mll6607: two-component, nitrogen fixation sensor protein; FixL
18. mll6624: nitrogen fixation protein; FixI
19. mll6625: nitrogen fixation protein; FixH
20. mll6626: nitrogen fixation protein; FixG
21. mll6628: cytochrome-c oxidase FixP chain
22. mll6629: cytochrome-c oxidase FixO chain
23. mll6630: cytochrome-c oxidase FixN chain
24. mll8252: nitrogen fixation protein gene
25. mlr0015: nitrogenase cofactor synthesis protein; NifS
26. mlr0021: NifS-like aminotransferase
27. mlr0396: nitrogen regulation protein; NifR3
28. mlr0397: nitrogen regulation protein; NirB
29. mlr0398: nitrogen assimilation regulatory protein; NtrC
30. mlr0399: nitrogen regulation protein; NtrY
31. mlr0400: nitrogen assimilation regulatory protein; NtrX
32. mlr2864: nitrate reductase large subunit
33. mlr3659: histidine protein kinase, similar to FixL
34. mlr5785: Nif-specific regulatory protein; NifA
35. mlr5871: nitrogen fixation protein; NifQ
36. mlr5905: nitrogenase iron protein; NifH
37. mlr5906: nitrogenase molybdenum-iron protein alpha chain; NifD
38. mlr5907: nitrogenase molybdenum-iron protein beta chain; NifK
39. mlr5908: nitrogenase molybdenum-cofactor synthesis protein; NifE
40. mlr5909: nitrogenase molybdenum-iron protein beta chain; NifN
41. mlr5911: nitrogenase molybdenum-iron protein; NifX
42. mlr6097: nitrogen assimilation control protein
43. mlr6411: cytochrome-c oxidase FixN chain
44. mlr6412: cytochrome-c oxidase FixO chain
45. mlr6414: cytochrome-c oxidase FixP chain
46. mlr6415: nitrogen fixation protein; FixG
47. mlr6416: nitrogen fixation protein; FixH
48. mlr6417: nitrogen fixation protein; FixI
49. mlr7805: homocitrate synthase NifV candidate
50. msl5859: ferredoxin like protein; FixX
51. msl6623: nitrogen fixation protein; FixS
52. msl6627: nitrogen fixation protein; FixQ
53. msr6413: cytochrome-c oxidase FixQ chain
54. msr6418: nitrogen fixation protein; FixS


Table 5. List of identified known transcription factor/ binding site(s) in a set of M. loti symbiosis family genes upstream sequences.

| S.No. | Gene | Description | Site/TF | Length | Position | Score | Gaps | Occurrence | E-value |
|---|---|---|---|---|---|---|---|---|---|
| 1. | mll1143 | Nodulation protein N | CAP/CRP-gal-1 | 11 | 109 | 8 | 0 | 1 | 4.05e-03 |
| 2. | mll4296 | Ferric leghemoglobin reductase-2 precursor, dihydrolipoamide dehydrogenase | CAP/CRP-gal-1 | 11 | 40 | 8 | 0 | 1 | 1.20e-03 |
| | | | CAP/CRP-CS-1 | 11 | 40 | 8 | 0 | 1 | 1.20e-03 |
| | | | CAP/CRP-ara-2 | 11 | 40 | 11 | 0 | 1 | 3.52e-05 |
| | | | CAP/CRP-gal-2 | 11 | 40 | 8 | 0 | 1 | 1.20e-03 |
| | | | CAP/CRP-lac | 7 | 44 | 7 | 0 | 1 | 9.47e-03 |
| | | | ExsACS(1) | 8 | 63 | 6 | 0 | 1 | 3.69e-02 |
| | | | ExsACS(2) | 8 | 63 | 6 | 0 | 1 | 3.69e-02 |
| 3. | mll4979 | Similar to MocC (rhizopine catabolism), also similar to myo-inositol catabolism; IolE | MalT-MalPp | 6 | 152 | 6 | 0 | 1 | 8.50e-02 |
| | | | MalT-CS | 6 | 152 | 6 | 0 | 1 | 1.63e-01 |
| 4. | mll5922 | GDP-D-mannose dehydratase; nodulation protein; NoeL | Nitrogen regula | 7 | 126 | 7 | 0 | 1 | 4.71e-02 |
| | | | CAP/CRP-lac | 7 | 175 | 7 | 0 | 1 | 4.71e-02 |
| 5. | mll6337 | Nodulation protein; NolX | PhoP consensus | 6 | 51 | 6 | 0 | 1 | 2.46e-01 |
| 6. | mll6338 | Nodulation protein; NolW | MalT-CS | 6 | 197 | 6 | 0 | 1 | 1.92e-01 |
| 7. | mll7567 | Nodulation protein NoeK, phosphomannomutase | ExsACS(1) | 8 | 152 | 6 | 0 | 1 | 1.73e-01 |
| | | | ExsACS(2) | 8 | 152 | 6 | 0 | 1 | 1.73e-01 |
| | | | PhoP consensus | 6 | 156 | 6 | 0 | 2 | 4.35e-01 |
| | | | PhoP consensus | 6 | 373 | 6 | 0 | 1 | 4.35e-01 |
| 8. | mll9683 | AtsE | Lambda-C | 8 | 176 | 8 | 0 | 1 | 7.26e-02 |
| | | | MalT-CS | 6 | 201 | 6 | 0 | 1 | 2.62e-01 |
| 9. | mlr2192 | O-acetyltransferase, NodL candidate | MalT-MalPp | 6 | 386 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 386 | 6 | 0 | 1 | 3.21e-01 |
| 10. | mlr3097 | Nodulation protein NodN-Rhizobium leguminosarum | MalT-MalPp | 6 | 33 | 6 | 0 | 1 | 1.69e-02 |
| | | | MalT-CS | 6 | 33 | 6 | 0 | 1 | 3.36e-02 |
| 11. | mlr5821 | Nodulation protein; NodF | MalT-CS | 6 | 269 | 6 | 0 | 1 | 2.44e-01 |
| 12. | mlr5822 | Nodulation protein; NodE | MalT-CS | 6 | 84 | 6 | 0 | 1 | 3.21e-01 |
| 13. | mlr5848 | Nodulation protein; NodZ | PhoP consensus | 6 | 79 | 6 | 0 | 1 | 4.40e-01 |
| | | | NarL/NarP-CS | 7 | 154 | 6 | 0 | 1 | 5.38e-01 |
| 14. | mlr6161 | Methyltransferase, nodulation protein; NodS | MalT-CS | 6 | 224 | 6 | 0 | 1 | 3.02e-01 |
| | | | Nod-box | 25 | 226 | 22 | 0 | 1 | 9.42e-11 |
| 15. | mlr6175 | Chitooligosaccharide deacetylase, nodulation protein; NodB | MomR/OxyR-CS | 9 | 162 | 9 | 0 | 1 | 1.20e-02 |
| 16. | mlr6386 | Glutamine-fructose-6-phosphate transaminase nodulation protein; NodM | MomR/OxyR-CS | 9 | 41 | 9 | 0 | 1 | 1.20e-02 |
| | | | MalT-MalPp | 6 | 134 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 134 | 6 | 0 | 2 | 3.21e-01 |
| | | | Lambda-C | 8 | 211 | 8 | 0 | 1 | 9.17e-02 |
| | | | PhoP consensus | 6 | 223 | 6 | 0 | 1 | 4.40e-01 |
| | | | MalT-CS | 6 | 255 | 6 | 0 | 1 | 3.21e-01 |
| | | | MalT-CS | 6 | 268 | 6 | 0 | 1 | 3.21e-01 |
| 17. | mlr7575 | Nodulation protein NodP, sulfate adenylate transferase, subunit 2 | CAP/CRP-lac | 7 | 304 | 7 | 0 | 1 | 4.71e-02 |
| 18. | mlr7780 | Similar to Rhizopine catabolism protein mocD | ExsACS(1) | 8 | 60 | 6 | 0 | 1 | 5.09e-02 |
| | | | ExsACS(2) | 8 | 60 | 6 | 0 | 1 | 5.09e-02 |
| 19. | mlr7850 | Nodulation protein nodG, 3-oxoacyl-(acyl carrier protein) reductase | PhoP consensus | 6 | 26 | 6 | 0 | 1 | 7.74e-02 |
| 20. | mlr8749 | GDP-L-fucose synthetase; nodulation protein; NolK | PhoP consensus | 6 | 66 | 6 | 0 | 1 | 9.35e-02 |
| 21. | mlr8755 | Acyltransferase, nodulation protein; NodA | MomR/OxyR-CS | 9 | 172 | 9 | 0 | 1 | 1.20e-02 |
| | | | MalT-MalPp | 6 | 230 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 230 | 6 | 0 | 1 | 3.21e-01 |
| | | | PhoP consensus | 6 | 387 | 6 | 0 | 1 | 4.40e-01 |


Table 6. List of identified known transcription factor/ binding site(s) in a set of M. loti nitrogen metabolism family genes upstream sequences.

| S.No. | Gene | Description | Site/TF | Length | Position | Score | Gaps | Occurrence | E-value |
|---|---|---|---|---|---|---|---|---|---|
| 1. | mll0345 | Nitrogen regulatory protein P-II | MalT-MalPp | 6 | 57 | 6 | 0 | 1 | 1.58e-01 |
| | | | MalT-CS | 6 | 57 | 6 | 0 | 1 | 2.91e-01 |
| | | | Nitrogen regula | 7 | 178 | 7 | 0 | 1 | 4.19e-02 |
| | | | MomR/OxyR-CS | 9 | 201 | 9 | 0 | 1 | 1.06e-02 |
| 2. | mll1732 | Naphthalene dioxygenase ferredoxin | Nitrogen regula | 7 | 13 | 7 | 0 | 1 | 2.92e-03 |
| 3. | mll4247 | Nitrogen regulatory protein P-II; GlnK | PhoP consensus | 6 | 192 | 6 | 0 | 2 | 3.72e-01 |
| | | | PhoP consensus | 6 | 203 | 6 | 0 | 1 | 3.72e-01 |
| 4. | mll5100 | Ferredoxin [2Fe-2S] I | MalT-CS | 6 | 10 | 6 | 0 | 1 | 1.55e-02 |
| 5. | mlr5869 | Ferredoxin 2[4Fe-4S] III; FdxB | PhoP consensus | 6 | 395 | 6 | 0 | 1 | 4.40e-01 |
| 6. | mlr5930 | Ferredoxin 2[4Fe-4S] III; FdxB | MalT-CS | 6 | 212 | 6 | 0 | 1 | 3.21e-01 |
| 7. | mlr7139 | Ornithine cyclodeaminase | MalT-MalPp | 6 | 86 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 86 | 6 | 0 | 1 | 3.21e-01 |
| 8. | mlr7628 | Putative Rieske-like ferredoxin; MocE | MalT-MalPp | 6 | 22 | 6 | 0 | 1 | 5.14e-02 |
| | | | MalT-CS | 6 | 22 | 6 | 0 | 1 | 1.00e-01 |


Table 7. List of identified known transcription factor/ binding site(s) in a set of M. loti Glutamate family genes upstream sequences.

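| S.No. | Gene | Description | Site/TF | Length | Position | Score | Gaps | Occurrence | E-value |
|---|---|---|---|---|---|---|---|---|---|
| 1. | mll0151 | Histidine ammonia-lyase | TATA-box-1 | 6 | 18 | 6 | 0 | 1 | 1.21e-01 |
| 2. | mll0343 | Glutamine synthetase I | MalT-MalPp | 6 | 9 | 6 | 0 | 1 | 2.74e-02 |
| | | | MalT-CS | 6 | 9 | 6 | 0 | 2 | 5.41e-02 |
| | | | MalT-CS | 6 | 23 | 6 | 0 | 1 | 5.41e-02 |
| | | | ExsACS(1) | 8 | 30 | 6 | 0 | 1 | 2.65e-02 |
| | | | ExsACS(2) | 8 | 30 | 6 | 0 | 1 | 2.65e-02 |
| 3. | mll1646 | Glutamate synthase beta subunit | ExsACS(1) | 8 | 7 | 6 | 0 | 2 | 5.46e-02 |
| | | | ExsACS(2) | 8 | 7 | 6 | 0 | 2 | 5.46e-02 |
| | | | ExsACS(1) | 8 | 18 | 6 | 0 | 1 | 5.46e-02 |
| | | | ExsACS(2) | 8 | 18 | 6 | 0 | 1 | 5.46e-02 |
| | | | ExsACS(1) | 8 | 68 | 6 | 0 | 1 | 5.46e-02 |
| | | | ExsACS(2) | 8 | 68 | 6 | 0 | 1 | 5.46e-02 |
| 4. | mll3030 | Glutamate synthase, large subunit | PhoP consensus | 6 | 105 | 6 | 0 | 1 | 4.40e-01 |
| | | | MalT-MalPp | 6 | 388 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 388 | 6 | 0 | 1 | 3.21e-01 |
| 5. | mll7254 | Glutamine synthetase | ExsACS(1) | 8 | 143 | 6 | 0 | 1 | 1.19e-01 |
| | | | ExsACS(2) | 8 | 143 | 6 | 0 | 1 | 1.19e-01 |
| 6. | mlr0039 | Glutamate racemase | MalT-CS | 6 | 35 | 6 | 0 | 1 | 3.21e-01 |
| 7. | mlr1730 | Histidine ammonia-lyase | Nitrogen regula | 7 | 165 | 7 | 0 | 1 | 4.71e-02 |
| 8. | mlr6209 | Histidine decarboxylase | ExsACS(1) | 8 | 250 | 6 | 0 | 1 | 1.75e-01 |
| | | | ExsACS(2) | 8 | 250 | 6 | 0 | 1 | 1.75e-01 |
| | | | PhoP consensus | 6 | 370 | 6 | 0 | 1 | 4.40e-01 |
| 9. | mlr6298 | Gamma-glutamyl kinase | MalT-CS | 6 | 64 | 6 | 0 | 1 | 1.45e-01 |
| 10. | mlr8321 | Histidine ammonia-lyase; HutH | MalT-CS | 6 | 70 | 6 | 0 | 1 | 3.21e-01 |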


Table 8. List of identified known transcription factor/ binding site(s) in a set of M. loti Nitrogen fixation family genes upstream.

| S.No. | Gene | Description | Site/TF | Length | Position | Score | Gaps | Occurrence | E-value |
|---|---|---|---|---|---|---|---|---|---|
| 1. | mll3694 | Transcriptional regulator, similar to FixK-Bradyrhizobium japonicum | MalT-MalPp | 6 | 208 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 208 | 6 | 0 | 1 | 3.21e-01 |
| | | | ExsACS(1) | 8 | 308 | 6 | 0 | 1 | 1.75e-01 |
| | | | ExsACS(2) | 8 | 308 | 6 | 0 | 1 | 1.75e-01 |
| 2. | mll4698 | Two-component system histidine protein kinase (FixL like) | CAP/CRP-lac | 7 | 213 | 7 | 0 | 1 | 4.27e-02 |
| | | | Nitrogen regula | 7 | 342 | 7 | 0 | 1 | 4.27e-02 |
| 3. | mll5855 | Nitrogen fixation protein; NifB | PhoP consensus | 6 | 226 | 6 | 0 | 1 | 3.03e-01 |
| 4. | mll5857 | Nif-specific regulatory protein; NifA | PhoP consensus | 6 | 58 | 6 | 0 | 1 | 2.10e-01 |
| | | | ExsACS(1) | 8 | 77 | 6 | 0 | 2 | 7.47e-02 |
| | | | ExsACS(2) | 8 | 77 | 6 | 0 | 2 | 7.47e-02 |
| | | | ExsACS(1) | 8 | 100 | 6 | 0 | 1 | 7.47e-02 |
| | | | ExsACS(2) | 8 | 100 | 6 | 0 | 1 | 7.47e-02 |
| | | | MalT-MalPp | 6 | 122 | 6 | 0 | 1 | 7.56e-02 |
| | | | MalT-CS | 6 | 122 | 6 | 0 | 1 | 1.45e-01 |
| 5. | mll5864 | Nitrogenase stabilizer; NifW | MalT-MalPp | 6 | 124 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 124 | 6 | 0 | 1 | 3.21e-01 |
| | | | MalT-MalPp | 6 | 246 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 246 | 6 | 0 | 1 | 3.21e-01 |
| 6. | mll6624 | Nitrogen fixation protein; FixI | MalT-CS | 6 | 221 | 6 | 0 | 1 | 3.21e-01 |
| 7. | mll6630 | Cytochrome-c oxidase FixN chain | ExsACS(1) | 8 | 50 | 6 | 0 | 1 | 1.08e-01 |
| | | | ExsACS(2) | 8 | 50 | 6 | 0 | 1 | 1.08e-01 |
| | | | PhoP consensus | 6 | 54 | 6 | 0 | 1 | 2.91e-01 |
| | | | Nitrogen regula | 7 | 106 | 7 | 0 | 1 | 2.81e-02 |
| 8. | mll8252 | Nitrogen fixation protein gene | CAP/CRP-ara-4 | 11 | 283 | 8 | 0 | 1 | 6.33e-03 |
| | | | CAP/CRP-ara-5 | 11 | 283 | 8 | 0 | 1 | 6.33e-03 |
| 9. | mlr0396 | Nitrogen regulation protein; NifR3 | CAP/CRP-lac | 7 | 28 | 7 | 0 | 1 | 2.35e-02 |
| | | | PhoP consensus | 6 | 66 | 6 | 0 | 1 | 2.50e-01 |
| | | | PhoP consensus | 6 | 85 | 6 | 0 | 1 | 2.50e-01 |
| 10. | mlr0398 | Nitrogen assimilation regulatory protein; NtrC | MalT-MalPp | 6 | 188 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 188 | 6 | 0 | 1 | 3.21e-01 |
| 11. | mlr0399 | Nitrogen regulation protein; NtrY | CAP/CRP-gal | 11 | 59 | 8 | 0 | 1 | 3.09e-03 |
| | | | CAP/CRP-CS1 | 11 | 59 | 8 | 0 | 1 | 3.09e-03 |
| | | | CAP/CRP-gal2 | 11 | 59 | 8 | 0 | 1 | 3.09e-03 |
| 12. | mlr0400 | Nitrogen assimilation regulatory protein; NtrX | MalT-MalPp | 6 | 243 | 6 | 0 | 1 | 1.76e-01 |
| | | | MalT-CS | 6 | 243 | 6 | 0 | 1 | 3.21e-01 |
| | | | MalT-CS | 6 | 320 | 6 | 0 | 1 | 3.21e-01 |
| 13. | mlr3659 | Histidine protein kinase, similar to FixL | PhoP consensus | 6 | 27 | 6 | 0 | 1 | 1.58e-01 |
| 14. | mlr5905 | Nitrogenase iron protein; NifH | PhoP consensus | 6 | 213 | 6 | 0 | 1 | 4.13e-01 |
| 15. | mlr5911 | Nitrogenase molybdenum-iron protein; NifX | Nitrogen regula | 7 | 101 | 7 | 0 | 1 | 4.71e-02 |
| 16. | mlr6097 | Nitrogen assimilation control protein | MalT-MalPp | 6 | 221 | 6 | 0 | 1 | 1.58e-01 |
| | | | MalT-CS | 6 | 221 | 6 | 0 | 1 | 2.91e-01 |
| 17. | mlr6411 | Cytochrome-c oxidase FixN chain | CAP/CRP-gal3 | 11 | 36 | 8 | 0 | 1 | 1.94e-03 |
| 18. | mlr6416 | Nitrogen fixation protein; FixH | PhoP consensus | 6 | 130 | 6 | 0 | 1 | 4.40e-01 |
| 19. | mlr6417 | Nitrogen fixation protein; FixI | MalT-CS | 6 | 221 | 6 | 0 | 1 | 3.21e-01 |
| 20. | mlr7805 | Homocitrate synthase NifV candidate | NR(I)-4 | 12 | 229 | 9 | 0 | 1 | 1.72e-03 |
| 21. | msl6623 | Nitrogen fixation protein; FixS | PhoP consensus | 6 | 208 | 6 | 0 | 1 | 4.40e-01 |


Table 9. In M. loti, detected known regulatory factors/ binding site(s) in all four studied set of genes belonging to functional category of Symbiosis, Nitrogen metabolism, Glutamate family and Nitrogen fixation respectively.

| S. No. | Regulatory family | Regulatory factor | Length | Known TF binding site | Symbiosis | N2-metabolism | Glutamate family | N2-fixation |
|---|---|---|---|---|---|---|---|---|
| 1. | CAP/CRP | CRP-gal-1 | 11 | AAGATGCGAAA | mll1143, mll4296 | -- | -- | -- |
| | | CAP/CRP-gal-2 | 11 | AAAGTGTGACA | mll4296 | -- | -- | mlr0399 |
| | | CRP-gal-3 | 11 | TCCATGTCACA | -- | -- | -- | mlr6411 |
| | | CRP-ara | 11 | AAAGCGCTACA | mll4296 | -- | -- | mll8252 |
| | | CAP/CRP-lac | 7 | ACACTTT | mll5922, mlr7780, mll4296, mlr7575 | -- | -- | mll4698, mlr0396 |
| 2. | MalT | MalT-CS protein | 6 | GGAKGA | mll4979, mll6337, mlr2192, mll9683, mlr3097, mlr5821, mlr5822, mlr6161, mlr6386, mlr8757, mll6338, mlr8755 | mll0345, mll5100, mlr5930, mlr7139, mlr7628 | mll0343, mll3030, mlr0039, mlr6298, mlr8321 | mll3694, mll5857, mll5864, mll6624, mlr0398, mlr0400, mlr6097, mlr6417 |
| 3. | PhoP | PhoP_cons protein | 6 | TTHACA | mll6337, mll7567, mlr5848, mlr6386, mlr7850, mlr8749, mlr8757, mlr8755 | mll4247, mlr5869 | mll3030, mlr6209 | mll5855, mll5857, mll6630, mlr0396, mlr3659, mlr5905, mlr6416, msl6623 |
| 4. | ExsA | ExsA-Cs protein | 8 | TNAAAANA | mll7567, mlr7780, mll4296 | -- | mll0343, mll1646, mll7254, mlr6209 | mll3694, mll5857, mll6630 |
| 5. | NOD | NOD-Box protein | 25 | ATCCAAACAATCRATTTTACCAATC | mlr6161 | -- | -- | -- |
| 6. | MomR/oxyR | MomR/oxyR CS protein | 9 | ATGCATCRW | mlr6175, mlr6386, mlr8757, mlr8755 | mll0345 | -- | -- |
| 7. | Nitrogen regulatory site* | Unknown TF | 7 | TTTTGCA | mll5922 | mll0345, mll1732 | mlr1730 | mll4698, mll6630, mlr5911 |
| 8. | MALT-MALPp site* | Unknown TF | 6 | TCCTCC | mlr2192, mlr3097, mlr6386, mlr8757, mll4979, mlr8755 | mll0345, mlr7139, mlr7628 | mll0343, mll3030 | mll3694, mll5864, mlr0398, mlr0400, mlr6097, mll5857 |
| 9. | Lambda site* | Unknown TF | 8 | GGYGTRYG | mll9683, mlr6386 | -- | -- | -- |
| 10. | TATA box | TATA-box protein | 6 | TATAAT | -- | -- | mll0151 | -- |
| 11. | NR(I) | NR-I protein | 12 | GCACGATGGTGC | -- | -- | -- | mlr7805 |
| 12. | NarL/NarP | NarL/NarP-CS protein | 7 | TACYNMT | mlr5848 | -- | -- | -- |

Table 10. Details of known regulatory transcription factor (TF)/ binding site(s) detected in studied genes set of M. loti.

*= Known sites with unknown TF

| S. No. | TF factor/site* | Reference | Known site | Length | Description |
|---|---|---|---|---|---|
| 1 | CAP/CRP | Taniguchi et al. (1979) | AAGATGCGAAA, AAAGTGTGACA, TCCATGTCACA, AAAGCGCTACA, ACACTTT | 11, 11, 11, 11, 7 | Binding site for CAP/CRP complex in lac-operon of E. coli. cAMP-CRP plays an important role in transcription initiation. |
| 2 | MalT-CS | Raibaud et al. (1985) | GGAKGA | 6 | Binding site of MalT on malPp promoter in E. coli. |
| 3 | PhoP | Eder et al. (1999) | TTHACA | 6 | Controls the phosphate deficiency response in Bacillus subtilis, and is required for activation or repression of Pho regulon genes. |
| 4 | ExsA | Hovey AK & Frank DW (1995) | TNAAAANA | 8 | A central regulator of exoenzyme S production by Pseudomonas aeruginosa. |
| 5 | NOD-Box | Fisher et al. (1988) | ATCCAAACAATCRATTTTACCAATC | 25 | Binding site of NodD protein, upstream of inducible nodulation genes in Rhizobium meliloti. |
| 6 | MomR/oxyR CS | Bolker M & Kahmann R (1989) | ATGCATCRW | 9 | The E. coli regulatory protein OxyR discriminates between methylated and unmethylated states of the phage Mu mom promoter. |
| 7 | Nitrogen-regulatory site* | Ow et al. (1983) | TTTTGCA | 7 | Promoters regulated by the glnG (ntrC) and nifA gene products. Known site only. |
| 8 | Lambda-C site* | Zhou et al. (2002) | GGYGTRYG | 8 | Transcription antitermination by the bacteriophage lambda N protein is stimulated in vitro by the Escherichia coli NusG protein. Known site only. |
| 9 | TATA-Box | Siegele et al. (1989) | TATAAT | 6 | Binding site of RNA polymerase, sigma70 subunit in E. coli. |
| 10 | NR(I) | Reitzer LJ & Magasanik B (1986) | GCACGATGGTGC | 12 | A regulatory protein; stimulates transcription at the N2-regulated promoter glnAp2, acting as an enhancer in E. coli. |
| 11 | NarL/NarP-CS | Householder et al. (1999) | TACYNMT | 7 | The gonococcal FNR and NarP homologs are involved in the regulation of aniA. AniA (formerly Pan1) is the major anaerobically induced outer membrane protein in Neisseria gonorrhoeae. |
| 12 | MalT_malPp* | Raibaud et al. (1985) | TCCTCC | 6 | Conserved known site in malPp, a positively controlled promoter in Escherichia coli. Known site only. Binding site of MalT on malPp promoter in E. coli. |
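Several of the known sites in Tables 9 and 10 are written with IUPAC degenerate codes (for example K = G or T, H = A, C or T, N = any base). As a minimal sketch of how such a consensus could be located in an upstream sequence, the Python snippet below translates a consensus into a regular expression and reports exact matches on one strand. It is not the scoring scheme of the site-scanning tool used in the study, which also reports partial matches (see the Score column of Tables 5 to 8); the function names and the example sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus such as GGAKGA into a regex string."""
    return "".join(IUPAC[base] for base in consensus.upper())

def find_sites(consensus, sequence):
    """Return (1-based start, matched word) for every exact match of the
    consensus on the given strand, including overlapping matches."""
    # A zero-width lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]

# Hypothetical upstream fragment, invented for illustration only.
upstream = "ccgtGGATGAttcaTTCACAgg"
print(find_sites("GGAKGA", upstream))  # MalT-CS consensus from Table 10
print(find_sites("TTHACA", upstream))  # PhoP consensus from Table 10
```

Scanning the reverse strand would require running the same search on the reverse complement of the sequence, which this sketch leaves out.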


Table 11. Potential upstream regulatory hexanucleotide patterns identified in all M. loti gene sets through oligonucleotide analysis.

| Functional gene category | Hexanucleotide pattern | Sequence ID | Strand | Start position | End position | Matching word | Score |
|---|---|---|---|---|---|---|---|
| Glutamate family | AAACGT | mll0343 | D | -44 | -39 | caaaAAACGTcatc | 1.12 |
| | | mll0343 | D | -27 | -22 | caaaAAACGTcatc | 1.12 |
| | | mll0601 | D | -19 | -14 | gtttAAACGTgatg | 1.12 |
| | | mll1646 | D | -44 | -39 | aataAAACGTgcta | 1.12 |
| | | mll3030 | R | -348 | -343 | gagaAAACGTaagc | 1.12 |
| | ATAAAA | mll1560 | R | -24 | -19 | cgcaATAAAAtagc | 2.70 |
| | | mll1646 | D | -55 | -50 | attgATAAAAaaat | 2.70 |
| | | mll1646 | D | -47 | -42 | aaaaATAAAAcgtg | 2.70 |
| | | mll1646 | R | -101 | -96 | cttgATAAAAacat | 2.70 |
| | | mll3030 | D | -260 | -255 | tgccATAAAAattt | 2.70 |
| | | mll3030 | R | -324 | -319 | cgacATAAAAtttc | 2.70 |
| | GGGATA | mll3040 | R | -161 | -156 | ggacGGGATAtc | 0.42 |
| | | mll3074 | R | -11 | -6 | gaaaGGGATAaacg | 0.42 |
| | | mll7254 | D | -54 | -49 | atcaGGGATAgcgc | 0.42 |
| | | mll7254 | R | -141 | -136 | agctGGGATAtcgg | 0.42 |
| | GGCAGA | mll9226 | D | -79 | -74 | ctcaGGCAGAacgg | 0.32 |
| | | mlr0039 | D | -38 | -33 | ggaaGGCAGAcgcc | 0.32 |
| | | mlr0039 | D | -385 | -380 | ttgaGGCAGAtttg | 0.32 |
| | | mlr0039 | D | -56 | -51 | cgacGGCAGAcgtg | 0.32 |
| | | mlr1730 | D | -124 | -119 | ggcaGGCAGAgcga | 0.32 |
| | | mlr1730 | R | -375 | -370 | ctccGGCAGAcgcg | 0.32 |
| | | mlr3506 | R | -44 | -39 | aaatGGCAGAcgga | 0.32 |
| | | mlr6209 | D | -354 | -349 | tggaGGCAGAacct | 0.32 |
| Nitrogen fixation | AATTCG | mll3694 | D | -358 | -353 | ggcgAATTCGagcg | 0.28 |
| | | mll3694 | D | -199 | -194 | gaggAATTCGtcct | 0.28 |
| | | mll3694 | D | -179 | -174 | caagAATTCGcaat | 0.28 |
| | | mll3694 | D | -111 | -106 | tcgcAATTCGgcgc | 0.28 |
| | | mll3694 | R | -360 | -355 | ctcgAATTCGccgc | 0.28 |
| | | mll3694 | R | -309 | -304 | acgcAATTCGacgc | 0.28 |
| | | mll4698 | R | -297 | -292 | gttcAATTCGcagg | 0.28 |
| | | mll4698 | R | -268 | -263 | ctacAATTCGtcgg | 0.28 |
| | CAGGGA | mlr3659 | D | -7 | -2 | caagCAGGGAt | 0.10 |
| | | mlr5871 | R | -282 | -277 | cgtcCAGGGAgagc | 0.10 |
| | | mlr5906 | D | -92 | -87 | ccgaCAGGGAggcg | 0.10 |
| | | mlr5907 | D | -111 | -106 | caggCAGGGAggcc | 0.10 |
| | | mlr5907 | R | -122 | -117 | ctgcCAGGGAggcc | 0.10 |
| | | mlr5907 | R | -46 | -41 | agcaCAGGGAcatc | 0.10 |
| | CGATCG | msl6623 | D | -340 | -335 | gtctCGATCGcacc | 1.39 |
| | | msl6623 | D | -205 | -200 | atcgCGATCGccat | 1.39 |
| | | msl6623 | D | -133 | -128 | gcggCGATCGccat | 1.39 |
| | | msl6623 | D | -8 | -3 | ggccCGATCGtc | 1.39 |
| | | msr6418 | D | -337 | -332 | gtctCGATCGcgcc | 1.39 |
| | | msr6418 | D | -196 | -191 | attgCGATCGtcta | 1.39 |
| | | msr6418 | D | -130 | -125 | gcggCGATCGcgat | 1.39 |
| Nitrogen metabolism | GAGCAC | mll1423 | D | -341 | -336 | gggcGAGCACatgc | 1.66 |
| | | mll1423 | R | -282 | -277 | gcagGAGCACagcg | 1.66 |
| | | mll1423 | R | -260 | -255 | ggccGAGCACcgac | 1.66 |
| | | mll1423 | R | -188 | -183 | cctcGAGCACgccg | 1.66 |
| | | mll4247 | D | -190 | -185 | ttttGAGCACatgg | 1.66 |
| | | mll4247 | R | -239 | -234 | ctttGAGCACgatc | 1.66 |
| | | mlr1320 | D | -150 | -145 | gaaaGAGCACcccg | 1.66 |
| Symbiosis | CCCCCA | mll1107 | R | -95 | -90 | agtgCCCCCAgtct | 0.64 |
| | | mll4979 | D | -171 | -166 | gcagCCCCCAcctc | 0.64 |
| | | mll4979 | D | -111 | -106 | cgcaCCCCCAcccc | 0.64 |
| | | mll4979 | D | -47 | -42 | cggaCCCCCAcaag | 0.64 |
| | | mll4979 | R | -128 | -123 | ccgaCCCCCAcccg | 0.64 |
| | CCCCAC | mll4979 | D | -170 | -165 | cagcCCCCACctcc | 0.62 |
| | | mll4979 | D | -110 | -105 | gcacCCCCACcccg | 0.62 |
| | | mll4979 | D | -46 | -41 | ggacCCCCACaagg | 0.62 |
| | | mll4979 | R | -155 | -150 | acctCCCCACaagg | 0.62 |
| | | mll4979 | R | -129 | -124 | cgacCCCCACccgg | 0.62 |
| | ATTACC | mll9683 | R | -291 | -286 | catgATTACCgcga | 0.15 |
| | | mlr2437 | D | -307 | -302 | cccgATTACCgtga | 0.15 |
| | | mlr5801 | D | -392 | -387 | atccATTACCcaag | 0.15 |
| | | mlr5801 | D | -38 | -33 | caacATTACCccac | 0.15 |
| | AGCTTG | mlr6175 | R | -129 | -124 | atgcAGCTTGcgcc | 0.19 |
| | | mlr6175 | R | -101 | -96 | ggggAGCTTGtcgc | 0.19 |
| | | mlr6341 | D | -322 | -317 | cttaAGCTTGtctc | 0.19 |
| | | mlr6622 | R | -207 | -202 | tgtcAGCTTGctc | 0.19 |
| | | mlr6622 | R | -178 | -173 | tcgcAGCTTGagct | 0.19 |
| | | mlr7575 | D | -336 | -331 | gcggAGCTTGcagt | 0.19 |
| | | mlr7575 | D | -78 | -73 | gcgaAGCTTGaacc | 0.19 |
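The layout of Table 11 (strand D or R, negative start and end positions, and a matching word with the hexanucleotide in upper case between lower-case flanks) can be illustrated with a short Python sketch. It assumes that positions are counted back from the end of the upstream sequence, with the last base at -1, which is consistent with rows such as mlr3659 (-7 to -2, one flanking base after the match). How reverse-strand coordinates and flanks are reported by the original tool is an assumption here, and `scan_upstream` is a hypothetical helper, not a function of that tool.

```python
# Translation table for complementing DNA while preserving letter case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_upstream(pattern, upstream, flank=4):
    """List (strand, start, end, matching word) for a pattern on both strands.

    Coordinates are negative offsets from the sequence end (last base = -1);
    this convention is assumed from the rows of Table 11.
    """
    seq = upstream.lower()
    n = len(seq)
    direct = pattern.lower()
    reverse = reverse_complement(direct)
    strands = [("D", direct)]
    if reverse != direct:  # a palindromic pattern is scanned only once
        strands.append(("R", reverse))
    hits = []
    for strand, word in strands:
        pos = seq.find(word)
        while pos != -1:
            stop = pos + len(word)
            # Upper-case the match and keep up to `flank` bases on each side.
            context = seq[max(0, pos - flank):pos] + seq[pos:stop].upper() + seq[stop:stop + flank]
            if strand == "R":
                context = reverse_complement(context)
            hits.append((strand, pos - n, stop - 1 - n, context))
            pos = seq.find(word, pos + 1)
    return hits

# Invented sequence whose last 11 bases copy the mlr3659 matching word.
print(scan_upstream("CAGGGA", "gtgtgtcaagcagggat"))
```

The palindrome check matters for patterns such as CGATCG, which would otherwise be reported twice at the same position.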


Table 12. Clustering of M. loti genes with the corresponding upstream hexanucleotide patterns identified in each functional category gene set, with significant statistical values.

| Functional category | Genes | Pattern | Observed frequency | Expected frequency | Occ | Sig value | Ms |
|---|---|---|---|---|---|---|---|
| 1. Symbiosis | mll1107, mll4979 | CCCCCA | 0.00343 | 0.000317 | 5 | 0.64 | 2 |
| | mll4979* | CCCCAC | 0.00343 | 0.000321 | 5 | 0.62 | 1 |
| | mll9683*, mlr2437, mlr5801 | ATTACC | 0.00210 | 0.00017 | 4 | 0.15 | 3 |
| | mlr6175*, mlr6341, mlr6622, mlr7575 | AGCTTG | 0.003419 | 0.000618 | 7 | 0.19 | 4 |
| 2. Nitrogen metabolism | mll1423, mll4247, mlr1320 | GAGCAC | 0.004149 | 0.000435 | 7 | 1.66 | 3 |
| 3. Glutamic acid family | mll3030*, mll1646*, mll0343*, mll0601 | AAACGT | 0.004039 | 0.000295 | 5 | 1.12 | 4 |
| | mll3030*, mll1560, mll1646* | ATAAAA | 0.00485 | 0.000257 | 6 | 2.70 | 3 |
| | mll3040, mll3074, mll7254* | GGGATA | 0.004124 | 0.000286 | 4 | 0.42 | 3 |
| | mll9226, mlr0039*, mlr0339, mlr1730*, mlr3506, mlr6209* | GGCAGA | 0.003269 | 0.000654 | 8 | 0.32 | 6 |
| 4. Nitrogen fixation | mll3694*, mll4698* | AATTCG | 0.003812 | 0.000667 | 7 | 0.28 | 2 |
| | mlr3659*, mlr5871, mlr5906, mlr5907 | CAGGGA | 0.003193 | 0.000497 | 6 | 0.10 | 4 |
| | msl6623*, msr6418 | CGATCG | 0.008480 | 0.000812 | 6 | 1.39 | 2 |

Abbreviations: ms = number of matching sequences, i.e. the number of sequences from the family that contain at least one occurrence of the pattern; occ = number of occurrences of the pattern among all upstream regions from the family; sig = significance coefficient or index value. An asterisk (*) marks a significant gene, i.e. one with a known TF/binding site in its upstream sequence.
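The difference between occ and ms can be made concrete with a small Python sketch. It counts a hexanucleotide together with its reverse complement, as the paired identifiers in Table 13 suggest, and returns both totals for a gene family. This is a simplified illustration: overlapping occurrences are counted here, whereas the original analysis discards them (ovl_occ), and the gene names and sequences below are invented.

```python
# Translation table for complementing upper-case DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def count_occ_and_ms(pattern, sequences):
    """Return (occ, ms) for a pattern grouped with its reverse complement.

    occ = total occurrences over all sequences of the family,
    ms  = number of sequences with at least one occurrence.
    """
    word = pattern.upper()
    # A set collapses palindromic patterns to a single word.
    words = {word, reverse_complement(word)}
    k = len(word)
    occ = 0
    ms = 0
    for seq in sequences.values():
        s = seq.upper()
        hits = sum(1 for i in range(len(s) - k + 1) if s[i:i + k] in words)
        occ += hits
        ms += 1 if hits else 0
    return occ, ms

# Hypothetical family of three upstream sequences.
family = {
    "geneA": "ttCCCCCAggTGGGGGaa",  # one direct and one reverse-strand hit
    "geneB": "acgtacgt",            # no hit
    "geneC": "CCCCCA",              # one direct hit
}
occ, ms = count_occ_and_ms("CCCCCA", family)
print(occ, ms)
```

A pattern concentrated in a single gene, such as CCCCAC in mll4979 (occ = 5, ms = 1 in Table 12), has a high occ but a low ms.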


Table 13. Statistical details of the M. loti hexanucleotides obtained for all functional category gene sets.

| Functional gene category | S.No. | Seq. | Identifier | Observed frequency | Expected frequency | Occ | Exp_occ | Occ_P | Occ_E | Occ_sig | Z score | ratio | ms | Exp_ms | Ms_P | Ms_E | Ms_sig | Ms_freq | Exp_msf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Glutamic acid family | 1. | ataaaa | ataaaa/ttttat | 0.004846 | 0.0002574 | 6 | 0.31 | 9.5e-07 | 2.0e-03 | 2.70 | 10.20 | 18.83 | 3 | 0.31 | 0.00314 | 6.5 | -0.8 | 0.3 | 0.03137 |
| | 2. | aaacgt | aaacgt/acgttt | 0.004038 | 0.0002959 | 5 | 0.36 | 3.7e-05 | 7.6e-02 | 1.12 | 7.75 | 13.65 | 4 | 0.36 | 0.00030 | 0.6 | 0.2 | 0.4 | 0.03597 |
| | 3. | gggata | gggata/tatccc | 0.004123 | 0.0002863 | 4 | 0.27 | 0.00018 | 3.8e-01 | 0.42 | 7.15 | 14.40 | 3 | 0.27 | 0.00213 | 4.4 | -0.6 | 0.3 | 0.02736 |
| | 4. | ggcaga | ggcaga/tctgcc | 0.003269 | 0.0006542 | 8 | 1.57 | 0.00023 | 4.8e-01 | 0.32 | 5.12 | 5.00 | 6 | 1.48 | 0.00129 | 2.7 | -0.4 | 0.6 | 0.14798 |
| Nitrogen fixation | 5. | aattcg | aattcg/cgaatt | 0.003813 | 0.0006673 | 7 | 1.20 | 0.00025 | 5.2e-01 | 0.28 | 5.29 | 5.71 | 2 | 1.15 | 0.3231 | 6.7e+02 | -2.8 | 0.2 | 0.1152 |
| | 6. | caggga | caggga/tccctg | 0.003193 | 0.0004970 | 6 | 0.92 | 0.00039 | 8.0e-01 | 0.10 | 5.29 | 6.42 | 4 | 0.89 | 0.0085 | 1.8 | -1.3 | 0.4 | 0.0892 |
| | 7. | cgatcg | cgatcg/cgatcg | 0.008486 | 0.0008128 | 6 | 0.53 | 1.9e-05 | 4.0e-02 | 1.39 | 7.51 | 10.44 | 2 | 0.55 | 0.0972 | 2e+02 | -2.3 | 0.3 | 0.0911 |
| Symbiosis | 8. | ccccca | ccccca/tggggg | 0.003434 | 0.0003178 | 5 | 0.45 | 0.00011 | 2.3e-01 | 0.64 | 6.74 | 10.80 | 2 | 0.45 | 0.0721 | 1.5e+02 | -2.2 | 0.2 | 0.0451 |
| | 9. | ccccac | ccccac/gtgggg | 0.003434 | 0.0003212 | 5 | 0.46 | 0.00012 | 2.4e-01 | 0.62 | 6.70 | 10.69 | 1 | 0.46 | 0.3733 | 7.8e+02 | -2.9 | 0.1 | 0.0456 |
| | 10. | attacc | attacc/ggtaat | 0.002104 | 0.0001702 | 4 | 0.32 | 0.00034 | 7.0e-01 | 0.15 | 6.50 | 12.36 | 3 | 0.32 | 0.0033 | 6.8 | -0.8 | 0.3 | 0.0318 |
| | 11. | agcttg | agcttg/caagct | 0.003419 | 0.0006178 | 7 | 1.24 | 0.00031 | 6.4e-01 | 0.19 | 5.16 | 5.53 | 4 | 1.19 | 0.0231 | 48 | -1.7 | 0.4 | 0.1186 |
| Nitrogen metabolism | 12. | gagcac | gagcac/gtgctc | 0.004149 | 0.0004352 | 7 | 0.72 | 1e-05 | 2.2e-02 | 1.66 | 7.41 | 9.53 | 3 | 0.71 | 0.02908 | 60 | -1.8 | 0.3 | 0.0707 |

Abbreviations: observed_freq = observed relative frequency, expected_freq = expected relative frequency, occ = observed occurrences, exp_occ = expected occurrences, occ_P = occurrence probability (binomial), occ_E = E-value for occurrences (binomial), occ_sig = occurrence significance (binomial), zscore = z-score (normal), ovl_occ = number of overlapping occurrences (discarded from the count), ratio = observed/expected ratio, ms = number of matching sequences, exp_ms = expected number of matching sequences, ms_P = matching sequence probability (binomial), ms_E = E-value for matching sequences (binomial), ms_sig = matching sequence significance (binomial), ms_freq = observed matching sequence frequency, exp_msf = expected matching sequence frequency, ms_ratio = observed/expected matching sequences.
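The relation between the P-value, E-value and significance columns of Table 13 can be sketched in a few lines of Python. The sketch assumes the usual oligonucleotide-analysis definitions: the E-value is the P-value multiplied by the number of patterns tested, and the significance is the negative base-10 logarithm of the E-value. The number of tested patterns (2080 hexanucleotides when each word is grouped with its reverse complement) is an inference from the ratio of Occ_E to Occ_P in the table, not a figure stated in the text; all function names are hypothetical.

```python
import math

def n_strand_grouped_words(k):
    """Number of distinct k-mers when each word is grouped with its
    reverse complement (reverse-palindromic words pair with themselves)."""
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (4 ** k + palindromes) // 2

def e_value(p_value, n_patterns):
    """Expected number of patterns reaching this P-value by chance."""
    return p_value * n_patterns

def significance(e_val):
    """Significance index: sig = -log10(E-value); positive when E < 1."""
    return -math.log10(e_val)

def obs_exp_ratio(observed_freq, expected_freq):
    """Observed/expected frequency ratio."""
    return observed_freq / expected_freq

# Values for ataaaa/ttttat taken from row 1 of Table 13.
n_tests = n_strand_grouped_words(6)   # assumed number of tested hexanucleotides
occ_e = e_value(9.5e-07, n_tests)
print(n_tests, round(occ_e, 4), round(significance(occ_e), 2))
print(round(obs_exp_ratio(0.004846, 0.0002574), 2))
```

Under these assumptions the ataaaa row is reproduced to the reported precision (Occ_E about 2.0e-03, Occ_sig about 2.70, ratio about 18.83). The same relation links Ms_P, Ms_E and Ms_sig, which is why patterns with Ms_E above 1 have negative Ms_sig values.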


Table 14. Genes whose upstream sequences contain both known TF binding sites and hexanucleotide patterns, detected through Tfsitescan and RSAT (oligonucleotide analysis tool) respectively, in each functional category gene set of M. loti. For each TF family, all hexanucleotides with a positive significance coefficient (sig > 0) were clustered on the basis of pattern sequence similarity and the corresponding known TF binding sites. Substitutions within a cluster were not overlapped. The higher significance coefficient (sig) values within each TF/binding site family generally correspond to the known sites.

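| S. No. | TF/binding site* family | Pattern sequence | Ms | Occ | Exp | Sig | Consensus | Bound factor or site | N2-fixation | Symbiosis | N2-met | Glutamate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | CAP/CRP | aattcg | 2 | 7 | 1.20 | 0.28 | ACACTTT | CAP/CRP-Lac | mll4698 | -- | -- | -- |
| 2. | MalT | ccccac | 1 | 5 | 0.46 | 0.62 | GGAKGA | Malt_Cs | -- | mll4979 | -- | -- |
| | | ggcaga | 6 | 8 | 1.57 | 0.32 | | | -- | -- | -- | mlr0039 |
| | | aattcg | 2 | 7 | 1.20 | 0.28 | | | mll3694 | -- | -- | -- |
| | | attacc | 3 | 4 | 0.32 | 0.15 | | | -- | mll9683 | -- | -- |
| | | ataaaa | 3 | 6 | 0.31 | 2.70 | | | -- | -- | -- | mll3030 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll3030 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll0343 |
| 3. | PhoP | ataaaa | 3 | 6 | 0.31 | 2.70 | TTHACA | PhoP | -- | -- | -- | mll3030 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll3030 |
| | | cgatcg | 2 | 6 | 0.53 | 1.39 | | | msl6623 | -- | -- | -- |
| | | ggcaga | 6 | 8 | 1.57 | 0.32 | | | -- | -- | -- | mlr6209 |
| | | caggga | 4 | 6 | 0.92 | 0.10 | | | mlr3659 | -- | -- | -- |
| 4. | ExsA | ataaaa | 3 | 6 | 0.31 | 2.70 | TNAAAANA | ExsA_Cs_(1), ExsA_Cs_(2) | -- | -- | -- | mll1646 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll1646 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll0343 |
| | | gggata | 3 | 4 | 0.27 | 0.42 | | | -- | -- | -- | mll7254 |
| | | ggcaga | 6 | 8 | 1.57 | 0.32 | | | -- | -- | -- | mlr6209 |
| 5. | MalT_malPp* | ataaaa | 3 | 6 | 0.31 | 2.70 | TCCTCC | MalT_malPp site* | -- | -- | -- | mll3030 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll3030 |
| | | aaacgt | 4 | 5 | 0.36 | 1.12 | | | -- | -- | -- | mll0343 |
| 6. | MomR/oxyR | agcttg | 4 | 7 | 1.24 | 0.19 | ATGCATCRW | MomR/oxyR | -- | mlr6175 | -- | -- |
| 7. | Nitrogen_reg* | ggcaga | 6 | 8 | 1.57 | 0.32 | TTTTGCA | Nitrogen_reg site* | -- | -- | -- | mlr1730 |
| 8. | Lambda* | attacc | 3 | 4 | 0.32 | 0.15 | GGYGTRYG | Lambda_C site* | -- | mll9683 | -- | -- |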

Abbreviations: ms = number of matching sequences, i.e. the number of sequences from the family that contain at least one occurrence of the pattern; occ = number of occurrences of the pattern among all upstream regions from the family; exp = expected number of occurrences; sig = significance index or coefficient value, calculated as defined in Methodology. An asterisk (*) marks a known site found through the ooTFD database.
