Copyright © Gennotate development team
User Guide
Written By
Yasser EL-Manzalawy
Introduction
As large amounts of genome sequence data become available, the development of
reliable and efficient genome annotation tools for assigning biological
interpretation to DNA sequences becomes increasingly desirable. Although several
computational genome annotation tools have been proposed, accurate and scalable
genome annotation remains a major challenge.
A variety of knowledge-based, statistical, and machine learning methods have been
developed for many genome annotation tasks. They differ in terms of the training data sets
used to train the predictive models, the data representations (e.g., sequence features) used
for encoding the inputs and outputs (class labels) of the predictive models, the algorithms
used for building the predictors, and the validation data sets and the performance metrics
used to assess the effectiveness of the predictors. Often, the data sets, the
implementations of the algorithms, and the data representations used are simply not
available to the research community in a form that allows rigorous comparison of
alternative approaches. Yet, such comparisons are essential for determining the strengths and
limitations of existing approaches so that further research can be focused on improving
these methods. For example, some of the methods are accessible via the Internet as online
Web servers. Comparison of the underlying computational methods implemented by such
servers is not straightforward in the absence of access to implementations of the
algorithms and the precise data sets and data representations used. This is further
complicated by the fact that some of the servers periodically update their predictors
using newly available data, newer computational methods, or new data representations,
making it difficult to determine whether reported or measured changes in predictive
accuracy stem from improvements in the methods, the data representations, or better data sets.
What is Gennotate?
Gennotate is a platform for sharing data representations, predictors, and machine learning
algorithms for a broad range of gene structure prediction tasks.
Gennotate has two main components (see Figure 1):
1) Model builder, an application for building and evaluating predictors and serializing
these models in a binary format (model files).
2) Predictor, an application for applying a model to test data (e.g., sequences to be
annotated).
The model builder application is an extension of WEKA [1], a widely used machine
learning workbench supporting many standard machine learning algorithms. WEKA
provides tools for data pre-processing, classification, regression, clustering,
validation, and visualization. Furthermore, WEKA provides a framework for
implementing new machine learning methods and data pre-processors. The
model builder extends WEKA by adding a suite of data pre-processors (called filters
in WEKA) for converting molecular sequences into vectors of numerical features
such that WEKA supported methods can be applied to the data. The current
implementation supports filters for generating several of the widely used data
representations of molecular sequences. Once the sequences are converted into
numeric or nominal features, any suitable WEKA learner can be trained and
evaluated on that data set.
Model builder
The model builder extends WEKA with a variety of DNA sequence preprocessors (WEKA
filters) and a number of classification algorithms (e.g., classifiers based on Markov models).
With very few exceptions, the machine learning algorithms supported in WEKA cannot
be directly applied to DNA sequence data. A preprocessing step that extracts features
from the sequence data is often required. The Gennotate model builder provides more
than 30 implemented sequence- and structure-based DNA feature extraction methods.
Additionally, a filter called ConcatenateFilter generates new features based on the
combination of any set of Gennotate features. Table 1 summarizes the list of currently
implemented Gennotate filters. For detailed information about these filters, please check
the Gennotate API documentation available at
http://ailab.cs.iastate.edu/gennotate/javadoc/index.html.
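As a rough illustration of what a k-mer representation computes (a sketch under our own naming, not the Gennotate KMerFilter source), the following snippet counts the overlapping k-mer substrings of a DNA sequence; each count becomes one numeric feature:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KMerSketch {
    // Count every overlapping k-mer substring of a DNA sequence.
    public static Map<String, Integer> kmerCounts(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }
}
```

For example, the sequence ACGACG contains the 3-mers ACG (twice), CGA, and GAC; a filter would emit these counts (or frequencies) as the instance's numeric attributes.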
Once the features have been extracted from the DNA sequences, many WEKA-supported
machine learning algorithms can be applied (including state-of-the-art algorithms for
classification, regression, clustering, and feature selection). In addition to the
implementations bundled with WEKA, Gennotate can run any third-party extension of
WEKA; this is as simple as adding the extra jar files to the CLASSPATH when running
Gennotate.
Figure 1: Gennotate model builder (left) and predictor (right).
The current implementation of Gennotate enriches WEKA with a number of classification
algorithms summarized in Table 2.
Table 1: List of Gennotate supported filters
Filter Description
DDNAFilter A filter for extracting dinucleotide structure features from DNA sequences.
DNA2Filter A filter for converting a DNA sequence into a new sequence over an
alphabet defined over all dinucleotide symbols.
DNASeqToNominalFilter A filter to convert a string attribute of a DNA sequence into nominal
attributes.
DNCFilter A filter to convert a string attribute into 400 features representing
compositions of dinucleotides.
KMerFilter A filter to convert a string attribute into numeric features represented as
the frequencies of its k-mer substrings.
MonoHBondFilter A filter for extracting hydrogen-bond-based DNA structure features.
NCFilter A filter to convert a string attribute into numeric features representing
compositions of nucleotides.
SubSequenceFilter A filter for extracting a substring from the DNA sequence.
TRIDNAFilter A filter for extracting trinucleotide structure features from DNA
sequences.
ConcatenateFilter A filter for concatenating the outputs of multiple filters.
Table 2: List of Gennotate classification algorithms
Classifier Description
HMMClassifier A classifier implementing a hidden Markov model learned from sequence data.
IMMClassifier A classifier implementing an interpolated Markov model learned from sequence data.
MMClassifier A classifier implementing a Markov model learned from sequence data.
BalancedClassifier A meta classifier for training a base classifier on an unbalanced data set.
ModelBased A meta classifier for performing classification/regression using a specified model
file.
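To give an intuition for what a Markov-model classifier such as MMClassifier does (a simplified first-order sketch under our own naming, not the Gennotate code), the model below learns transition counts from training sequences and scores a sequence by the log-likelihood of its transitions, with add-one smoothing over the 4-letter DNA alphabet:

```java
import java.util.HashMap;
import java.util.Map;

public class MarkovSketch {
    private final Map<String, Integer> trans = new HashMap<>();   // "AC" -> transition count
    private final Map<Character, Integer> from = new HashMap<>(); // 'A' -> outgoing count

    // Accumulate first-order transition counts from a training sequence.
    public void train(String seq) {
        for (int i = 0; i + 1 < seq.length(); i++) {
            trans.merge(seq.substring(i, i + 2), 1, Integer::sum);
            from.merge(seq.charAt(i), 1, Integer::sum);
        }
    }

    // Log-likelihood of seq under the learned transition probabilities,
    // with add-one smoothing over the 4-letter DNA alphabet.
    public double logLikelihood(String seq) {
        double ll = 0.0;
        for (int i = 0; i + 1 < seq.length(); i++) {
            int t = trans.getOrDefault(seq.substring(i, i + 2), 0);
            int f = from.getOrDefault(seq.charAt(i), 0);
            ll += Math.log((t + 1.0) / (f + 4.0));
        }
        return ll;
    }
}
```

To classify, one such model is trained per class, and a test sequence is assigned to the class whose model gives the higher log-likelihood.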
Predictor
The Predictor is a graphical user interface (GUI) for applying a saved prediction model to a
test data set. Specifically, the user inputs the model file, the test data file, the output file
name, the format of the test data (DNA fragments (one fragment per line) or FASTA
sequences), the type of the problem (peptide-based or nucleotide-based), and the
length of the peptide/window sequence. The output of the Predictor is a summary of the
input model (model name, model parameters, and the name of the data set used to build
the model) followed by the predictions. The predictions are four tab-separated columns
(see Figure 2). The first column is the sequence identifier. The second and third columns
are the position and the sequence of the predicted peptide/nucleotide. The last
column is the prediction score.
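Because the predictions are plain tab-separated text, they are easy to post-process. The following hypothetical parser (the class and field names are ours, not part of Gennotate) reads one prediction per four-column line:

```java
import java.util.ArrayList;
import java.util.List;

public class PredictionParser {
    // One row of the Predictor's four-column, tab-separated output:
    // sequence identifier, position, predicted subsequence, prediction score.
    public static class Prediction {
        public final String id;
        public final int position;
        public final String seq;
        public final double score;

        Prediction(String id, int position, String seq, double score) {
            this.id = id;
            this.position = position;
            this.seq = seq;
            this.score = score;
        }
    }

    public static List<Prediction> parse(List<String> lines) {
        List<Prediction> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            out.add(new Prediction(f[0], Integer.parseInt(f[1]), f[2],
                    Double.parseDouble(f[3])));
        }
        return out;
    }
}
```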
Installing and running Gennotate
Gennotate is platform-independent since it is implemented in Java. To install
Gennotate, download it from the project web site and unzip the
compressed file. To run Gennotate, add all the jar files included in the lib
folder to the CLASSPATH and run the gennotate.jar file.
For example, the following command sets the CLASSPATH and runs Gennotate on Windows
machines:
java -Xmx1024m -classpath "gennotate.jar;weka.jar" gennotate.gui.MainGUI
On Linux machines, replace the “;” separator with “:”:
java -Xmx1024m -classpath "gennotate.jar:weka.jar" gennotate.gui.MainGUI
Using Gennotate
In this section, we show several examples of how to use Gennotate to develop predictors
from DNA sequence data. For this purpose, we use two in-house data sets for predicting
sigma 70 promoters in E. coli:
1) Sigma70.arff is a non-redundant data set extracted from RegulonDB on June 24,
2013. The data set contains 579 promoter sequences published before April 2009.
None of the 579 promoter sequences shares more than 45% similarity with any
other sequence in the promoter data. There are also 579 non-promoter sequences,
none of which shares more than 45% similarity with any promoter or
non-promoter sequence.
Figure 2: Example Predictor output.
2) Sigma70_test is a non-redundant data set extracted from RegulonDB on June 24,
2013. All promoter sequences were published after April 2009. The data set has 792
promoter and 792 non-promoter sequences. None of the sequences shares more
than 45% similarity with any other sequence. The test data is provided in two formats: 1)
standard WEKA format (file Sigma70_test.arff); 2) one fragment per line format (file
Sigma70_test.txt).
Building your first predictor
Here, we show how to build your first predictor using Sigma70.arff data and HMMClassifier
and store it for future use on test data.
1. Run Gennotate
2. Go to Application menu and select model builder application.
3. In the model builder window (WEKA explorer augmented with Gennotate filters and
prediction methods) click open and select the file /Example/Data/Sigma70.arff.
4. Click classify tab
5. In classifier panel click choose and browse for HMMClassifier
6. The HMMClassifier has two parameters: the input data alphabet (default ACGTN)
and whether the input sequences have gaps (default false). Keep the default
parameters and click OK.
7. Having both the data set and the classification algorithm specified, we are ready to
build the model and evaluate it using 10-fold cross-validation. Just click the Start
button and wait for the 10-fold cross-validation procedure to finish. The classifier
output shows several statistical estimates for the HMMClassifier obtained by 10-fold
cross-validation. For example, the accuracy and AUC of the model are 72.8% and
0.81, respectively.
8. To save the model, right click on the model in the Result list panel and select Save
model. Save your model as /Examples/Models/Sigma70HMM.model.
Applying your model to test data
There are several ways to apply your model to test data. First, if your test data are
stored in WEKA format, then you can use the model builder directly to apply the model to
the test data and get predictions and some performance measures. To do that, follow
these steps:
1. In the Test options panel, click Supplied test set and click Set to specify the test data
file /Examples/Models/Sigma70_test.arff.
2. Right click on the Result list panel and select Load model to load
/Examples/Models/Sigma70HMM.model. After successfully loading the model, the
classifier output shows information about the training data, the algorithm, and its
parameters.
3. By default, the WEKA explorer does not output predictions. To output predictions,
click More options and check the output predictions option.
4. Click Start and wait for the model to be evaluated on the test data. The classifier
output panel will then display the predictions and some performance evaluation
measures.
Second, if your test data are in a Gennotate-supported format (e.g., FASTA or a single DNA
fragment per line), then you can use the Predictor application to apply a saved model and
get predictions. For example, to apply Sigma70HMM.model to the test data in
/Examples/Data/Sigma70_test.txt, follow these steps:
1. Run Predictor from Application menu.
2. Specify your input and output files as in the figure below.
3. Click Predict and wait to see the output in the Predictions panel and also in the
output file /Examples/Output/Sigma70_test_out.txt.
Case Study 1: Predicting promoter
regions in E.coli using sequence and
structure features
In the previous section, we showed how to build an HMM model for predicting sigma 70
promoters in E. coli. A major difference between HMMClassifier and traditional
classifiers such as Naïve Bayes (NB) and Random Forest is that HMMClassifier can
be applied directly to sequence data, while traditional classifiers expect the data to
be in the form of feature vectors extracted from the original sequence data. Here, we
show how to simultaneously extract features from sequence data and build/test a
model, thanks to the WEKA FilteredClassifier, which allows us to specify a machine
learning algorithm and a filter to be applied on the fly before feeding the data to the
predictor.
To build an NB classifier using 3-mer features, follow these steps:
1. Run Gennotate
2. Go to Application menu and select model builder application.
3. In the model builder window (WEKA explorer augmented with Gennotate filters
and prediction methods) click open and select the file
/Example/Data/Sigma70.arff.
4. Click classify tab.
5. In classifier panel click choose and browse for
weka.classifiers.meta.FilteredClassifier.
6. Click on the classifier schema in classifier panel to get the following window.
7. Change the classifier to weka.classifiers.bayes.NaiveBayes (with its default
parameters) and the filter to gennotate.filters.unsupervised.KMerFilter (set k
parameter to 3). Click OK.
8. Click Start to run the 10-fold cross-validation experiment. The following figure
shows the result of our experiment.
You can repeat the preceding procedure for different choices of classifiers and Gennotate
filters. Table 3 compares Naïve Bayes (NB) and Random Forest with 50 trees (RF50) for
k = 1, 2, 3, and 4. Interestingly, neither classifier is competitive with the HMM classifier,
which achieved an AUC of 0.81 on the same data set.
Table 3: Performance (in terms of AUC score) comparison of NB and RF50 on Sigma70 data
using different sequence-based features.
Features NB RF50
1-mer 0.64 0.58
2-mer 0.65 0.66
3-mer 0.65 0.67
4-mer 0.65 0.66
To build models using structure features, follow the preceding procedure and replace
KMerFilter with DDNAFilter, which allows us to experiment with 12 different dinucleotide
structure-based features [2] (see the Gennotate API documentation for detailed information
about these methods). Table 4 compares NB and RF50 using 10-fold cross-validation and
different structure-based features extracted from the Sigma70.arff data. In several cases,
structure-based features helped us reach a performance that is competitive with the HMM
classifier. RF50 seems to be doing better than NB. However, it should be noted that the
number of trees was arbitrarily set to 50; there could be room for improvement
using larger numbers of trees (we leave this as an exercise for the user). For future
experiments, let’s save the best model in Table 4 as
/Examples/Models/Sigma70_Stability_RF50.model.
Table 4: Performance (in terms of AUC score) comparison of NB and RF50 on Sigma70 data
using twelve different dinucleotide structure-based features.
Features NB RF50
DI_APHYLICITY 0.68 0.68
DI_BDNATWISTOHLER 0.62 0.68
DI_BDNATWISTOLSON 0.64 0.77
DI_DNABENDSTIFF 0.78 0.76
DI_DNADENATURE 0.76 0.77
DI_ZDNASTABENERGY 0.78 0.79
DI_DUPLEXSTAB_DISRUPTENERGY 0.74 0.78
DI_DUPLEXSTAB_FREEENERGY 0.77 0.77
DI_PINDUCEDDEFORM 0.74 0.77
DI_PROPELLERTWIST 0.75 0.75
DI_PROTEINDNATWIST 0.64 0.65
DI_STACKINGENERGY 0.77 0.77
Case Study 2: Improved prediction of
promoter regions in E.coli
In case study 1, we evaluated the prediction of sigma 70 promoters in E. coli using twelve
different methods for extracting dinucleotide features. In general, better performance can
be achieved by: 1) combining a set of these features; 2) building an ensemble of classifiers
where each base classifier is trained using a different structure-based feature set; 3)
combining all 12 sets of structure features and using a feature selection method to find an
optimal subset of features. Here, we show how to use Gennotate to build improved
methods using these three approaches.
Concatenating features
To build a single classifier that takes as input the features extracted by the twelve different
dinucleotide feature methods, use gennotate.filters.ConcatenateFilter.
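Conceptually, this kind of filter combination just appends the individual feature vectors end to end. A minimal sketch of the idea (the class and method names are ours, not the ConcatenateFilter source):

```java
public class ConcatSketch {
    // Append per-filter feature vectors end to end into one combined vector.
    public static double[] concat(double[]... parts) {
        int total = 0;
        for (double[] p : parts) total += p.length;
        double[] out = new double[total];
        int pos = 0;
        for (double[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }
}
```

In this picture, each of the twelve DDNAFilter configurations contributes one part of the combined vector for a given sequence.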
1. Run Gennotate
2. Go to Application menu and select model builder application.
3. In the model builder window (WEKA explorer augmented with Gennotate filters
and prediction methods) click open and select the file
/Example/Data/Sigma70.arff.
4. Click classify tab.
5. In classifier panel click choose and browse for
weka.classifiers.meta.FilteredClassifier.
6. Click on the classifier schema in classifier panel to get the following window.
7. Change the classifier to weka.classifiers.trees.RandomForest (set the number of
trees to 50) and the filter to gennotate.filters.ConcatenateFilter
8. Click on the ConcatenateFilter and input the twelve filters (i.e., DDNAFilter with
twelve different selections of the ConversionTable parameter).
9. Click Start to run the 10-fold cross-validation experiment. The following figure
shows the result of our experiment.
The following figure shows the cross-validation performance of the predictor using the
twelve combined sets of structure features. The result is better than that obtained with
any single set of structure features.
Concatenating features and selecting optimal subset of features
In the preceding experiment, we showed that working with a concatenation of twelve sets of
features can improve the performance of the resulting model. In general, this high-dimensional
feature space might contain several irrelevant and/or redundant features. Here, we show how to
use WEKA feature selection together with our ConcatenateFilter to further improve the
performance of the resulting model.
1. Follow steps 1-6 in the previous experiment.
2. In the FilteredClassifier window, choose RF50 as your classifier and choose
weka.filters.MultiFilter as your filter.
3. Click on the MultiFilter and input two filters: i) ConcatenateFilter with twelve
DDNAFilters, each with a different choice of ConversionTable; ii) the AttributeSelection
filter.
4. For the AttributeSelection filter, you can choose from a large pool of WEKA-provided
feature selection methods and search algorithms. For our experiment, we ranked all
features based on information gain and used the 20 top-ranked features.
5. Click Start and wait for the cross-validation results. The output will show the
selected features used to build the model (see Table 5 for the list of the top 20 features)
as well as some performance measures. Interestingly, the model has a
better AUC (0.83) than the model that uses the full set of features (0.80).
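For intuition about the information-gain ranking used above, the following self-contained sketch (our own naming, not WEKA's implementation) computes the gain of a binary split as the class entropy minus the weighted entropies of the two branches:

```java
public class InfoGainSketch {
    // Shannon entropy (in bits) of a two-class distribution with the given counts.
    static double entropy(int pos, int neg) {
        double total = pos + neg;
        double h = 0.0;
        for (int c : new int[]{pos, neg}) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    // Information gain of splitting (pos, neg) examples into a left branch
    // holding (posL, negL) and a right branch holding the remainder.
    public static double infoGain(int pos, int neg, int posL, int negL) {
        int posR = pos - posL, negR = neg - negL;
        double total = pos + neg;
        double wL = (posL + negL) / total, wR = (posR + negR) / total;
        return entropy(pos, neg) - wL * entropy(posL, negL) - wR * entropy(posR, negR);
    }
}
```

A feature whose split perfectly separates the classes gets a gain of 1 bit; a split that leaves both branches with the original class mix gets a gain of 0, which is why such features rank at the bottom.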
Table 5: List of top 20 structure features
DI_DUPLEXSTAB_FREEENERGY48
DI_DNABENDSTIFF48
DI_DUPLEXSTAB_DISRUPTENERGY48
DI_DUPLEXSTAB_FREEENERGY49
DI_PINDUCEDDEFORM49
DI_DNABENDSTIFF49
DI_DUPLEXSTAB_FREEENERGY50
DI_DUPLEXSTAB_FREEENERGY47
DI_DNABENDSTIFF47
DI_STACKINGENERGY48
DI_DUPLEXSTAB_DISRUPTENERGY49
DI_DNADENATURE48
DI_DNABENDSTIFF50
DI_ZDNASTABENERGY49
DI_ZDNASTABENERGY50
DI_DUPLEXSTAB_DISRUPTENERGY47
DI_STACKINGENERGY47
DI_DUPLEXSTAB_DISRUPTENERGY50
DI_DNADENATURE50
DI_STACKINGENERGY49
Improved predictions of sigma 70 promoters using an ensemble of classifiers
In this experiment, we build a number of classifiers using RF50 and different choices of the
dinucleotide structure-based features. The base classifiers will be combined using a
second-stage classifier, the WEKA Logistic classifier.
1. Run Gennotate
2. Go to Application menu and select model builder application.
3. In the model builder window (WEKA explorer augmented with Gennotate filters
and prediction methods) click open and select the file
/Example/Data/Sigma70.arff.
4. Click classify tab.
5. In classifier panel click choose and browse for weka.classifiers.meta.Stacking. Set
numFolds to 3, set the metaClassifier to weka.classifiers.functions.Logistic,
and input 12 classifiers, each a FilteredClassifier with RF50 and a different
choice of ConversionTable for DDNAFilter.
6. Click Start to run the 10-fold cross-validation experiment. The following figure
shows the result of our experiment.
Case Study 3: Improved prediction of
promoter regions in E.coli using meta-
predictors
An interesting property of Gennotate is that it allows sharing not only data sets but also the
learned models. Once you have a number of different predictors for the same classification
task, you can: 1) use the Predictor application to apply any of these predictors to some test
data; 2) rebuild the prediction model using updated/different training data; 3) build a
consensus or hybrid predictor that combines these predictors. The first usage has been
shown earlier. The second usage can be done simply by loading the new training data,
loading the current model, and performing a cross-validation experiment. The results will
show the performance of the new model, which can also be saved as a model file for further
use. The third usage is the focus of this case study.
To facilitate the development of a consensus/hybrid predictor that relies on existing
predictors not necessarily developed by the same user, Gennotate provides a meta-
classifier called ModelBased. In the following experiment, we show how to use ModelBased
classifier to build a consensus predictor combining Sigma70HMM and
Sigma70_Stability_RF50 developed earlier.
1. Run Gennotate
2. Go to Application menu and select model builder application.
3. In the model builder window (WEKA explorer augmented with Gennotate filters
and prediction methods) click open and select the file
/Example/Data/Sigma70_test.arff. Please note that our goal is to combine
existing models, so there is no need to retrain these models; instead, we will
use the test data to evaluate the combination of these predictors.
4. Click classify tab.
5. In classifier panel click choose and browse for weka.classifiers.meta.Vote
6. Input two classifiers, each using gennotate.classifiers.meta.ModelBased, and
set the modelFile parameter as shown in the following figure.
7. In the Test options panel, choose use training data. Note that the ModelBased
classifier does not perform any training; it just loads the model and keeps it for
predictions. Hence, what is reported is the performance of applying the models
encapsulated within the ModelBased classifiers to what WEKA treats as training data.
The obtained performance of the consensus predictor combining the HMM model and the
RF50 model is almost the same as the performance of the HMM alone (an AUC of 0.81). In
practice, we expect improvements in performance when we combine several (not just two)
predictors.
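For intuition, an unweighted Vote-style consensus amounts to averaging the base predictors' scores; a minimal sketch under our own naming (the real Vote classifier supports several other combination rules as well):

```java
public class VoteSketch {
    // Average the class-probability scores of several base predictors,
    // as an unweighted average-of-probabilities consensus does.
    public static double consensusScore(double... scores) {
        double sum = 0.0;
        for (double s : scores) sum += s;
        return sum / scores.length;
    }
}
```

With only two base predictors, the consensus score sits halfway between them, which helps explain why combining the HMM and RF50 models moved the AUC so little here.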
Please note that we can build a hybrid model using the HMM and RF50 models simply by
following the preceding procedure and replacing the Vote classifier with the Stacking
classifier. However, in that case the user might perform a cross-validation test, and the
results should be handled with caution because the test data have been used to train the
meta-predictor in the Stacking classifier.
Extending Gennotate
Gennotate is extensible, in the sense that anyone can add extra filters or extra
classification methods. To add your own classification methods or filters, please follow the
procedure described in the WEKA documentation on how to write your own classifier and
your own filter. Once you have a jar file including your added components, just add it to the
CLASSPATH when running Gennotate and enjoy your customized version of Gennotate.
References
[1] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining
software: an update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
[2] Gan, Y., Guan, J., & Zhou, S. (2012). A comparison study on feature selection of DNA structural properties
for promoter prediction. BMC bioinformatics, 13(1), 4.