Mature miRNA identification via the use of a Naive Bayes classifier
Katerina Gkirtzou, Panagiotis Tsakalides and Panayiota Poirazi
Abstract- MicroRNAs (miRNAs) are small single stranded RNAs, on average 22nt long, generated from endogenous hairpin-shaped transcripts with post-transcriptional activity. Although many computational methods are currently available for identifying miRNA genes in the genomes of various species, very few algorithms can accurately predict the functional part of the miRNA gene, namely the mature miRNA. We introduce a computational method that uses a Naive Bayes classifier to identify mature miRNA candidates based on sequence and secondary structure information of the miRNA precursor. Specifically, for each mature miRNA, we generate a set of negative examples of equal length on the respective precursor(s). The true and negative sets are then used to estimate probability distributions for sequence composition and secondary structure on each position along the RNA. The distance between these distributions is estimated using the symmetric Kullback-Leibler metric. The positions at which the two distributions differ significantly and consistently over a 10-fold cross-validation procedure are used as features for training the Naive Bayes classifier. A total of 15 classifiers were trained with true positive and negative examples from human and mouse. A performance of 76% sensitivity and 65% specificity was achieved using a consensus averaging over a 10-fold cross-validation procedure. Our findings suggest that position specific sequence and structure information combined with a simple Bayes classifier achieve a good performance on the challenging task of mature miRNA identification.
I. INTRODUCTION
MicroRNAs (miRNAs) are small, non-coding RNAs that play an important role in regulating the expression of numerous genes across several species [1]. As regulatory molecules, they influence the output of many protein-coding genes by targeting mRNAs for cleavage or translational repression [2].
Although miRNAs are functionally similar to short interfering RNAs (siRNAs), they are unique in terms of their biogenesis. Most of the miRNA genes are transcribed by RNA Polymerase II. The primary transcripts of miRNAs (pri-miRNAs) are then processed into hairpin intermediates (precursor miRNAs or pre-miRNAs) by the microprocessor complex (the enzyme Drosha and the binding protein DGCR8/Pasha). The pre-miRNAs are then exported to the
This work was supported by FORTH-ICS and the EMBO Young Investigator Program.
K. Gkirtzou is with the Department of Computer Science, University of Crete and the Institute of Computer Science (ICS), Foundation for Research and Technology, Hellas (FORTH), Heraklion, Greece [email protected]
P. Tsakalides is with the Department of Computer Science, University of Crete and the Institute of Computer Science (ICS), Foundation for Research and Technology, Hellas (FORTH), Heraklion, Greece [email protected]
P. Poirazi is with the Institute of Molecular Biology and Biotechnology (IMBB), Foundation for Research and Technology, Hellas (FORTH), Heraklion, Greece poirazi@imbb.forth.gr
cytoplasm by RanGTP and Exportin-5. In the cytoplasm, the pre-miRNAs are processed by Dicer into short RNA duplexes termed miRNA duplexes. The mature miRNA from each miRNA duplex then binds to an Argonaute protein, forming the miRNP complex. The miRNAs base-pair with their mRNA targets, leading either to mRNA cleavage, if there is sufficient complementarity between the miRNA and the target mRNA, or to translational repression [3].
Several computational methods have been developed and are currently used in parallel with experimental techniques in order to facilitate the discovery of new miRNAs. Most computational methods focus on the discovery of either novel miRNA genes in the genomes of various species or possible mRNA targets of the known miRNAs. On the contrary, few attempts have been made to computationally predict the functional part of the miRNA precursor, namely the mature miRNA. A number of studies ([4], [5], [6]) combine miRNA gene prediction with the identification of a possible start position for the mature. To our knowledge, only one study [7] focuses exclusively on mature miRNA prediction, utilizing thermodynamic and structural information of the precursor RNA.
In this work, we introduce a computational method that uses a Naive Bayes classifier to identify mature miRNA candidates based on sequence and secondary structure information of the miRNA precursor.
II. METHOD
A. Naive Bayes Classifier
Naive Bayes is a simple probabilistic classifier which is based on the application of the Bayesian theorem with strong (naive) independence assumptions. According to the Bayesian classifier, a new sample described by the feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ will be assigned to the class that minimizes the overall risk using the following formula:
$$\alpha(\mathbf{x}) = \arg\min_{\alpha_i \in A} \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$
where:
• $\omega_1, \ldots, \omega_c$ is a finite set of classes,
• $A = \{\alpha_1, \ldots, \alpha_c\}$ is a finite set of actions, where $\alpha_i$ means selecting class $\omega_i$,
• $\lambda(\alpha_i \mid \omega_j)$ is the loss associated with deciding $\omega_i$ when the true state of nature is $\omega_j$, and
• $P(\omega_j \mid \mathbf{x})$ is the posterior probability of $\omega_j$ being the true state of nature given $\mathbf{x}$.

The posterior probability $P(\omega_j \mid \mathbf{x})$ can be computed by Bayes' formula (see section 2.9 of [8]):

$$P(\omega_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{P(\mathbf{x})}$$

where
• $P(\mathbf{x} \mid \omega_j)$ is the state-conditional probability for $\mathbf{x}$ conditioned on $\omega_j$ being the true class,
• $P(\omega_j)$ is the prior probability that nature is in state $\omega_j$, and
• $P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x} \mid \omega_j)\, P(\omega_j)$ is the evidence for $\mathbf{x}$.
The Naive Bayes classifier is based on the simplifying assumption that the input features among samples of any given class are conditionally independent given the class [9]. In other words, given the class of a sample, the probability of observing the conjunction $x_1, x_2, \ldots, x_n$ is just the product of the probabilities for the individual features of this sample:
$$P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_{i=1}^{n} P(x_i \mid c_j)$$
In our case, an observation for classification (i.e. a sample) is a mature miRNA candidate and the possible classes are two: the Positive class, which contains true mature miRNAs (denoted $\omega_1$), and the Negative class, which contains false mature miRNAs (denoted $\omega_{-1}$). Suppose we have a mature miRNA candidate with features $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and we want to classify it to the class that minimizes the classification error. The simplest case is to consider that all errors have the same cost, so the loss function of interest is the zero-one loss function and the Bayes Decision Rule is converted to the following:
Decide $\omega_1$ if $P(\omega_1 \mid \mathbf{x}) > P(\omega_{-1} \mid \mathbf{x})$; otherwise decide $\omega_{-1}$
(see section 2.3 of [8]). Since $P(\mathbf{x})$ is only a normalization factor, it can be omitted in order to minimize calculation time. Moreover, since there is no information about the probabilities of each class, we can assume that the prior probability that nature is in state $\omega_j$, $P(\omega_j)$, is 50% for both positive and negative data. This assumption prevents us from favoring a particular class. Under these assumptions, the Bayes Decision Rule is given by the following simplified formula:
Decide $\omega_1$ if $P(\mathbf{x} \mid \omega_1) > P(\mathbf{x} \mid \omega_{-1})$; otherwise decide $\omega_{-1}$
In this work, we use the aforementioned formula to build multiple classifiers that are trained to discriminate between randomly selected sets of positive and negative miRNA samples, as detailed below.
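The simplified decision rule can be sketched in code. The following is an illustrative Python sketch, not the authors' implementation: per-position class-conditional probabilities are assumed to be given as dictionaries, and a small floor probability stands in for unseen feature values.

```python
import math

def naive_bayes_decide(x, cond_pos, cond_neg, floor=1e-6):
    """Decide w1 (positive, +1) if P(x|w1) > P(x|w-1), else w-1 (-1).

    x:        feature vector, one value per selected position
    cond_pos: per-position dicts mapping feature value -> P(value | w1)
    cond_neg: per-position dicts mapping feature value -> P(value | w-1)
    """
    # Naive independence: the class-conditional likelihood of the whole
    # vector is the product of the per-position probabilities; summing
    # logs avoids numerical underflow over many positions.
    log_p_pos = sum(math.log(cond_pos[i].get(v, floor)) for i, v in enumerate(x))
    log_p_neg = sum(math.log(cond_neg[i].get(v, floor)) for i, v in enumerate(x))
    return 1 if log_p_pos > log_p_neg else -1
```

Working in log space changes nothing about the decision, since the logarithm is monotonic.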
B. The Negative class
Given that known miRNA precursors do not produce multiple overlapping mature miRNAs from the same arm of the foldback precursor [10], we generate a set of negative
Fig. 1. Class conditional probabilities of a feature with combined sequence and structural information for position 0 within mature miRNA candidates of both the positive (black color) and the negative class (white color). The x axis shows the possible values of the feature, where: 1 → A match, 2 → A mismatch, 3 → C match, 4 → C mismatch, 5 → G match, 6 → G mismatch, 7 → U match and 8 → U mismatch. The y axis shows the class conditional probability for this feature.
examples in the following way: for each true mature miRNA, we use a same-size sliding window and select all possible "negative" matures which can be created by sliding 1 base pair towards either direction from the mature, excluding any hairpin loops. This procedure results in a very large negative set, where each true mature has a variable number of respective "negatives", depending on the length and number of precursors. To avoid overfitting the classifiers to the negative data, we only use a randomly selected subset of 10 negative examples for each true mature.
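The negative-set generation above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the hairpin loop is assumed to be given as a half-open interval of precursor coordinates, and all function and parameter names are hypothetical.

```python
import random

def sample_negative_matures(precursor, true_start, mature_len,
                            loop, n_keep=10, seed=0):
    """Slide a window of the true mature's length 1nt at a time along the
    precursor, collect every shifted start as a "negative" mature, exclude
    windows that overlap the hairpin loop, and keep a random subset of
    n_keep examples to avoid overfitting to the negative class.

    loop: (lo, hi) half-open interval of the hairpin loop on the precursor
    """
    loop_lo, loop_hi = loop
    starts = []
    for start in range(len(precursor) - mature_len + 1):
        if start == true_start:
            continue  # the true mature itself is not a negative
        if start < loop_hi and start + mature_len > loop_lo:
            continue  # window overlaps the hairpin loop
        starts.append(start)
    rng = random.Random(seed)
    return sorted(rng.sample(starts, min(n_keep, len(starts))))
```

A fixed seed is used here only so the subsampling is reproducible in this sketch.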
C. Input Features
The miRNA precursors form irregular hairpin structures, containing various mismatches, internal loops and bulges. In our method, a mature miRNA is represented as a sequence of positions, where each position contains sequence information (A, C, U, G) or structural information (match or mismatch), derived from the respective precursor(s). Apart from the features that lie in positions within the mature miRNA, we also consider features that lie within a flanking region of variable size (0, 5, 7, 10 or 12nt) that extends symmetrically along both sides of the mature sample. These features are also positions on the precursor RNA and contain sequence or structural information, just like the features located within the mature miRNA. Since certain positions within the flanking region may be located outside the precursor, we use a special 'novalue' flag to indicate the lack of information at these positions and do not take them into account when estimating the Kullback-Leibler divergence between the two classes (see below).
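The positional encoding can be sketched as follows, taking the combined 1-8 value mapping from the Fig. 1 caption. This is an illustrative Python sketch: the dot-bracket representation of the secondary structure and all names are assumptions, not the authors' implementation.

```python
NUCS = "ACGU"

def position_code(nt, paired):
    """Combined sequence/structure code as in Fig. 1:
    1 -> A match, 2 -> A mismatch, 3 -> C match, ..., 8 -> U mismatch."""
    return 2 * NUCS.index(nt) + (1 if paired else 2)

def candidate_features(precursor, structure, start, length, flank):
    """Feature vector over a mature candidate plus a symmetric flanking
    region; positions falling outside the precursor get 'novalue'.

    structure: dot-bracket string, '(' / ')' paired, '.' unpaired (assumed)
    """
    feats = []
    for i in range(start - flank, start + length + flank):
        if 0 <= i < len(precursor):
            feats.append(position_code(precursor[i], structure[i] in "()"))
        else:
            feats.append("novalue")  # off the precursor: no information
    return feats
```

Positions flagged 'novalue' would simply be skipped when estimating the divergence between the two classes.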
D. Feature Selection
TABLE I
BAYES CLASSIFIER TRAINED WITH FEATURES CONTAINING SEQUENCE INFORMATION.

Classifier's Description                            Sensitivity   Specificity   MCC
Combination of 12 Features, 0nt flanking region     67.10%        55.10%        0.0850
Combination of 16 Features, 5nt flanking region     76.04%        53.34%        0.1074
Combination of 31 Features, 7nt flanking region     75.96%        53.20%        0.1071
Combination of 19 Features, 10nt flanking region    79.15%        47.01%        0.0960
Combination of 35 Features, 12nt flanking region    74.30%        51.33%        0.0945
III. RESULTS
We evaluated our method using a dataset of experimentally verified human and mouse miRNAs from miRBase (version 10.0, [13], [14], [15]). The human dataset consists of 533 precursors and 722 mature miRNAs, while the mouse dataset consists of 442 precursors and 579 mature miRNAs. We used 500 of the human and 347 of the mouse precursors, which generate 692 and 440 mature miRNAs respectively, to train and validate our classifiers utilizing a 10-fold cross-validation procedure.
For each of the mature miRNAs in the training set, a negative set was generated as described in section II-B. A total of 15 classifiers were trained with different feature sets and the classification performance was assessed using consensus averaging over a 10-fold cross validation. To ensure a realistic estimation of classification accuracy, the validation sets consisted of true miRNA precursors, whose mature miRNAs were left out from the training sets in the cross validation procedure. Classification accuracy was estimated using a fixed 22nt size sliding window. All possible matures which could be created by sliding 1 base pair in both stem arms of the precursor, apart from the hairpin loop(s), were assigned to either class based on the averaged outcome of the 15 trained classifiers. It is important to note that classification accuracy was estimated based on exact match of the starting position of the predicted compared to the real mature miRNA. Even 1nt deviations were considered as negative examples.
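The consensus averaging and the exact-start scoring described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: representing each classifier's decision as a +/-1 vote and the exact-match bookkeeping are assumptions.

```python
def consensus_class(votes):
    """Average the +/-1 decisions of the individual classifiers and call
    the candidate a mature miRNA only if the mean vote is positive."""
    return 1 if sum(votes) / len(votes) > 0 else -1

def exact_start_counts(predicted_starts, true_starts):
    """Exact-match scoring: a predicted mature counts as a true positive
    only if its start position equals a real mature's start; even a 1nt
    deviation is counted against the classifier."""
    tp = sum(1 for s in predicted_starts if s in true_starts)
    return tp, len(predicted_starts) - tp  # (true positives, false positives)
```

With 15 classifiers the vote count is odd, so the mean vote cannot be exactly zero and no tie-breaking rule is needed.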
The aforementioned location-specific information is used to select a set of features, namely those positions on the precursor that contain discriminatory information between true matures and negative samples. The discriminatory power of each position is estimated using the symmetric Kullback-Leibler divergence between the distributions of positive and negative data.

The Kullback-Leibler divergence (K-L divergence) is a measure of the difference between two probability distributions [11]. For Probability Mass Functions (PMFs) P and Q of a discrete random variable, the K-L divergence of Q from P is defined as:

$$D_{KL}(P \| Q) = \sum_{i} P(i) \log_2 \frac{P(i)}{Q(i)}$$

Unfortunately, the KL divergence is not a true metric since it is not symmetric. To overcome this problem we used the symmetric and nonnegative Kullback-Leibler divergence [12], which is defined as:

$$\frac{1}{2}\left(D_{KL}(P \| Q) + D_{KL}(Q \| P)\right)$$

and is commonly used in classification problems. Figures 1 and 2 show the class conditional probability distributions for two features with combined sequence and structure information for positions 0 and 3 within the mature miRNA candidates of the negative and positive data used to train our classifiers (see below). These features have the highest Kullback-Leibler divergence among all features that lie within the mature miRNA. As evident from both figures, the two class distributions are very similar, making discrimination a very challenging task.

Fig. 2. Class conditional probabilities of a feature with combined sequence and structural information for position 3 within mature miRNA candidates of both the positive (black color) and the negative class (white color). The x axis shows the possible values of the feature, where: 1 → A match, 2 → A mismatch, 3 → C match, 4 → C mismatch, 5 → G match, 6 → G mismatch, 7 → U match and 8 → U mismatch, while the y axis shows the class conditional probability for this feature.
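The symmetric divergence used for ranking positions can be sketched directly from its definition. This is an illustrative Python sketch (not the authors' code); the small epsilon guarding against zero probabilities in the denominator is an assumption.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) = sum_i P(i) * log2(P(i) / Q(i)), in bits.
    Terms with P(i) = 0 contribute nothing; eps guards a zero in Q."""
    return sum(pi * math.log2(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Jeffreys' symmetrized divergence: (D_KL(P||Q) + D_KL(Q||P)) / 2."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Feature selection would then amount to keeping the positions whose positive and negative class distributions have the largest symmetric divergence.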
Tables I, II and III show the top scoring classifiers, based on Matthews Correlation Coefficient (MCC), for different input features. We use three types of classifiers, each utilizing location-specific information about the sequence (Table I), the structure (Table II), or both the sequence and structure (Table III) of the training examples. Each table shows the sensitivity, specificity and Matthews Correlation Coefficient (MCC) [16] achieved with different numbers of such features and with different sizes of flanking regions around the
TABLE II
BAYES CLASSIFIER TRAINED WITH FEATURES CONTAINING STRUCTURE INFORMATION.

Classifier's Description                            Sensitivity   Specificity   MCC
Combination of 10 Features, 0nt flanking region     65.70%        54.30%        0.0730
Combination of 26 Features, 5nt flanking region     76.34%        52.64%        0.1056
Combination of 23 Features, 7nt flanking region     77.85%        54.29%        0.1186
Combination of 39 Features, 10nt flanking region    81.01%        56.63%        0.1373
Combination of 38 Features, 12nt flanking region    79.89%        55.51%        0.1300
mature miRNA. The positions along the precursor which served as input features were selected based on the K-L divergence metric and they were located either within the mature miRNA, or inside a flanking region around it.
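The MCC reported in the tables balances all four confusion-matrix counts, which matters here because negatives vastly outnumber positives. A minimal sketch of the standard formula (the zero-denominator convention is an assumption):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from a 2x2 confusion matrix;
    returns 0 when any marginal is empty (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction).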
We found that as the size of the flanking region increased, the sensitivity of the classifiers tended to improve, while the specificity remained relatively unaffected, independently of the type of features used. This improvement seemed to reach a maximum for a flanking region of about 10nt. For classifiers with flanking regions of 12nt utilizing either sequence or structure information (Tables I and II respectively), the extra features did not further improve the accuracy, suggesting that they probably add more noise than useful information.
TABLE III
BAYES CLASSIFIER TRAINED WITH FEATURES CONTAINING BOTH SEQUENCE AND STRUCTURE INFORMATION.

Classifier's Description                            Sensitivity   Specificity   MCC
Combination of 20 Features, 0nt flanking region     68.50%        62.50%        0.1250
Combination of 29 Features, 5nt flanking region     71.32%        65.34%        0.1394
Combination of 36 Features, 7nt flanking region     74.26%        66.46%        0.1562
Combination of 42 Features, 10nt flanking region    76.50%        65.61%        0.1606
Combination of 39 Features, 12nt flanking region    77.81%        64.14%        0.1590
Moreover, the classifiers utilizing features with combined information for both sequence and structure achieved an overall better performance, in terms of improved specificity and MCC, than the ones using sequence or structure information alone. Note that a high specificity score is particularly important in this task, since the number of negative examples is much larger than the number of positive ones. Finally, all classifiers achieved a much higher sensitivity than specificity score, most likely because of the very high similarity between negative and positive examples, as well as the requirement for exact start position match between true and predicted miRNAs.
IV. CONCLUSIONS
In this work, we presented a computational approach that identifies mature miRNAs based on the secondary structure and sequence characteristics of the precursor. We used experimentally verified miRNAs to train and evaluate the performance of a Naive Bayes classifier in terms of sensitivity and specificity.
Unlike the method presented here, most of the computational tools that can be used to predict the functional part of the miRNA precursor estimate their performance accuracy in terms of true positive rate alone (sensitivity), ignoring the false positive rate ([4], [6], [7]). It is a matter of semantics as well as a great challenge to define a true negative example when it comes to mature miRNAs. However, a major issue in such a classification task is not only to maximize the identification of true positives but also to minimize the false positive rate. Our method tries to combine both of these criteria. Since minimizing the false positive rate is very difficult, we plan to develop and incorporate a number of filtering criteria that will help eliminate some of the false positive examples. Moreover, we plan to use our method to predict the mature miRNAs on six novel miRNA genes which were recently identified by Hidden Markov Models in our lab and were experimentally shown to produce short RNAs [17]. This combined computational and experimental approach will result in a more realistic evaluation of our method's performance accuracy.
In conclusion, our findings suggest that position specific sequence and structure information combined with a simple Bayes classifier achieve a good performance on the challenging task of mature miRNA identification.
REFERENCES
[1] L. He and G. J. Hannon, "MicroRNAs: small RNAs with a big role in gene regulation," Nature Reviews Genetics, vol. 5, pp. 522-532, 2004.
[2] D. P. Bartel, "MicroRNAs: Genomics, Biogenesis, Mechanism, and Function," Cell, vol. 116, pp. 281-297, 2004.
[3] X. Liu, K. Fortin, and Z. Mourelatos, "MicroRNAs: Biogenesis and Molecular Functions," Brain Pathology, vol. 18, no. 1, pp. 113-121, 2008.
[4] J.-W. Nam, K.-R. Shin, J. Han, Y. Lee, V. N. Kim, and B.-T. Zhang, "Human microRNA prediction through a probabilistic co-learning model of sequence and structure," Nucleic Acids Research, vol. 33, no. 11, pp. 3570-3581, 2005.
[5] M. Yousef, M. Nebozhyn, H. Shatkay, S. Kanterakis, L. C. Showe, and M. K. Showe, "Combining multi-species genomic data for microRNA identification using a Naive Bayes classifier," Bioinformatics, vol. 22, no. 11, pp. 1325-1334, 2006.
[6] Y. Sheng, P. G. Engstrom, and B. Lenhard, "Mammalian MicroRNA Prediction through a Support Vector Machine Model of Sequence and Structure," PLoS ONE, vol. 2, no. 9, 2007.
[7] M. Tao, "Thermodynamic and structural consensus principle predicts mature miRNA location and structure, categorizes conserved interspecies miRNA subgroups and hints new possible mechanisms of miRNA maturization," Control and Dynamical Systems, California Institute of Technology, Tech. Rep., 2007. [Online]. Available: http://www.citebase.org/abstract?id=oai:arXiv.org:0710.4181
[8] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd Edition). Wiley-Interscience, November 2000.
[9] T. M. Mitchell, Machine Learning. McGraw-Hill International, 1997.
[10] V. Ambros, B. Bartel, D. P. Bartel, C. B. Burge, J. C. Carrington, X. Chen, G. Dreyfuss, S. R. Eddy, S. Griffiths-Jones, M. Marshall, M. Matzke, G. Ruvkun, and T. Tuschl, "A uniform system for microRNA annotation," RNA, vol. 9, pp. 277-279, 2003.
[11] S. Kullback and R. A. Leibler, "On Information and Sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951.
[12] H. Jeffreys, "An Invariant Form for the Prior Probability in Estimation Problems," Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, vol. 186, no. 1007, pp. 453-461, 1946.
[13] S. Griffiths-Jones, "The microRNA Registry," Nucleic Acids Res, vol. 32, no. Database issue, pp. D109-D111, Jan 2004.
[14] S. Griffiths-Jones, R. J. Grocock, S. van Dongen, A. Bateman, and A. J. Enright, "miRBase: microRNA sequences, targets and gene nomenclature," Nucleic Acids Res, vol. 34, no. Database issue, January 2006. [Online]. Available: http://view.ncbi.nlm.nih.gov/pubmed/16381832
[15] S. Griffiths-Jones, H. K. Saini, S. van Dongen, and A. J. Enright, "miRBase: tools for microRNA genomics," Nucleic Acids Res, November 2007. [Online]. Available: http://dx.doi.org/10.1093/nar/gkm952
[16] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen, "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics, vol. 16, no. 5, pp. 412-424, 2000.
[17] A. Oulas, A. Boutla, K. Gkirtzou, M. Reczko, K. Kalantidis, and P. Poirazi, "Prediction of novel microRNA genes for cancer associated genomic regions: a combined computational and experimental approach," in preparation.