Improved and Promising Identification of Human MicroRNAs by Incorporating a High-Quality Negative Set


Arrangement of the Report

1 Introduction

2 Methods

3 Results and Discussion

4 Conclusion


Introduction


Brief Introduction to microRNA

MicroRNAs (miRNAs) are a class of single-stranded, non-coding endogenous RNAs about 22 nucleotides (nt) long. miRNAs play key roles in regulating biological processes, including affecting the stability and translation of mRNAs and negatively regulating gene expression in post-transcriptional processes.


How microRNAs are formed

1) miRNA genes are first transcribed by RNA polymerase II, producing the primary transcripts, which are usually termed pri-miRNAs.

2) The pri-miRNAs are processed by the enzyme Drosha into miRNA precursors (pre-miRNAs) with a distinctive hairpin structure.

3) The pre-miRNAs are exported into the cytoplasm by Exportin-5 and cleaved by the enzyme Dicer to yield miRNA:miRNA* duplexes. One strand of the duplex, denoted with *, is normally degraded.


Traditional methods for identifying miRNAs

1) Computational methods are preferred over experimental methods for reasons of time and money.

2) The main task is to discriminate real pre-miRNAs from pseudo ones.

3) Triplet-SVM, proposed by Xue et al., is a popular tool that trains a support vector machine (SVM) classifier on 32 triplet sequence-structure features of human sequences. It identified pre-miRNAs with about 90 percent accuracy on both human data and data from other species.

4) Currently, the widely used classification algorithms include SVM, hidden Markov model (HMM), random forest (RF), linear genetic programming (LGP), and naive Bayes.


Why do we want to improve the quality of the negative set?

1) It is acknowledged that negative samples are of high quality, or representative, when they are sufficiently similar to the positive samples.

2) Negative samples were usually collected by a parameter filtering method, which selects sequences that share the widely accepted characteristics of real pre-miRNAs.


The parameter filtering method versus the proposed technique

Limitations of the parameter filtering method:

1) Pre-defined parameter types might not be available, because more types of characteristics are being discovered as the number of known miRNAs continues to grow.

2) Confining parameter values to a certain range reduces the specificity of the collected negative samples.

3) Pre-defined parameter assumptions are likely to miss other information about real pre-miRNAs.

Advantages of the proposed technique:

1) It largely reduces dependence on filtering parameters.

2) It is highly adaptable and can be adjusted as new miRNAs are discovered, which ensures that the collected pseudo pre-miRNAs are sufficiently similar to real pre-miRNAs.


Methods


Proposed feature set

Classifier selection and optimization

Data Sets

Negative Sample Selection

A miRNA Mining Tool: mirnaDetect

Measurement


Feature set

The 98-feature set we constructed

1) Primary sequence-based features. For a given RNA sequence S, the trinucleotide frequencies %XYZ are computed, where XYZ is a run of three contiguous nucleotides in S and X, Y and Z each belong to the set {A, U, C, G}. This gives 4*4*4 = 64 features.

2) Secondary structure-based features. The minimum free energy (MFE) and the base-pair content of the secondary structure, both of which differ significantly between real and pseudo pre-miRNAs. This gives 2 features.

3) Sequence-structure-based features. Features containing both local sequence and structure information were also considered. We used the 32 triplet sequence-structure features. This gives 2*2*2*4 = 32 features.
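The 64 + 32 counted features above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the structure is assumed to be given in dot-bracket notation (as produced by RNAfold), and the triplet features are counted over every interior position, which may differ from the exact counting rule used in Triplet-SVM.

```python
from itertools import product

NUCLEOTIDES = "AUCG"


def trinucleotide_frequencies(seq):
    """Return the 64 %XYZ frequencies (in percent) of an RNA sequence."""
    seq = seq.upper().replace("T", "U")
    keys = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]
    counts = dict.fromkeys(keys, 0)
    windows = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows with ambiguous bases such as N
            counts[tri] += 1
            windows += 1
    return {k: (100.0 * v / windows if windows else 0.0) for k, v in counts.items()}


def triplet_features(seq, structure):
    """Return 32 features: middle nucleotide plus the paired/unpaired
    status of three adjacent positions, normalised to frequencies."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Treat ')' like '(' so each position is either paired '(' or unpaired '.'.
    status = structure.replace(")", "(")
    keys = [n + "".join(s) for n in NUCLEOTIDES for s in product("(.", repeat=3)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(1, len(seq) - 1):
        key = seq[i] + status[i - 1:i + 2]
        if key in counts:
            counts[key] += 1
            total += 1
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy hairpin (made up for illustration).
tri = trinucleotide_frequencies("GGGAAAUCCC")
trip = triplet_features("GGGAAAUCCC", "(((....)))")
print(len(tri), len(trip))  # 64 32
```

Adding the MFE and the base-pair content as two further numbers would complete the 98-dimensional vector.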


Classifier selection and optimization

1) The SVM algorithm was used as the classification algorithm in this research.

2) The kernel function is the radial basis function (RBF).

3) A grid search was conducted with LibSVM on the training set to obtain the optimal parameters.
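As a rough sketch of what the RBF kernel and the grid search amount to, the Python below evaluates every (C, gamma) pair on a log2 grid and keeps the best-scoring one. The names `rbf_kernel`, `grid_search` and `toy_score` are hypothetical, the exponent ranges are the exponentially growing grid commonly suggested for LibSVM rather than values stated in the slides, and the scoring function is a toy stand-in for the cross-validated accuracy that a real run would obtain by training LibSVM.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def grid_search(score_fn, c_exponents, gamma_exponents):
    """Try every (C, gamma) pair on a log2 grid and return the best one.

    score_fn(C, gamma) is a hypothetical callback that should return a
    cross-validated score (higher is better) for an RBF SVM.
    """
    best = None
    for c_exp, g_exp in product(c_exponents, gamma_exponents):
        c, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        score = score_fn(c, gamma)
        if best is None or score > best[0]:
            best = (score, c, gamma)
    return best


def toy_score(c, gamma):
    """Toy surface peaking at C = 2^3, gamma = 2^-5 (illustration only)."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_score, best_c, best_gamma = grid_search(
    toy_score, range(-5, 16, 2), range(-15, 4, 2)
)
print(best_c, best_gamma)
```

In practice `score_fn` would wrap a k-fold cross validation on the training set, so each grid point costs k model fits.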


Negative sample selection

How to select the negative samples?

Step 1. Search the CDSs (coding region sequences) for sub-sequences homologous to known mature miRNAs, using BLAST with its default settings. Collect the homologous sub-sequences into a “homology set” called S-homology.

Step 2. Extend every element of S-homology by 100 nt of flanking sequence upstream and downstream, and compute the secondary structures of the flanked elements with RNAfold. The extracted sub-sequences are collected into a “pre-miRNA-like candidate” set called S-candidate.

Step 3. Filter S-candidate. At the first level, two widely accepted criteria of real pre-miRNAs (MFEI > 0.8 and 0.3 < GC% < 0.7) are used as filters.


Step 4. The S-candidate elements that remain after the Step 3 filtering are fed into a LibSVM classifier model. The elements predicted as “positive” are considered potential real pre-miRNAs and are collected in a new set. (Presenter's note: I think the true positives among them should be deleted.) This new set replaces the negative training set of the prediction model, and the original prediction model is rebuilt to generate a new prediction model. Step 1 to Step 4 are then repeated until i > 10. In this way, we obtain a set of negative sets and a set of prediction models P.
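The first-level filter in Step 3 can be sketched as follows. This is only an illustration under stated assumptions: the slides do not define MFEI, so the code assumes the common definition (absolute MFE per 100 nt, divided by the GC percentage), the MFE value is assumed to come from an external RNAfold run, and all function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C nucleotides in the sequence (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def mfei(seq, mfe):
    """Assumed definition: MFEI = (|MFE| / length * 100) / (GC content in %)."""
    gc_percent = 100.0 * gc_fraction(seq)
    if gc_percent == 0:
        return 0.0
    amfe = abs(mfe) / len(seq) * 100.0  # adjusted MFE per 100 nt
    return amfe / gc_percent


def passes_first_level(seq, mfe, min_mfei=0.8, gc_low=0.3, gc_high=0.7):
    """Keep a candidate only if MFEI > 0.8 and 0.3 < GC fraction < 0.7."""
    return mfei(seq, mfe) > min_mfei and gc_low < gc_fraction(seq) < gc_high


# Made-up 100-nt candidate with 50% GC; the MFE values are invented.
candidate = "GC" * 25 + "AU" * 25
print(passes_first_level(candidate, mfe=-45.0))  # True  (MFEI is about 0.9)
print(passes_first_level(candidate, mfe=-30.0))  # False (MFEI is about 0.6)
```

Candidates that survive this filter are the ones passed on to the classifier in Step 4.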


After eliminating from the selected negative set those CDS sequences that shared more than 50% sequence identity with real pre-miRNAs, 14,661 hairpin-like sequences (negative samples) remained in the final negative set.


Data set and Measurement

Data set

Positive: Redundant sequences were removed from the positive set, leaving 16,520 non-redundant pre-miRNAs, including 1,496 human and 13,588 non-human pre-miRNAs, in the final positive set.

Negative: A total of 14,661 pseudo pre-miRNAs, including 1,446 pseudo human pre-miRNAs.

Training set: 1,155 real and 1,155 pseudo human pre-miRNAs.

Test set: Several kinds of test set are used; they are introduced on the following slides.


Measurement

Sensitivity (SE, the same as recall), specificity (SP), geometric mean (Gm) and accuracy (Acc).
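For reference, the four measures can be computed from the confusion-matrix counts as in the short Python sketch below. The formulas are the standard definitions; the function name and the example counts are made up for illustration.

```python
import math


def evaluate(tp, fn, tn, fp):
    """Compute SE, SP, Acc and Gm from confusion-matrix counts."""
    se = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + fn + tn + fp)  # accuracy
    gm = math.sqrt(se * sp)                # geometric mean of SE and SP
    return {"SE": se, "SP": sp, "Acc": acc, "Gm": gm}


# Hypothetical counts: 100 real pre-miRNAs, 100 pseudo pre-miRNAs.
scores = evaluate(tp=90, fn=10, tn=80, fp=20)
print(scores)
```

Gm is useful alongside Acc because it stays low whenever either SE or SP is low, even on unbalanced test sets.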


Results/Discussion


Importance of Negative Samples

Importance of the Representative Negative Samples

The graphs reveal the importance of negative samples for machine learning algorithms and indicate that the higher the similarity between the positive and negative training sets, the better the classifier performs.


Representativeness of Our Negative Set

Modeling the Triplet-SVM classifier with the new negative set.

Modeling the new miRNApre classifier with Xue’s negative set.


A state-of-the-art classifier (Mirident) was also retrained using our negative set, producing a new training model. Its performance on the virus sets EBV, HCMV, MGHV68 and KSHV is not good. It is expected that, as more viral pre-miRNAs are discovered, the new model will perform better than the original one.


Performance of MiRNApre

Performance of the LibSVM Classifier

All four classifiers based on the proposed feature set were evaluated with 10-fold cross validation on our training set. This experimental result also confirmed the high efficiency of the SVM algorithm.


Performance of LibSVM on the Proposed Feature Set

Ten-fold cross validation was used to evaluate the performance of LibSVM on the same training set with different feature sets:

2 structure-based features

32 sequence-structure-based features

64 primary sequence-based features

The results imply that the features in C(%XYZ) may contain important attributes for the identification of human pre-miRNAs. The next slide supports this.


We also investigated the importance of each feature in the proposed combined feature set, which we expect will help researchers select the features that are “important” in their specific situations. The top 10 “important” features are listed in the table, which indicates their major influence on distinguishing real from pseudo human pre-miRNAs.

Feature group                         Features in the top 10
Primary sequence-based features       9
Structure-based features              1
Sequence-structure-based features     0


Analyzing the Performance of miRNApre

We used a 10-fold cross validation test to evaluate the performance of miRNApre on the training set, which contains 1,155 real pre-miRNAs and 1,155 pseudo pre-miRNAs. In real applications, a high SP is meaningful because there are far more pseudo pre-miRNAs than real pre-miRNAs in genome data.

Real pre-miRNAs   Pseudo pre-miRNAs   SP      SE      Acc     Gm
1,155             1,155               98.2%   97.9%   98.1%   98%
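To see why specificity matters so much under class imbalance, the Python sketch below applies the SE and SP values from the table to a hypothetical genome scan. The counts of 1,000 real and 1,000,000 pseudo hairpins are invented for illustration and do not come from the slides; the function name is also hypothetical.

```python
def expected_outcome(n_real, n_pseudo, se, sp):
    """Expected true positives, false positives and precision for a scan."""
    tp = se * n_real            # real pre-miRNAs correctly detected
    fp = (1.0 - sp) * n_pseudo  # pseudo pre-miRNAs wrongly reported
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return tp, fp, precision


# Hypothetical scan: 1,000 real hairpins hidden among 1,000,000 pseudo ones.
tp, fp, precision = expected_outcome(1_000, 1_000_000, se=0.979, sp=0.982)
print(f"TP={tp:.0f}  FP={fp:.0f}  precision={precision:.3f}")
```

Under these assumed counts, about 979 true positives are accompanied by roughly 18,000 false positives, so even a specificity of 98.2% leaves most reported candidates false. Each additional point of SP removes about 10,000 false candidates in this scenario.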


The test set contains 69 newly found human pre-miRNAs.

miRNApre performed better, and even on the virus sets its performance is comparable.


Performance of mirnaDetect

The method should generate as few false positives as possible, to save the time and money spent on experiments to validate them. From this viewpoint, MIReNA and CSHMM generated 10,626 and 18,258 pre-miRNA candidates, respectively, while mirnaDetect found only 2,645 candidates.

Method        SE         SP
mirnaDetect   adequate   high


Conclusions


1) In this study, we explored the importance of representative negative samples for machine learning based methods for pre-miRNA identification. We found that existing negative sets suffer from low quality, and based on them it is difficult to generate an effective and promising prediction model.

2) To improve the quality of negative samples, we proposed a multi-level negative sample selection method and successfully constructed a high-quality negative set.

3) The high accuracy of our miRNApre method on different data sets suggests that our method is a promising tool for miRNA identification.


About the Authors

Leyi Wei received the BSc degree in computing mathematics and the MSc degree in computer science from Xiamen University, China.

Minghong Liao received the MSc and PhD degrees in computer science and engineering from Harbin Institute of Technology, China, in 1988 and 1993, respectively.

Yue Gao received the BS degree from the Harbin Institute of Technology, China, in 2005, and the ME and PhD degrees from Tsinghua University, Beijing, China.


Any questions? We can discuss!

Q&A


Thanks!

魏琪康