LncFinder: an integrated platform for long non-coding RNA

LncFinder: an integrated platform for long non-coding

RNA identification utilizing sequence intrinsic

composition, structural information and

physicochemical propertySiyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du,Cankun Wang and Ying LiCorresponding author: Ying Li, College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering ofMinistry of Education, Jilin University, Changchun 130012, China. Tel.: 86-13504319660; E-mail: [email protected]

Abstract

Discovering new long non-coding RNAs (lncRNAs) has been a fundamental step in lncRNA-related research. Nowadays,many machine learning-based tools have been developed for lncRNA identification. However, many methods predictlncRNAs using sequence-derived features alone, which tend to display unstable performances on different species.Moreover, the majority of tools cannot be re-trained or tailored by users and neither can the features be customized or inte-grated to meet researchers’ requirements. In this study, features extracted from sequence-intrinsic composition, secondarystructure and physicochemical property are comprehensively reviewed and evaluated. An integrated platform namedLncFinder is also developed to enhance the performance and promote the research of lncRNA identification. LncFinderincludes a novel lncRNA predictor using the heterologous features we designed. Experimental results show that our methodoutperforms several state-of-the-art tools on multiple species with more robust and satisfactory results. Researchers canadditionally employ LncFinder to extract various classic features, build classifier with numerous machine learning algo-rithms and evaluate classifier performance effectively and efficiently. LncFinder can reveal the properties of lncRNA andmRNA from various perspectives and further inspire lncRNA–protein interaction prediction and lncRNA evolution analysis.It is anticipated that LncFinder can significantly facilitate lncRNA-related research, especially for the poorly explored spe-cies. LncFinder is released as R package (https://CRAN.R-project.org/package¼LncFinder). A web server (http://bmbl.sdstate.edu/lncfinder/) is also developed to maximize its availability.

Siyu Han is a graduate student at the College of Computer Science and Technology, Jilin University, Changchun, China. His research interests includecomputational biology and machine learning methods.Yanchun Liang is a professor at the College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineering ofMinistry of Education, Jilin University, Changchun, China. He is also a professor at Zhuhai Laboratory of Key Laboratory of Symbol Computation andKnowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China. His research interests include computational intelli-gence, machine learning methods, text mining, and bioinformatics.Qin Ma is the director of the Bioinformatics and Mathematical Biosciences Lab and an assistant professor in the Department of Agronomy, Horticultureand Plant Science at South Dakota State University. He is also an adjunct faculty member in the Department of Mathematics and Statistics at SouthDakota State University and BioSNTR, SD, USA.Yangyi Xu is a graduate student at the College of Computer Science and Technology, Jilin University, Changchun, China.Yu Zhang is an associate professor at the College of Computer Science and Technology, Jilin University, Changchun, China.Wei Du is an associate professor at the College of Computer Science and Technology, Jilin University, Changchun, China.Cankun Wang is a graduate student in the Department of Mathematics and Statistics, South Dakota State University, SD, USA.Ying Li is an associate professor at the College of Computer Science and Technology, Key Laboratory of Symbol Computation and Knowledge Engineeringof Ministry of Education, Jilin University, Changchun, China. Her research topics include machine learning, bioinformatics and computational biology.Submitted: 19 May 2018; Received (in revised form): 20 June 2018

VC The Author(s) 2018. Published by Oxford University Press.This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.For commercial re-use, please contact [email protected]

1

Briefings in Bioinformatics, 2018, 1–19

doi: 10.1093/bib/bby065Review article

Downloaded from https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bby065/5062950by gueston 06 August 2018

https://CRAN.R-project.org/package=LncFinder


http://bmbl.sdstate.edu/lncfinder/


https://academic.oup.com/

Key words: sequence intrinsic composition; multi-scale secondary structure; EIIP physicochemical property;machine learning; predictive modeling

Introduction

Long non-coding RNAs (lncRNAs), one kind of transcripts thatare longer than 200 nucleotides and unable to encode proteinsin the intracellular space, have been at the forefront in recentyears [1–3]. Several studies indicate that more than 80% of thehuman genome has biochemical functions, whereas only lessthan 2% of the genome can be translated into proteins [4, 5].Furthermore, up to 70% of the non-coding sequences are tran-scribed into lncRNAs [6]. All these figures suggest that lncRNAsembrace lots of valuable information awaiting our exploration.Only a small fraction of lncRNAs have been studied, but scien-tists have discovered a wide range of biological processes thatlncRNAs involved, such as epigenetic regulation, metabolicprocesses, chromosome dynamics and cell differentiation [7–12]. Lots of evidence have indicated that lncRNAs are highlyrelevant to various complex human diseases [13] such as lungcancer [14] Alzheimer diseases [15] and cardiovascular diseases[16]. Database LncRNADisease [17] and Lnc2Cancer [18] havecollected thousands of experimentally verified relations be-tween lncRNAs and diseases, which have also confirmed the in-timate connections between lncRNAs and diseases.

Nowadays, next-generation sequencing technologies havefurnished us thousands and thousands of unclassified tran-scripts, demanding prompt studies. From identification to an-notation, many platforms and databases have been developedto facilitate research on lncRNAs [19, 20]. LncRNA identificationis the fundamental step of lncRNA research. Many methods andtools have been developed using machine learning techniques.CPC (Coding Potential Calculator) [21] aligns sequences againstreference protein database, which is highly representative ofthe alignment-based tools. As an extremely powerful tool forcoding potential assessment, CPC is not tailored for lncRNAidentification, but it can predict lncRNAs according to the openreading frame (ORF) information and alignment results ofBLASTX [22]: transcripts with great coding potential tend to pos-sess HITs and ORFs with relatively high quality and thus beingclassified as protein-coding RNAs. Nonetheless, several majorlimitations can hardly be avoided because of CPC’s greatreliance upon reference protein databases. First, lncRNAs areless conserved than mRNAs. A high proportion of lncRNAs dis-play many features similar to protein-coding sequences [23],which may mislead CPC and make it incorrectly categorizelncRNAs as mRNAs. Second, CPC requires a high-quality and ra-ther comprehensive database, but many species have only in-sufficient annotations. Moreover, CPC relies heavily on theoutputs of BLASTX, but multiple-sequence alignment tools can-not guarantee optimal alignments [24]. Finally, the extremelytime-consuming process of alignment makes the use of CPC onmassive-scale data difficult. Apart from BLASTX, other types ofalignment are also employed to identify lncRNAs. PhyloCSF [25]is based on multiple alignments and phylogenetic model com-parison; COME [26] uses features from BLASTX and phastConsscore [27]; lncRNA-ID [28] and lncRScan-SVM [29] employprofile-hidden Markov model-based alignment [30] andPhastCons score, respectively.

Owing to the restrictions of CPC, several alignment-freeapproaches are developed afterward. CPAT (Coding PotentialAssessment Tool) [24], CNCI (Coding-Non-Coding Index) [31]and PLEK (predictor of long noncoding RNAs and messenger

RNAs based on an improved k-mer scheme) [32] are the typicalexamples of this category. The main advantage of alignment-free tools is high efficacy without loss of accuracy. CPAT calcu-lates Fickett TESTCODE score [33, 34] and hexamer score on ORFregion to measure the differences of nucleotide position andcodon usage between non-coding transcripts and protein-coding transcripts. The features of CNCI are based on adjoiningnucleotide triplets (ANTs) matrix and unequal distribution ofcodons (codon bias). PLEK employs improved k-mer scheme toclassify the sequences. COME selects Infernal result [48], expres-sion data [49, 50] and histone modification [4] as features. Toovercome the drawbacks of CPC, a new lncRNA identificationtool named CPC2 [35] was developed recently. Unlike CPC, CPC2is an alignment-free tool and is based on only sequence-intrin-sic features. Compared with CPC, CPC2 displays substantial im-provement in accuracy and efficiency. In addition to severalclassic features such as ORF information and Fickett TESTCODEscore, CPC2 also utilizes isoelectric point (pI) [36, 37] to calculatecoding potential and thus predicts lncRNAs. Alignment-freelncRNA identification tools also include DeepLNC [38], lncScore[39] and FEElnc [40]. All these popular lncRNA identificationmethods are summarized in Table 1. From Table 1, it can beobserved that features of many methods are based upon adjoin-ing nucleotide frequencies, directly or indirectly. The essence ofthese kinds of features is to evaluate the differences in intrinsiccomposition between lncRNAs and mRNAs. However, the prob-lem is the sequence compositions varies from species to species,and thus these methods provide very unstable performances ondifferent species [41]. One possible way to cushion this negativeeffect is re-training the machine learning model for differentspecies, although only CPAT and PLEK can be re-trained byusers. However, the limitation of this remedy includes insuffi-cient sequences of many species that makes it impossible to tai-lor the model specifically for every species. Thus, the pre-builtmodel should be well qualified for various species.

Different tools also select different machine learning algo-rithms to construct a classifier: CPC, CNCI, PLEK, lncRScan-SVMand CPC2 use support vector machine (SVM); CPAT andlncScore employ logistic regression, whereas lncRNA-ID, COMEand FEElnc are based on random forest or balanced random for-est [46, 47]. It is also worth mentioning that DeepLNC is con-structed using deep learning algorithm, deep neural network(DNN).

All these approaches can conduct lncRNA identification, butit is difficult for users to customize the tools for non-model or-ganism transcriptomes or analyze sequences with specific fea-tures. In addition, different tools employ different machinelearning algorithms. But to what extent will different machinelearning algorithms alter the classifiers’ performances? In thisstudy, we establish an integrated lncRNA identification packageLncFinder, which could help users extract features from differ-ent feature categories, construct classifiers with various ma-chine learning algorithms and evaluate the performances ofdifferent feature combinations or machine learning algorithms.Besides, methods based on k-mer frequencies often have a largenumber of features and demonstrate unstable results on differ-ent species. LncFinder includes two schemes to refine the fea-tures and enhance the stability. Experimental results show that

2 | Han et al.


Deleted Text: .

Deleted Text: ,

Deleted Text: &hx2019;s

Deleted Text: ,

Deleted Text: the

Deleted Text: ,

Deleted Text: reliant

Deleted Text: incorrectly

Deleted Text: only

Deleted Text: the

Deleted Text: be hardly applied to

Deleted Text: profile

Deleted Text: Due

Deleted Text: is

Deleted Text: only

Deleted Text: sequence

Deleted Text: a

Deleted Text: predicting

Deleted Text: ,

Deleted Text: this

Deleted Text: which makes

Deleted Text: present

Deleted Text: is

Deleted Text: ,

Deleted Text: utilize

Deleted Text: ,

Deleted Text: 42

Deleted Text: Deep

Deleted Text: Neural

Deleted Text: Network

Tab

le1.

Ove

rvie

wo

fm

ach

ine

lear

nin

g-ba

sed

lncR

NA

iden

tifi

cati

on

too

ls

Met

ho

ds

CPC

[21]

CPA

Ta

[24]

CN

CI[

31]

PLEK

[32]

LncR

NA

-ID

[28]

lncR

Scan

-SV

M[2

9]D

eep

LNC

[38]

CO

ME

[26]

CPC

2[3

5]

Yea

r20

0720

1320

1320

1420

1520

1520

1620

1620

17C

ateg

ory

Pro

tein

-co

din

gp

ote

nti

alca

lcu

lato

r

Pro

tein

-co

din

gp

ote

nti

alca

lcu

lato

r

lncR

NA

pre

dic

tor

lncR

NA

pre

dic

tor

lncR

NA

pre

dic

-ti

on

met

ho

dln

cRN

Ap

red

icto

rln

cRN

Ap

red

icto

rPr

ote

in-c

od

ing

po

ten

tial

calc

ula

tor

Pro

tein

-co

din

gp

ote

nti

alca

lcu

lato

rIn

pu

tFo

rmat

FAST

AFA

STA

,BED

FAST

A,G

TF

FAST

AG

TF

FAST

AG

TF

FAST

A,G

TF

Spec

ies

Mu

lti-

spec

ies

Hu

man

,mo

use

,V

erte

brat

e,p

lan

tV

erte

brat

e,p

lan

tH

um

an,m

ou

seh

um

anh

um

an,m

ou

se,

Mu

lti-

spec

ies

fly,

zebr

afish

fly,

wo

rm,p

lan

tR

equ

irem

ents

Lin

ux,

BLA

ST,

Lin

ux,

Pyth

on

2.7,

Lin

ux,

Pyth

on

2.7

Lin

ux,

Pyth

on

2.7

Lin

ux,

Pyth

on

2.7,

Lin

ux,

RLi

nu

x,Py

tho

n,

Pro

tein

dat

abas

eR

Bio

pyt

ho

nB

iop

yth

on

Mo

del

SVM

Logi

stic

regr

essi

on

SVM

SVM

Bal

ance

dra

nd

om

fore

stSV

Md

eep

neu

ral

net

wo

rkba

lan

ced

ran

do

mfo

rest

sup

po

rtve

cto

rm

ach

ine

Feat

ure

sO

RF

info

rmat

ion

,B

LAST

X[2

2]O

RF

len

gth

,tra

n-

scri

pt

len

gth

,Fi

cket

tT

EST

CO

DE

sco

re[3

3,34

],H

exam

ersc

ore

AN

Tin

form

a-ti

on

,co

do

nbi

as

Imp

rove

dk-

mer

freq

uen

cies

OR

Fle

ngt

han

dco

vera

ge,r

ibo

-so

me

inte

r-ac

tio

n[4

2–44

],p

rofi

leH

MM

-ba

sed

alig

n-

men

t[3

0]

Co

un

to

fst

op

cod

on

,exo

nin

form

atio

n,

txC

dsP

red

ict

sco

re[4

5],

Phas

tCo

ns

sco

re[2

7]

k-m

erfr

equ

enci

esG

Cco

nte

nt,

BLA

STX

[22]

,Ph

astC

on

ssc

ore

[ 27]

,rib

o-

som

ep

rofi

lin

g[3

],IN

FER

NA

Lre

sult

[48]

,ex-

pre

ssio

nd

ata

[49,

50],

his

-to

ne

mo

difi

ca-

tio

n[4

]

OR

Fin

form

atio

n,

Fick

ett

TES

TC

OD

Esc

ore

[33,

34],

iso

elec

tric

po

int

[36,

37]

Re-

trai

nin

g�

�Pu

bMed

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/17

6316

15

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/23

3357

81

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/23

8924

01

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/25

2390

89

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/26

3159

01

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/26

4373

38

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/27

6087

26

htt

ps:

//w

ww

.n

cbi.n

lm.n

ih.

gov/

pu

bmed

/28

5210

17So

ftw

are

htt

p:/

/cp

c.cb

i.p

ku.e

du

.cn

/d

ow

nlo

ad

htt

ps:

//so

urc

efo

rge.

net

/pro

jec

ts/r

na-

cpat

htt

ps:

//gi

thu

b.co

m/w

ww

-bio

info

-org

/CN

CI

htt

ps:

//so

urc

efo

rge.

net

/pro

jec

ts/p

lek

htt

ps:

//so

urc

efo

rge.

net

/pro

jec

ts/

lncr

scan

svm

htt

ps:

//bi

ose

rver

.ii

ita.

ac.in

/d

eep

lnc

htt

ps:

//gi

thu

b.co

m/l

ula

b/C

OM

E

htt

p:/

/cp

c2.c

bi.

pku

.ed

u.c

n/

do

wn

load

.ph

p

Web

Serv

erh

ttp

://c

pc.

cbi.

pku

.ed

u.c

nh

ttp

://l

ilab

.re

sear

ch.b

cm.

edu

/cp

at

htt

p:/

/cp

c2.c

bi.

pku

.ed

u.c

n

Abr

ief

sum

mar

yo

fse

vera

lln

cRN

Aid

enti

fica

tio

nto

ols

.Am

on

gth

em

eth

od

ssu

mm

ariz

edab

ove

,th

em

ajo

rity

of

too

lsid

enti

fyln

cRN

As

wit

hse

qu

ence

-der

ived

feat

ure

sal

on

e,an

do

nly

CPA

Tan

dPL

EKca

nbe

re-t

rain

edby

use

rs.

Dee

pLN

Co

nly

pro

vid

esw

ebse

rver

,wh

ile

CN

CI,

PLEK

,ln

cRSc

an-S

VM

and

CO

ME

can

bed

ow

nlo

aded

for

loca

luse

.CPC

,CPA

Tan

dC

PC2

are

rele

ased

asst

and

-alo

ne

app

lica

tio

nas

wel

las

web

serv

er.A

llth

est

and

-alo

ne

too

lsli

sted

in

the

tabl

ere

qu

ire

Lin

ux

op

erat

ing

syst

em.

aC

PAT

has

up

dat

edit

sfe

atu

res

inth

ela

test

vers

ion

.Ori

gin

alfe

atu

re”

cove

rage

of

OR

F”h

asbe

enre

pla

ced

wit

hfe

atu

re‘le

ngt

ho

fth

etr

ansc

rip

t’.F

eatu

res

‘Fic

kett

TES

TC

OD

Esc

ore

’an

d‘h

exam

ersc

ore

’are

calc

ula

ted

on

OR

Fre

gio

n.

LncFinder | 3


https://www.ncbi.nlm.nih.gov/pubmed/17631615
































http://cpc.cbi.pku.edu.cn/download



https://sourceforge.net/projects/rna-cpat



https://github.com/www-bioinfo-org/CNCI



https://sourceforge.net/projects/plek



https://sourceforge.net/projects/lncrscansvm




https://bioserver.iiita.ac.in/deeplnc



https://github.com/lulab/COME



http://cpc2.cbi.pku.edu.cn/download.php



http://cpc.cbi.pku.edu.cn

http://cpc.cbi.pku.edu.cn

http://lilab.research.bcm.edu/cpat



http://cpc2.cbi.pku.edu.cn

http://cpc2.cbi.pku.edu.cn

the schemes we proposed achieve satisfactory accuracy andstability. Apart from sequence composition features used byexisting tools, we are also wondering whether there are anyother kinds of critical information embodied in a sequence thatcan be explored to predict lncRNA. Thus, features from second-ary structure and physicochemical property are explored andintroduced to conduct lncRNA identification. In light of thecomprehensive feature exploration and selection, a novellncRNA identification method is also developed. Benchmarkedagainst several state-of-the-art tools, our method presents themost satisfactory overall result on multiple species.

Hexamer score [24], as the most discriminating feature ofCPAT, is an ingenious method to streamline the features of thek-mer scheme. In this article, we defined two measurementsEuclidean-distance and Logarithm-distance to capture the dif-ferences of sequence-intrinsic composition between lncRNAand protein-coding RNA. Having been evaluated on the humandata set, the Euclidean-distance scheme achieved an accuracy of0.8484 and Logarithm-distance achieved 0.8521, while the hex-amer score of CPAT obtained an accuracy of 0.6416. We nextinvestigated other kinds of features from the perspective of sec-ondary structure and physicochemical property. In the second-ary structure-derived feature group, multi-scale structuralinformation was introduced. Furthermore, electron–ion inter-action pseudo-potential (EIIP) [51] was employed to calculatephysicochemical features. These two feature groups were eval-uated comprehensively using recursive feature elimination (RFE)and the feature selection algorithm we designed. Tested on thesame data set as sequence-derived features, feature combina-tions from multi-scale secondary structure category obtained anaccuracy of 0.8525, while EIIP-based physicochemical featuresobtained an accuracy of 0.8853. Together with sequence-intrin-sic composition features, three feature groups were incorporatedinto five different machine learning classifiers constructed withlogistic regression, SVM, random forest, extreme learning ma-chine (ELM) and deep learning algorithms to assess to what de-gree can different machine learning algorithms affect theperformance of these features. The novel lncRNA identificationmethod we proposed was developed using the optimal featurecombination and machine learning algorithm. We finally com-pared this new method with several widely used tools on thedata sets of human (Homo sapiens), mouse (Mus musculus), wheat(Triticum aestivum), zebrafish (Danio rerio), and chicken (Gallus gal-lus). On the one hand, we expect our method can show improve-ments in accuracy and efficiency, and on the other hand, weintend to evaluate each tool’s stability on different species.

LncFinder is an integrated package that can be used to pre-dict lncRNA and analyze the properties of lncRNA. LncFinderaims to offer new perspectives to capture the differences be-tween lncRNAs and mRNA. Compared with existing lncRNAidentification tools, LncFinder has the following merits:

i. LncFinder includes an innovative algorithm predictinglncRNAs using heterologous features from three differentcategories: intrinsic composition of sequence, multi-scalestructural information and physicochemical property basedon EIIP and fast Fourier transform (FFT). This novel methodoutperforms several widely used tools on multiple specieswith more robust and reliable performances.

ii. The machine learning model used by LncFinder is determinedwith comprehensive comparisons. Five classifiers, logistic re-gression, SVM, random forest, ELM and deep learning areevaluated with parameter tuning. The classifier is constructedwith the algorithm that obtains the highest accuracy.

iii. LncFinder is highly flexible and remarkably user-friendly.Almost all classic alignment-free features proposed byother methods can be extracted with LncFinder. As a one-stop platform, LncFinder can complete feature extraction,feature selection, classifier construction and performanceevaluation easily and efficiently. LncFinder’s customizationof features and machine learning algorithm will effectivelyfacilitate research on poorly explored species and lncRNAproperties analysis. The support of parallel computing willalso greatly accelerate the process of feature selection andclassifier construction.

iv. LncFinder is readily accessible and convenient to use.Virtually all lncRNA identification tools require UNIX/Linuxoperating system (OS) and several hundred megabytes(MB), even several gigabytes (GB), of storage space to com-pute the sequences locally, whereas LncFinder is releasedas R package and is compatible with almost all widely usedOS platforms, such as Windows, UNIX/Linux and Mac OS X.Accepted by Comprehensive R Archive Network (CRAN),LncFinder can be installed conveniently in R with only onecommand, and the size of LncFinder is only 2.7 MB. In add-ition, a web server is also developed to provide a practicaland effective alternative for lncRNA identification. The webserver can classify lncRNAs of multiple species and calcu-late sequence coding potential. Informative lncRNA-relatedtools, databases and research progress are also summarizedon our web server for users’ reference. The summaries haverevised some outdated details and are updated regularly.

Feature exploration

In this article, we discuss the discriminating power of threekinds of feature categories, especially secondary structural andphysicochemical features. Classical features employed by exist-ing tools are reviewed, and new features are also designed tooffer a new perspective on lncRNA identification. The frame-work of this research is displayed in Figure 1.

Features are evaluated using the data sets of multiple spe-cies. Data sets of human and mouse used in our experimentsare the same as those of Achawanantakun et al., 2015 [28],which are collected from GENCODE [2, 52] and experimentallyverified data [3]. Data sets of wheat, chicken and zebrafish arecollected from Ensembl [53]. In these data sets, only one tran-script from each gene is used. Detailed information has beensummarized in Table S1 (Supplementary File 1 - Methods). Alldata sets can be downloaded from our web server.

Features of sequence intrinsic composition

Several studies have demonstrated that the distribution ofadjoining bases is different in non-coding RNAs (ncRNAs) andprotein-coding transcripts [24, 31]. The most general method tocapture the distribution differences is k-mer scheme, which isemployed by CNCI (k¼ 3), PLEK (k¼ 1–5), DeepLNC (k¼ 2, 3, 5)and some other tools. Nevertheless, the feature number risesdramatically with the increase of k. Considering that protein-coding transcripts are finally translated into amino acidsequences, the combination of two adjoining amino acidsshould have some patterns, hence a biased usage of thesenucleotides (A, C, G and T). We can therefore distinguishlncRNAs from protein-coding transcripts by measuring hex-amer usage. However, there will be 46 hexamer features if weextract features utilizing k-mer scheme. Too many features will

4 | Han et al.


Deleted Text: utilized

Deleted Text: Distance


Deleted Text: scheme


Deleted Text: the

Deleted Text: ,


Deleted Text: -


Deleted Text: ,

Deleted Text: -

Deleted Text: which

Deleted Text: as well as

Deleted Text: Comparing

Deleted Text: (1)

Deleted Text: ,

Deleted Text: Fast

Deleted Text: Transforms

Deleted Text: -

Deleted Text: (2)


Deleted Text: ,

Deleted Text: (3)

Deleted Text: ,

Deleted Text: the

Deleted Text: -

Deleted Text: (4)

Deleted Text: -

Deleted Text: ,

Deleted Text: ,

Deleted Text: ,

Deleted Text: The d

https://academic.oup.com/bib/article-lookup/doi/10.1093/bib/bby065#supplementary-data


Deleted Text: &hx223C;

Deleted Text: ,

Figure 1. Framework of this study. Data sets used in our experiments are collected from GENCODE and Ensembl. Only one transcript from each gene is used. In addition

to sequence-intrinsic composition, features are also extracted from multi-scale secondary structure and EIIP-based physicochemical property using two feature selec-

tion schemes. Evaluated with 10-fold CV and ROC curve, the optimal feature combination and machine learning algorithm are obtained to develop a new method for

lncRNA identification. This method is benchmarked against five popular tools on five species, and it is finally included in LncFinder, which is a highly flexible package

for lncRNA identification and analysis. LncFinder is published as R package as well as web server.

LncFinder | 5


lead to extremely cumbersome processes of classification andclassifier construction.

Hexamer score [24] of CPAT is a useful way to measure hex-amer frequencies without a large number of features, but thismethod only computes the average hexamer frequencies of thetraining data sets. For any unknown transcript, the hexamerpatterns are scanned, but their frequencies are abandoned.Here, we propose the following two new measurements toquantify the usage bias of hexamer: Euclidean-distance andLogarithm-distance. Each scheme has three features: distanceto lncRNA (Dist.LNC), distance to protein-coding transcript(Dist.PCT) and distance ratio (Dist.Ratio). We define these fea-tures as follows:

EucDist:LNC ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiXðfreq:seqðiÞ � freq:lncðiÞÞ2

q;

log Dist:LNC ¼ 1n

Xln

freq:seqðiÞfreq:lncðiÞ ; i ¼ 1; 2; 3; . . . ;4k;

EucDist:Ratio ¼ EucDist:LNCEucDist:PCT

; log Dist:Ratio ¼ log Dist:LNClog Dist:PCT

;

where freq.seq are the k-adjoining base(s) frequencies of oneunevaluated sequence; freq.lnc denotes the average frequenciesof lncRNAs’ k-adjoining base(s); i denotes the different types ofk-adjoining base(s), and n is the total number of the k-adjoiningbase(s) in one sequence. Based on first two equations,EucDist.LNC and LogDist.LNC can be computed. EucDist.PCT andLogDist.PCT can be obtained similarly. The main idea underlyingthe proposed measurements is to estimate the unevaluated se-quence is ‘close to’ lncRNA or protein-coding sequence. The twomeasurements and hexamer score will be evaluated in ourexperiments with 10-fold cross validation (10-fold CV). Becauseboth the hexamer frequencies in training set and unevaluated

sequences are considered by our measurements, we expectthey can display more stable performances than hexamer scoreon multiple species.

For protein-coding transcripts, the longest ORF closelyresembles coding sequence (CDS), which is the region that canbe translated to amino acids. Although lncRNAs are non-codingtranscripts, the longest ORF can also be regarded as the mostpotential region for encoding amino acids. Because the hexam-ers in CDS encode amino acids for some purposes, we expectthat calculating hexamer frequencies on the longest ORF regionis more sensible than on full sequence. Thus, we will evaluatethe performances of three schemes (Euclidean-distance,Logarithm-distance and hexamer score) on the full sequence aswell as on the longest ORF region. The sliding window will slide1 nt each step on full sequence but slide 3 nt each step on thelongest ORF region to simulate the translation process.

Features of secondary structural information

The secondary structure plays important roles in multiple bio-logical functions and is considered more conserved than pri-mary sequence [54, 55]. But seldom has structural informationbeen employed to predict lncRNA. To explore the discriminatingpower of this category, we here introduce multi-scale secondarystructural features that portray the structural information ofone RNA sequence from the following three levels: stability, sec-ondary structure elements (SSEs) combined with pairing condi-tion and structure-nucleotide sequences. RNA secondarystructures can be obtained from program RNAfold ofViennaRNA Package [56], which calculates secondary structuresusing the minimum free energy (MFE)-based algorithm.

MFE is a basic structural outline displaying the stability ofthe RNA structure and is thus being selected as the low-scalefeature. Although only few lncRNAs are unstable, lncRNAs are,on average, less stable than mRNAs [57]. The box plots of theMFE of lncRNAs and mRNAs are displayed in Figure S1-1 (a),which show that mRNAs generally tend to possess lower MFE.

Figure 2. Illustration of multi-scale secondary structure-derived sequences. As a low-scale feature, MFE is a basic structural outline presenting one RNA sequence’s sta-

bility. Medium-scale sequences briefly sketch RNA information and can be obtained from dot-bracket notation sequence alone without using sequence nucleotide

composition. High-scale sequences can be viewed as a high-resolution panorama displaying the integration of sequence and structural information.

6 | Han et al.






Deleted Text: Protein

Deleted Text: Coding

Deleted Text: Transcript

Deleted Text: ,


Deleted Text: Ratio

Deleted Text: , respectively

Deleted Text: &hx201D;


Deleted Text: Since

Deleted Text: are



Deleted Text: one

Deleted Text: three

Deleted Text: Secondary

Deleted Text: which

Deleted Text: Minimum free energy (

Deleted Text: )


Deleted Text: a minority of


To extract features from higher levels, we first design sixmulti-scale secondary structure-derived sequences. Let seq[n]be an RNA sequence of length N, and the nucleotides aredenoted in lowercase ðseq½n� 2 fa; c; g;ugÞ. Let SS[n] be the sec-ondary structure sequence of seq[n], and SS[n] is defined usingdot-bracket notation

�SS½n� 2 f : ; ð ; Þ g

�. SSEs can depict one

RNA’s basic components, and here we employ the followingfour SSEs: stem (s), bulge (b), loop (l) and hairpin (h). Figure 2illustrates how six multi-scale secondary structural sequencesare obtained from seq[n] and SS[n]. Replacing nucleotides ofseq[n] with corresponding SSEs, we can obtain the first second-ary structure-derived sequence—SSE full sequence (SSE.FullSeq). Regarding the continuous identical SSE as one SSE, anothersequence—SSE abbreviated sequence (SSE.Abbr Seq)—isobtained. In the dot-bracket notation, a dot ‘.’ means unpairedbase and brackets ‘(‘or’)’ represent paired base (also the SSEstem). Thus, SS[n] can be converted to Paired-Unpaired Seq usingthe following formula:

Paired� Unpaired Seq½n� ¼(

U; SS½n� ¼ �

P; SS½n� 6¼ �:

These three sequences can sketch basic RNA informationand be viewed as medium-scale structural information that isconverted from SS[n] alone without using the nucleotide com-position of seq[n].

Just like observing an object using different magnifiers, wecan perceive more details with a high-power magnifier. On ahigh-scale level, three secondary structure-derived sequences,namely, acgu-Dot Sequence (acguD Seq), acgu-Stem Sequence(acguS Seq) and acgu-ACGU Seq, can be obtained by combiningsecondary structure sequence SS[n] and primary sequenceseq[n]:

acguD Seq n½ � ¼(

D; SS n½ � ¼ �

seq n½ �; SS n½ � 6¼ �;

acguS Seq n½ � ¼(

seq n½ �; SS n½ � ¼ �

S; SS n½ � 6¼ �;

acgu�ACGU Seq n½ � ¼

A; seq n½ � ¼ a ^ SS n½ � 6¼ �

C; seq n½ � ¼ c ^ SS n½ � 6¼ �

G; seq n½ � ¼ g ^ SS n½ � 6¼ �

U; seq n½ � ¼ u ^ SS n½ � 6¼ �

seq n½ �; SS n½ � ¼ �

8>>>>>>>><>>>>>>>>:

:

In acguD Seq, unpaired nucleotides are replaced with charac-ter ‘D’, thus acguD Seq can be regarded as a portrait describingthe percentage of the unpaired base and the intrinsic compos-ition of the SSE stem. Similarly, acguS Seq can be viewed as aportrait serving the complementary roles. The third sequenceacgu-ACGU Seq is obtained by converting nucleotides of seq[n]into uppercase if they are paired bases. The combination ofthese three sequences can be considered a high-resolutionpanorama presenting the integration of sequence and structuralinformation.

In our study, two strategies, improved k-mer scheme [58]and Logarithm-distance of k-adjoining bases, are employed to

extract features from six multi-scale secondary structuralsequences. The optimal k will be determined by 10-fold CV.

Features of EIIP-derived physicochemical features

CPC2 uses pI values to reveal the physicochemical differencesbetween lncRNAs and protein-coding transcripts. CPC2attempts to theoretically translate RNA sequence into proteinsequence and applies pI values to the new obtained protein se-quence. In this article, we explore the physicochemical propertyfrom another viewpoint, namely, EIIP values. EIIP was initiallyused to locate exons. Each nucleotide (a, c, g and t) has one EIIPvalue, and these values indicate the energy of delocalized elec-tions in nucleotides [51]. For any DNA sequence, nucleotidescan be replaced with the following EIIP values:fa! 0:1260; c! 0:1340; g! 0:0806; t! 0:1335g. Compared withpI values, EIIP values are directly applied to RNA sequences,which can avoid the potential bias caused by the speculatedtranslation process.

Let Xe[n] be the EIIP indicator sequence of Seq[n]. Using FFTon Xe[n], we can get the corresponding power spectrumfSe k½ �g k ¼ 0; 1; 2; . . . ;N� 1ð Þ:

Xe k½ � ¼XN�1

n¼0

Xe n½ �e�j2pknN ; Se k½ � ¼ jXe k½ �j2:

For protein-coding transcripts, an obvious peak usuallyappears at the N=3 position, but no such peak can be found innon-coding transcripts [59] (see Figure S1-2 for example).Moreover, the power of the protein-coding transcript is general-ly higher than that of lncRNA. Thus, we can capture these dif-ferences with the following features: the signal at 1/3 position(Se

N3

� �), average power (�E) and signal-to-noise ratio (SNR). �E and

SNR are defined as follows:

�E ¼PN�1

k¼0 Se k½ �N

; SNR ¼Se

N3

� ��E

:

From the box plots in Figure S1-1 (b, c, d), it can be noted thatmost of lncRNAs possess lower Se

N3

� �; �E and SNR values.

We additionally sort the power spectrum in descendingorder and calculate the quantiles statistics (Q1, Q2, Q3, mini-mum and maximum values) of power values on differentranges. The ranges are designed with two different ways. Theranges in the first group varies from the top 10 to top 100 of thesorted power spectrum, and the ranges are also from the top10% to 100% of the sorted power spectrum. As the signals ofmRNAs are generally stronger than those of lncRNAs, protein-coding transcripts should tend to have higher values of quan-tiles statistics than lncRNAs. EIIP-based features embody thephysicochemical as well as 3-base periodicity properties ofprotein-coding sequences [60, 61], and we anticipate that fea-tures from this category can present robust results on non-model data sets.

Feature selection and model validation

Feature selection is conducted with 10-fold CV to determine theoptimal feature extraction scheme as well as to evaluate the per-formance of different feature groups. The performances are eval-uated with the following five standard metrics: sensitivity,specificity, accuracy, F-measure and Cohen’s kappa coefficient [62].

LncFinder | 7


Deleted Text: Dot

Deleted Text: Bracket

Deleted Text: Notation

Deleted Text: ,

Deleted Text: :

Deleted Text: :

Deleted Text: )

Deleted Text: Dot

Deleted Text: Bracket

Deleted Text: Notation







Deleted Text: :

Deleted Text: ,



Deleted Text: ,

Deleted Text: the

Deleted Text: -Distance



Deleted Text: ,

Deleted Text: signal

Deleted Text: to


Deleted Text: ,

Deleted Text: ,

Deleted Text: that

Deleted Text: ,

Deleted Text: Kappa

Deleted Text: Coefficient

Sensitivity ¼ TPTPþ FN

; Specificity ¼ TNTNþ FP

;

Accuracy ¼ TPþ TNPþ N

; Kappa ¼ Pr oð Þ � Pr eð Þ1� Pr eð Þ ;

F�Measure ¼ 2� TP2� TPþ FPþ FN

:

In Cohen’s kappa coefficient, Pr(o) denotes the proportion ofunits in which the judges agreed, and Pr(e) is the proportion ofunits for which agreement is expected by chance. In our evalu-ation, lncRNAs are labeled as positive class, and protein-codingtranscripts are labeled as negative class. Based on the results ofRFE and our feature selection algorithm (see Algorithm S3), wecan finally obtain the optimal feature combination, whichcomprises 19 features (see Table 2).

We integrated these features into the models built with sev-eral widely used machine learning algorithms to assess the per-formance of the features as well as the machine learningalgorithms. Logistic regression [63], SVM [64, 65], random forest[66], ELM [67, 68] and deep learning [69] were evaluated with 10-fold CV in our experiments. According to the results, SVM wassuperior to other models, but the differences in the performan-ces were quite subtle (see Result Section).

Result analysis

Feature selection and model validation are conducted usinghuman data set A with 10-fold CV. The selected features andclassic alignment-free features are evaluated on human dataset B. After determining the optimal feature combination andmachine learning algorithm, we actually obtain a novel lncRNAidentification method. Our method is benchmarked against fivepopular machine learning-based lncRNA identification tools,namely, CPC, CPAT, CNCI, PLEK and CPC2, on human data set B,as well as data sets mouse, wheat, zebrafish and chicken to as-sess each tool’s performances and stabilities.

Feature selection

Features discussed in this article can be divided into the fol-lowing three groups: sequence-intrinsic composition, second-ary structural information and EIIP-based physicochemicalproperty. For the sequence-derived features, features ofLogarithm-distance on ORF region achieved an accuracy of0.9598, while features of Euclidean-distance on ORF regionobtained 0.9596. On the whole sequence, Logarithm-distancefeatures obtained an accuracy of 0.8521 and Euclidean-distance features obtained 0.8484. The difference between thetwo measurements’ performances was minor, thoughLogarithm-distance had higher accuracy and F-measure (seeFigure S2-1, Supplementary File 1 for detailed information). Asthe most discriminating feature of CPAT, the hexamer scorehad an accuracy of 0.8458 and 0.6416 on ORF region and thewhole sequence, respectively. Measurements of Logarithm-distance and Euclidean-distance greatly outperform CPAT’shexamer score. We further combined Logarithm-distance fea-tures with the following two classic ORF-related features: thelength and coverage of the longest ORF. The RFE results aredisplayed in Table S2-1 (Supplementary File 1). None of thefive features is redundant. High importance scores ofLogarithm-distance features (see Table S2-2) indicate thatthese three features are of critical importance in the lncRNAidentification. Only five features, Logarithm-distance of hex-amer on ORF region (consists of three features, namely,LogDist.LNC, LogDist.PCT and LogDist.Ratio), the length andcoverage of the longest ORF, can highly represent the se-quence-intrinsic information of one RNA sequence. Fivesequence-derived features presented an accuracy of 0.9630and an F-measure of 0.9628 on human data set A (Table 3).

Figures S2-2 to S2-4 show the performances of k-mer fea-tures extracted from multi-scale secondary structure-derivedsequences. It seems that features based on the k-mer schemedisplayed a passable result. Nonetheless, the accuracy droppedwhen secondary structure-based features were combined withsequence feature group (see Table S2-3). Figures S2-5 and S2-6display the performances of multi-scale secondary structuralfeatures extracting with Logarithm-distance measurement.Except for subgroup SSE.Abbr Seq, the performances showed nomajor difference with those of k-mer features, but the featuresare refined, and the feature number of each subgroup is reducedto 3. Moreover, Logarithm-distance features of subgroups acguDSeq, acguS Seq and acgu-ACGU Seq boosted the accuracy ofsequence-derived features (Table S2-4), which confirmed thediscriminating power of secondary structure and the feasibilityof Logarithm-distance measurement. Hence, we selectedscheme Logarithm-distance to extract features of these threesubgroups. Although subgroups SSE.Full Seq, SSE.Abbr Seq andPaired-Unpaired Seq, regardless of calculating k-mer frequenciesor Logarithm-distance, cannot improve the performance fur-ther, some useful information can still be extracted. According

Table 2. Features selected from three feature groups

Sequence-intrinsic composition Multi-scale structural information EIIP-based physicochemical property

Logarithm-distancea of hexamer on ORF Minimum free energy (MFE) Signal at 1/3 position (Se[N=3])Length of the longest ORF UP frequency of paired–unpaired sequence SNRCoverage of the longest ORF Logarithm-distancea of acguD sequence Quantile statistics (Q1, Q2, min and max)

Logarithm-distancea of acgu-ACGU sequence

aLogarithm-distance consists of three features: LogDist.LNC, LogDist.PCT and LogDist.Ratio.

Table 3. Performances of each feature group on human data set A

Feature group Sensitivity Specificity Accuracy F-measure Kappa

Sequence 0.9555 0.9705 0.9630 0.9628 0.9261SSa 0.8129 0.8921 0.8525 0.8464 0.7050EIIP 0.9021 0.8686 0.8853 0.8872 0.7706All features 0.9642 0.9726 0.9684 0.9682 0.9368

aMulti-scale structural features. The results are obtained from 10-fold CV. Bold

numbers indicate the highest value.

8 | Han et al.


Deleted Text: Kappa

Deleted Text: Coefficient

Deleted Text: is comprised

Deleted Text: of

Deleted Text: -

Deleted Text: ,

Deleted Text: ,

Deleted Text: ,


Deleted Text: ,













Deleted Text: -Distance



Deleted Text: :

Deleted Text: ,

Deleted Text: see




Deleted Text: ,




Deleted Text: ,


Deleted Text: there may still embrace

to the results of Figure S2-3, we calculated the importancescores of 4-mer frequencies of Paired-Unpaired Seq and 2-mer fre-quencies of SSE.Abbr Seq by conducting RFE algorithm, and thefeature with the highest score of each subgroup were includedin this feature group (see Table S2-5).

Now three feature subgroups (Logarithm-distance of acguDSeq, acguS Seq and acgu-ACGU Seq) and three features (MFE, UPfrequency of Paired-Unpaired Seq and bulge frequency ofSSE.Abbr Seq) derived from the multi-scale secondary structuremay enhance the performance of sequence-derived feature.However, it is still necessary to perform feature selection to de-termine the optimal feature combination. Because the informa-tion of secondary structure-derived sequence has beenembodied in Logarithm-distance features, we avoid selectingkey features by performing RFE algorithm, which may detachthree Logarithm-distance features. We re-evaluated the featuregroup, which consists of three subgroups and three featureswith a new algorithm displayed in Algorithm S2. This algorithm

ranks different feature groups according to their average per-formances of 10-fold CV. For each iteration, the feature groupthat shows the best improvement in accuracy will be added tothe selected feature set. One feature group alone may not im-prove the performance, but it may boost the result by combin-ing with other feature groups. To avoid missing potentialfeature groups, all the feature groups with the highest score ofeach iteration will be evaluated. The final results of feature se-lection were summarized in Table S2-6. Eight multi-scale sec-ondary structural features were determined, namely, MFE, UPfrequency of Paired-Unpaired Seq, Logarithm-distance of acguDSeq and acgu-ACGU Seq (See Table 2). Eight secondary structuralfeatures obtained an accuracy of 0.8525 and an F-measure of0.8464 on human data set A.

As to the features based on EIIP values, subgroup quantilestatistics on the position of top 10% presented the best per-formance in our experiments (see Figure S2-7). Depending onthe results of RFE, the signal at 1/3 position (Se

N3

� �), SNR and

A B C

D E F

Figure 3. ROC curves of different feature groups and different tools on three species. (A) Sequence-derived (Seq Group), EIIP-derived (EIIP Group), secondary structure-

derived (SS Group) and other six classic feature groups were evaluated on human data set B. All three feature categories we proposed were among the top five feature

groups. Logarithm-distance features outperformed other sequence-intrinsic features such as codon bias and hexamer score with the highest AUC. Six EIIP-based fea-

tures even had performance comparable to that of 64 codon bias features. (B) Nine feature groups were extracted from the training set of human data set B and were

used to build classifiers. Figure (B) shows the nine classifiers’ performances on mouse test set. All feature groups showed some fluctuations in performances, but se-

quence-derive features still achieved the best AUC. (C) Classifiers built on human data set B were evaluated with a test set of wheat data set. Compared with Figure 3

(A), AUC of codon bias features decreased about 18%, while AUC of hexamer score decreased about 30%. EIIP-based feature group surpassed codon bias features and

demonstrated its satisfactory cross-species performance. Sequence-derived features still obtained the best AUC. (D) LncFinder and other five tools, namely, CPC (offline

version), CPAT (re-trained model), CNCI, PLEK (re-trained model) and CPC2, were tested on human data set B. LncFinder and CPAT had the best AUC, but the accuracy

of CPAT was lower than that of LncFinder. (E) LncFinder and other five tools were tested on mouse data set. LncFinder achieved the best result. (F) LncFinder and other

five tools were tested on wheat data set. LncFinder achieved the best AUC on human and mouse data sets. Although the accuracy of CPC on human and mouse data

sets was inferior to that of other tools, CPC surpassed all alignment-free tools on wheat data set. LncFinder had the best performance among alignment-free tools. We

cannot know which tool is best for one specific species in advance. A tool that can present robust and stable results on multiple species is of crucial importance.

LncFinder had the most stable and reliable performances among these tools.

LncFinder | 9





Deleted Text: ,

Deleted Text: ,

Deleted Text: Since






Deleted Text: ,

quantile statistics (Q1, Q2, min and max values) wereselected as EIIP-based features (see Tables S2-7 for RFE re-sult). Six EIIP-based physicochemical features achieved anaccuracy of 0.8853 and an F-measure of 0.8872 on humandata set A.

To assess the performances and cross-species’ stabilities ofdifferent feature groups, three feature groups we designed andsix other classic alignment-free feature groups (codon bias, hex-amer score, Fickett TESTCODE score of CPAT and CPC2, pI andtranscript length) were evaluated on the following three species:human, mouse and wheat. All feature groups were used to buildSVM classifiers separately with the training set of human data setB. Then the classifiers built with human data set were used topredict the test set of human data set B and the test sets of mouseand wheat. Feature groups’ receiver operating characteristic(ROC) curve [70, 71] on three test sets were shown in Figure 3 (A),(B), and (C). Figure 3 (A) displays each feature group’s perform-ance on human species, while Figure 3 (B) and (C) presents theircross-species stabilities. From Figure 3 (A), it can be observed thatthe top five feature groups are Logarithm-distance features (Seqgroup), codon bias, EIIP-based features, hexamer score of CPATand multi-scale secondary structural features (SS group). Threefeature groups among the top five were extracted from sequence-intrinsic composition. The sequence-derived features wedesigned outperformed other sequence-intrinsic features such ascodon bias and hexamer score with the AUC (area under curve) of0.991. Six EIIP-based features had performance comparable tothat of 64 codon bias features. Secondary structural features sur-passed features’ transcript length, Fickett TESTCODE score and pIvalue with the AUC of 0.902, which demonstrates the discrimi-nating power of structural features. The Fickett TESTCODE scoremethods employed by CPAT and CPC have some minor differen-ces. CPAT calculates Fickett TESTCODE score on ORF region andobtains AUC 0.781. Figure 3 (B) and (C) shows the results of eachfeature groups on data sets of mouse and wheat. All featuregroups showed some fluctuations in their performances, butsequence-derived features still achieved the highest AUC.Sequence-derived features and EIIP-based features displayed bet-ter performances than other feature groups. Multi-scale second-ary structural features only had average cross-species results, butthis feature category was among the top five feature groups onhuman data set B. Based on a comprehensive evaluation of differ-ent feature groups, 19 critical features are selected fromsequence-intrinsic composition, multi-scale secondary structural

information and EIIP-based physicochemical property (seeTable 2). All three feature groups achieved an accuracy of 0.9684and an F-measure of 0.9682 on human data set A (see Table 3).Using LncFinder, users can extract various features and constructtheir own classifiers for different purposes.

Model validation

The results of five machine learning models are displayed inFigure S2-8. The parameters of different machine learning mod-els were tuned with 10-fold CV. The performances of eachmodel under different parameters are displayed in Table S3-16and Table S3-17.

The classifier based on SVM achieved the highest accuracy,0.9687, while deep learning had the lowest, 0.9523. In fact, mostof the models’ accuracies ranged from 0.965 to 0.968. The differ-ence of performances between the SVM model and the randomforest model was even negligible: the accuracy of the randomforest model was 0.9681. The stable results of different classi-fiers reflect that the critical features we designed are of a highstandard and classifier-neutral. SVM had the best accuracy, andrandom forest achieved the best F-measure. In this experiment,we selected SVM to build the classifier. But researchers can alsouse LncFinder to construct models with other machine learningalgorithms. The detailed procedures and results of feature se-lection and model validation are included in the Result Sectionin Supplementary File 1. After evaluating the features andobtaining the SVM classifier, we obtain a novel lncRNA predict-or. In the next section, we will benchmark our predictor againstseveral widely used tools to further evaluate the discriminatingpower of different methods.

Evaluations by comparison with popular tools

In this section, our lncRNA identification method was bench-marked against CPC, CPAT, CNCI, PLEK and CPC2 on five spe-cies, namely, human (Homo sapiens), mouse (Mus musculus),wheat (Triticum aestivum), zebrafish (Danio rerio) and chicken(Gallus gallus). The novel lncRNA identification method is one ofthe main functions of LncFinder package, and here we useLncFinder to denote the method we developed. In our experi-ments, we used UniRef90 [72] as the protein reference databaseof CPC. Because CPAT and PLEK can be trained with users’sequences, the re-trained models were built with the data sets

Figure 4. Performances of different tools on human data set B. LncFinder had the best accuracy of 0.9728. CPC had a strong tendency to classify lncRNAs as protein-cod-

ing transcripts and thus having low accuracy of 0.8304 (web server). As an upgraded version, CPC2 presented accuracy of 0.9614. CPC2 was a big improvement on its

predecessor and also outperformed CNCI and PLEK that obtained accuracies of 0.9450 and 0.9274, respectively. CPAT (re-trained model) was inferior to only LncFinder

and obtained an accuracy of 0.9642. Even when secondary structure-derived features were excluded, LncFinder (Without.SS) can still surpass other tools with an accur-

acy of 0.9716.

10 | Han et al.


Deleted Text: Min

Deleted Text: ,

Deleted Text: Max

Deleted Text: other

Deleted Text: ,

Deleted Text: ,

Deleted Text: 5


Deleted Text: 5


Deleted Text: Area

Deleted Text: Under

Deleted Text: Curve

Deleted Text: a

Deleted Text: ,

Deleted Text: obtained

Deleted Text: 5


Deleted Text: ,





Deleted Text: Have

Deleted Text: evaluated

Deleted Text: obtained

Deleted Text: -

Deleted Text: ,

Deleted Text: ,

that are identical to the training sets of LncFinder. As suggestedin their documentations, the parameters of PLEK were tunedwith grid search, and the cut-off of CPAT was determined using10-fold CV. Both the new trained and pre-built models wereevaluated to have a comprehensive and fair comparison. CPC,CPAT and CPC2 provide a web server and a standalone version;both versions were tested in our experiments. The web server ofCPC2 presented the results that are identical to the standaloneversion. For CPC and CPAT, however, the results of the web ser-ver and standalone version showed some minor differences,which may result from different genome assemblies of thetraining set. Additionally, considering that the secondary struc-ture calculated by RNAfold may not present the actual struc-ture, LncFinder can be configured to predict lncRNAs using thestructural information provided by users or simply withoutusing the multi-scale secondary structural features. In ourevaluation, the structural features’ excluded version ofLncFinder was also benchmarked against other tools.

Performance evaluation of human data set BFigure 4 displays the performances of different tools on humandata set B. It can be noted that CPC had the best specificity (1.00of standalone version and 0.9988 of the web server). However,the accuracy (0.8318 of standalone version and 0.8304 of theweb server) was not that excellent owing to the low sensitivity(0.6636 of standalone version and 0.6620 of the web server). Asan alignment-based method, CPC is mainly designed to assessthe coding potential, and it is very useful to evaluate highly con-served protein-coding transcripts. Nevertheless, many lncRNAsoverlap protein-coding genes, which could make CPC incorrect-ly classify long non-coding transcripts as protein-codingsequences. As the upgraded version of CPC, CPC2 showed con-siderable improvement (accuracy, 0.9614; F-measure, 0.9610) onits predecessor. Compared with CPC, CPC2 achieved much bet-ter sensitivity and thus much better accuracy. CPAT had rela-tively high accuracy and F-measure on the web server, 0.9654and 0.9653, respectively. CNCI surpassed PLEK (accuracy of pre-build model, 0.9410) with an accuracy of 0.9450. Because the de-fault models were trained with a large scale of sequences,which may have some overlaps with our test sets, CPAT andPLEK were evaluated with the re-trained models as well. The ac-curacy of CPAT (re-trained model) was 0.9642 and the accuracyof PLEK was 0.9274. LncFinder achieved the highest accuracyand F-measure, 0.9728 and 0.9726, respectively. The high

accuracy and F-measure imply LncFinder is provided with a bet-ter balance between precision and recall.

Even when secondary structure-derived features wereexcluded, LncFinder still outperformed other tools. Figure 3 (D)displays the ROC curves of CPC (offline version), CPAT (re-trained model), CNCI, PLEK (re-trained model), CPC2 andLncFinder on human data set B. Both LncFinder and CPAT hadthe best AUC, but the accuracy of CPAT was lower than that ofLncFinder. For detailed data of the evaluation on human dataset B, please refer to Table S3-18.

Performance evaluation of mouse data setWe additionally evaluated the performance of different tools onthe mouse data set because it is one of the most studied species.CPC predicts sequences largely depending on the reference dataset; thus, CPC can be applied to various species with one defaultmodel. According to the manuals, the default models of CNCIand PLEK are competent to predict sequences of other verte-brate species; CPC2 is a species-neutral classification tool thatcan be used for non-model organism transcriptomes. We, there-fore, compared CNCI, PLEK (default model), CPC2 withLncFinder (default model for human) to have a fair evaluation.CPAT is the only alignment-free tool that has the pre-builtmodel for mouse, and both the default model for mouse and there-trained model were included in our tests.

Figure 5 displays the performances of different tools on themouse data set. CPC still obtained the highest specificity (0.9883of standalone version), but the accuracy (0.8750 of standaloneversion) was affected by the low sensitivity (0.7617 of stand-alone version). CPC2, however, had a high sensitivity of 0.9289;because of its poor specificity of 0.7933, it could obtain an accur-acy of only 0.8611. CPAT with a re-trained model obtained an ac-curacy of 0.9242, while PLEK with re-trained model had anaccuracy of 0.8178. LncFinder achieved the best result with anaccuracy 0.9347 and an F-measure of 0.9360, which indicates itssatisfactory overall performance. Furthermore, LncFinder (with-out secondary structure-derived features) surpassed other toolswith an accuracy of 0.9286. When the model for human wasused, LncFinder achieved an accuracy 0.9186 and an F-measure0.9207, which still surpassed other tools’ default models(accuracy of PLEK, 0.8025; accuracy of CPC2, 0.8611; accuracy ofCNCI, 0.9133; accuracy of CPAT’s web server, 0.9161). Althoughusing the model for the human data set, LncFinder was only in-ferior to the re-trained model of CPAT. Figure 3 (E) displays the

Figure 5. Performances of different tools on mouse data set. LncFinder achieved the best accuracy of 0.9347, while PLEK (re-trained model) had an accuracy 0.8178. The

accuracy of CPAT (re-trained model) was 0.9242 and better than CNCI’s 0.9133, CPC’s 0.8678 (web server) and CPC2’s 0.8611. LncFinder (Without.SS) outperformed other

tools with an accuracy of 0.9286 even without secondary structure-derived features. When using the model for human, LncFinder outperforms CPC/CPC2, CPAT (web

server), CNCI and PLEK (re-trained model) with a satisfactory accuracy of 0.9186. Under this circumstance, LncFinder can even rival CPAT (re-trained model for mouse),

which demonstrates LncFinder’s robustness and high cross-species stability.

LncFinder | 11


Deleted Text: ,

Deleted Text: two

Deleted Text: assembly

Deleted Text: due

Deleted Text: with

Deleted Text: incorrectly

Deleted Text: :

Deleted Text: ,

Deleted Text: :

Deleted Text: :

Deleted Text: excluded

Deleted Text: ,

Deleted Text: a


Deleted Text: depends

Deleted Text: ,

Deleted Text: which helps

Deleted Text: and

Deleted Text: is

Deleted Text: feasible

Deleted Text: of mouse

Deleted Text: though

Deleted Text: made

Deleted Text: only obtain

Deleted Text: the

Deleted Text: :

Deleted Text: ,

Deleted Text: the

Deleted Text: :

Deleted Text: ,

Deleted Text: the

Deleted Text: :

Deleted Text: ,

Deleted Text: the

Deleted Text: :

Deleted Text: Though

ROC curves of CPC (offline version), CPAT (re-trained model),CNCI, PLEK (re-trained model), CPC2 and LncFinder on themouse data set. LncFinder had the best AUC and presented asatisfactory trade-off between sensitivity and false-positive rate(FPR, 1-specificity). CNCI and PLEK had much higher FPR andlower AUC. The original data of this evaluation are listed inTable S3-19.

Performance evaluation of wheat data setWe further compared different tools on plant data set. The dataset was constructed with the sequences of wheat because of itssufficient lncRNA sequences. According to the manuals of CNCIand PLEK, both tools provide models for plant sequences predic-tion. Thus, their pre-built models for plants were included inour tests. Because CPAT has no model for plant species, we add-itionally compared CPAT with LncFinder by employing their de-fault models that are built with human data sets.

Figure 6 shows the performances of different tools on thewheat data set. CPC outperformed all the alignment-free toolswith the highest accuracy and F-measure, 0.9595 and 0.9585, re-spectively. Its successor, CPC2, nonetheless, had an accuracy of0.7870 and an F-measure of 0.7560. The accuracy of CNCI (de-fault model for plant) was 0.6158, and the accuracy of PLEK (de-fault model for plant) was 0.5275. Although CNCI and PLEKprovide models for plant, their results were not that favorable.Using the re-trained model, the accuracy of PLEK increasedfrom 0.5275 to 0.8773. The performance of CPAT (re-trained

model) was slightly inferior to that of PLEK (re-trained model)with an accuracy of 0.8743. LncFinder obtained the best per-formance among alignment-free tools with an accuracy of0.9283. LncFinder also presented satisfactory sensitivity andspecificity. From Figure 6, it can be seen that all the tools hadhigh specificity, but different tools had various sensitivity.When each tool’s default model was used for this test set,LncFinder (default model for human) had the best sensitivity of0.7000, while PLEK (default model for plant) only got 0.0550.Both LncFinder and CPAT were tested using their default mod-els for the human sequences. The performance of LncFinder(accuracy, 0.8190; F-measure, 0.7946) was much better than thatof CPAT (accuracy, 0.7145; F-measure, 0.6188). CPAT employs lo-gistic regression, and the best cutoffs of different species varyconsiderably. CPAT’s suggested cutoff for human is 0.364, whilethe best cutoff for mouse is 0.440. In our experiments, the opti-mal cutoff for wheat even reached 0.537. An inappropriate cut-off can lead to an inferior performance. Figure 3 (F) displays ROCcurves of CPC (offline version), CPAT (re-trained model), CNCI,PLEK (re-trained model), CPC2 and LncFinder on the wheat dataset. CPC surpassed all alignment-free tools on wheat, althoughit presented unsatisfactory results on human and mouse.Among alignment-free tools, LncFinder achieved the best AUCof 0.983. It is reasonable to assume that the poor performance ofCNCI can be ameliorated if CNCI can be re-trained with newdata sets. The original data of the evaluation of wheat are listedin Table S3-20.

Figure 6. Performances of different tools on wheat data set. Although CPC had inferior performances on human and mouse, it achieved the best accuracy on wheat.

CPC obtained an accuracy of 0.9595, but his alignment-free successor CPC2 only had an accuracy of 0.7870. The accuracies of CPAT (re-trained model) and PLEK (re-

trained model) were 0.8743 and 0.8773, respectively, while LncFinder obtained an accuracy of 0.9283. When default models were used, CPAT (model for human), CNCI

(default model for plants) and PLEK (default model for plants) had accuracies of 0.7145, 0.6158 and 0.5275, respectively. LncFinder (model for human) had an accuracy

of 0.8190. Although CNCI and PLEK provide default models for plants, the performances were substandard. LncFinder has the best performance among alignment-free

tools. Even using the model for human, LncFinder still outperformed CPC2, CNCI and the default models of CPAT and PLEK.

Table 4. Performances of different tools on zebrafish and chicken data sets

Methods Zebrafish (Danio rerio) Chicken (Gallus gallus)

Sensitivity Specificity Accuracy F-measure Kappa Sensitivity Specificity Accuracy F-measure Kappa

CPC 0.6728 NA NA NA NA 0.5784 0.9888 0.7836 0.7277 0.5671CPAT 0.8668 0.8660 0.8664 0.8663 0.7328 0.9189 0.9178 0.9183 0.9183 0.8366CNCI 0.8535 0.8728 0.8631 0.8618 0.7263 0.9128 0.9051 0.9089 0.9093 0.8179PLEK 0.8715 0.8255 0.8485 0.8519 0.6970 0.9346 0.9124 0.9235 0.9244 0.8740CPC2 0.8948 0.7835 0.8391 0.8476 0.6783 0.7650 0.9235 0.8443 0.8308 0.6885LncFinder 0.8815 0.8838 0.8826 0.8825 0.7653 0.9491 0.9321 0.9406 0.9411 0.8813

Bold numbers indicate the highest value. LncFinder has the best performance. In our test, CPC could not process the protein-coding transcripts of zebrafish; thus, only

the result of lncRNAs is obtained.

12 | Han et al.


Deleted Text: ,

Deleted Text: false


Deleted Text: two

Deleted Text: which

Deleted Text: Utilizing


Deleted Text: :

Deleted Text: ,

Deleted Text: :

Deleted Text: :

Deleted Text: ,

Deleted Text: :

Deleted Text: -

Deleted Text: -

Deleted Text: -

Deleted Text: -

Deleted Text: ,


Performance evaluation of zebrafish and chicken data setsWe finally evaluated the stabilities and performances of CPC,CPAT, CNCI, PLEK, CPC2 and LncFinder on zebrafish and chickendata sets. The results are displayed in Table 4.

Because CPC needs to align the sequences against the refer-ence database and CNCI has to calculate the most-like CDS(MLCDS), these two tools have strict requirements for sequencequality. For the sequences containing some non-nucleotidecharacters (such as ‘X’), which are very common for somepoorly explored species, CPC may throw an error and stop thecomputation, and CNCI may omit these sequences automatical-ly. In this test, tools CPAT, PLEK and LncFinder functioned nor-mally, but CPC could not identify the protein-coding transcriptsof zebrafish. Thus, only the result of lncRNAs was obtained. Wealso noticed that CNCI omitted 7 lncRNAs and 6 protein-codingtranscripts of chicken and 13 protein-coding transcripts ofzebrafish automatically.

From Table 4, it can be observed that LncFinder outper-formed other tools with the highest accuracy and F-measure.The tool CPC had the best performance on wheat, but the sensi-tivity of CPC was much lower on human, mouse, zebrafish andchicken than the sensitivity of other tools; therefore, CPC hadlow accuracy. CPC2 had much better overall performance thanCPC. For the zebrafish data set, CPAT achieved an accuracy of0.8664 and was better than CNCI, PLEK and CPC2. But LncFindersurpassed CPAT with an accuracy of 0.8826. PLEK performedbetter than CPC, CPAT, CNCI and CPC2 on the chicken data setand had an accuracy of 0.9235. But LncFinder obtained about1.7% higher accuracy than PLEK. According to the results of thefive species, LncFinder displayed the most stable and satisfac-tory performance. The robustness and fault-tolerance capabilitymake LncFinder a valuable and practical lncRNA identificationtool for multiple species, especially for those poorly exploredspecies.

Evaluation of computational speedThe running times of six tools were evaluated on the same plat-form. We here avoid using large servers for computationalspeed evaluation. An average hardware environment can assesseach tool’s efficiency and usability much clearly. The platformconfigurations are Intel[textregistered] CoreTM i7-2600 processor@ 3.40 GHz, 8 GB memory and 64 bits Linux OS. Human data setB, which contains 2500 long non-coding transcripts and 2500protein-coding transcripts, was used to evaluate six tools. CPC2used 8.87 seconds to complete the prediction, while CPC needed4675.45 min to complete the process of alignment and identifi-cation. CPAT was only slightly inferior to CPC2 and used 9.05 sto identify 5000 sequences. With the help of parallel computing,it took CNCI, PLEK and LncFinder 1333.19 s, 83.67 s and 56.01 s,respectively to complete the identification. If predicting sequen-ces without using secondary structure features, it tookLncFinder 35.76 s to finish the process. LncFinder is more effi-cient than CPC, CNCI and PLEK. Although slower than CPC2 andCPAT, LncFinder can still predict several thousand sequenceswithin 1 min and present more reliable results. For detaileddata, please refer to Table S2-8.

Discussion

In this study, we reviewed several widely used lncRNA identifi-cation tools and their features. Numerous alignment-free fea-ture groups, such as codon bias, Fickett TESTCODE score and pIwere evaluated on three data sets to assess their performancesand cross-species stabilities. Additionally, we also

comprehensively explored the following three feature catego-ries: sequence-intrinsic composition, multi-scale secondarystructural information and physicochemical features obtainedfrom EIIP and FFT. Based on the feature selection process, 19heterologous features were extracted. We incorporated the 19features into the following five popular machine learning algo-rithms: logistic regression, SVM, random forest, ELM and deeplearning to validate the heterologous features we designed aswell as assess the effect of different machine learning algo-rithms on lncRNA prediction. The stable performances of differ-ent classifiers indicated that the features are critical andreliable. According to the experiments’ results, we proposed anovel lncRNA identification method. Benchmarked against sev-eral state-of-the-art tools, our method displayed more accurateand stable performances on multiple species with acceptabletime costs. An integrated package LncFinder is finally estab-lished to facilitate the research on lncRNA. Various classic fea-tures as well as features we designed can be extracted withLncFinder. Users can use LncFinder to build the predictor withother feature groups or machine learning algorithms. As a one-stop package for lncRNA identification and analysis, LncFindercan effectively and efficiently complete the main steps of pre-dictor construction including feature extraction, feature selec-tion, model construct and performance evaluation. LncFinderwas released as R package. To maximize its availability, a webserver was also developed for lncRNA prediction.

Euclidean-/Logarithm-distance, two new measurements,were designed to capture the sequence-intrinsic composition.Compared with other sequence-derived features, Logarithm-distance can achieve high accuracy as well as simplify the fea-tures markedly. Our designed multi-scale structural featurescapture structural information at different resolution levels byintegrating sequence composition with MFE and structuralsequences. EIIP-derived features based on FFT can provide an-other view from the prospect of physicochemical property. Thesequence-derived features are based upon linguistic meaning,whereas the features extracted from the secondary structureand EIIP can be further interpreted as semantic annotations,which implies higher-level information of biological functions.

According to our experiments, features of Logarithm-distance of hexamer on ORF region performed satisfactorilywith an accuracy of 0.9598, and the accuracy of all features fromthree categories combined was 0.9687 with parameter tuning. Itseems that the improvement of secondary structural and EIIP-derived features was trivial. However, six EIIP-derived featuresachieved an accuracy of 0.8853, and eight secondary structure-related features obtained an accuracy of 0.8525. In contrast, theaccuracy of the hexamer score on ORF region, the most discrim-inating feature of CPAT, was only 0.8458. From Figure 3, second-ary structural and EIIP-derived features outperformed featuresof Fickett TESTCODE score, transcript length and pI value. Theperformance of EIIP-derived features was even better than thatof tool CNCI [see Figure 3 (A) and (D)]. The performances of thesecondary structure and EIIP-based features are not far inferiorto those of sequence-derived features, but sequence-derivedfeatures have achieved fairly high accuracy, thus leaving lim-ited room for other features to enhance. Nineteen features fromthese three categories were used to build our method. The sec-ondary structure calculated by RNAfold may not completely re-flect the actual structural information of one sequence.Therefore, LncFinder can predict lncRNA with sequence-derivedfeatures and EIIP features only.

Five widely used machine learning algorithms, namely, lo-gistic regression, SVM, random forest, ELM and deep learning

LncFinder | 13


Deleted Text: ,

Deleted Text: Since



Deleted Text: -

Deleted Text: ,

Deleted Text: seven

Deleted Text: six

Deleted Text: ;

Deleted Text: Tool

Deleted Text: ,

Deleted Text: was much lower

Deleted Text: ,

Deleted Text: which made

Deleted Text: s

Deleted Text: ,

Deleted Text: displayed the

Deleted Text: performance better

Deleted Text: ,

Deleted Text: with the

Deleted Text: percent

Deleted Text: that of

Deleted Text: s

Deleted Text: -

Deleted Text: ,

Deleted Text: ,

Deleted Text: ,

Deleted Text: utes

Deleted Text: econds

Deleted Text: ,

Deleted Text: ,


Deleted Text: econds,



Deleted Text: ,

Deleted Text: one

Deleted Text: ute


Deleted Text: -

Deleted Text: ,


Deleted Text: ,

Deleted Text: the

Deleted Text: ,





Deleted Text: presented a

Deleted Text: performance

Deleted Text: just

Deleted Text: ,

Deleted Text: (

Deleted Text: 19


Deleted Text: -

Deleted Text: ,

were compared to determine how much will different machinelearning algorithms affect the performance of lncRNA identifi-cation using the features we designed. Deep learning in our testhad the lowest accuracy of 0.9523, while SVM obtained 0.9687.Because there were only 19 features used to build classifiers, itmay be unnecessary to employ deep learning for such a smallscale of features. It is also worth mentioning that many of thespecies have only limited lncRNA sequences. The insufficienttraining data may lead to overfitting of the deep learning model.Also, deep learning requires tuning of many parameters, whichrequires a much longer time than other models to perform par-ameter tuning and obtain the optimal model. Only minor dis-tinctions existed among logistic regression, SVM, random forestand ELM. The difference in accuracy between the SVM modeland the random forest model was merely 0.0006, which sug-gests that these 19 features are very robust, and the fundamen-tal features are of crucial importance in lncRNA identification.In our experiments, SVM displayed the highest accuracy andF-measure; random forest presented the best AUC; logistic re-gression is fast and easy to build. Our lncRNA identificationmethod is developed using SVM not only because SVM achievedthe highest accuracy but also for its small size and convenientapplication. If we apply random forest algorithm, the size of thefinal package will be about 25 times as large as that of the cur-rent version. As to logistic regression, the best cutoffs of differ-ent species may vary widely, which may produce an adverseeffect on the tool’s generalization ability. Using LncFinder, userscan construct different classifiers with various machine learn-ing algorithms.

We further compared our method (denoted by LncFinder)with five popular lncRNA identification tools, namely CPC,CPAT, CNCI, PLEK and CPC2. These five tools are selected be-cause they are typical and considered state of the art. CPC is aclassic alignment-based tool, whereas the other four tools arealignment-free. CPAT and PLEK can be re-trained by new datasets, which can also present a comprehensive comparison.Because results of BLASTX largely depend on the protein refer-ence database and play an important role in CPC prediction,CPC does not have to train several models for different speciesas long as the reference database is large and comprehensiveenough. Nonetheless, CPC requires about 90 GB of free space forstoring the reference database of NCBI or more than 20 GB forthe database of UniRef90. Additionally, CPC needs a lot of timeto complete the process of alignment, which makes CPC lessefficient than other alignment-free tools. For human and mousedata sets, CPC had the highest specificity but the lowest sensi-tivity. This imbalanced performance has led to unsatisfactoryaccuracy. CPC2 predicted lncRNAs with sequence-intrinsic fea-tures alone and had the result much better than CPC on thehuman data set. However, the performance of CPC2 was slightlylower than that of CPC on the mouse data set. For otheralignment-free identification tools, CNCI and PLEK (pre-builtmodel) had comparable results. The accuracy of CPAT washigher than that of CPC, CNCI and PLEK, but lower than that ofLncFinder. LncFinder achieved the best performances onhuman and mouse data sets, even when the secondarystructure-derived features were excluded.

As to plant species, we observed some intriguing phenom-ena from each tool’s performance on wheat data set. CPCobtained the best result on wheat data set, despite its lower sen-sitivity and accuracy on human and mouse data sets. CNCI andPLEK, though provide models that can be used to predictlncRNA of plant, their performances on wheat were hardly ac-ceptable. One possible explanation is that there are fewer

similarities between protein-coding transcripts and lncRNAs inwheat than in human. For instance, 38.44% (961/2500) oflncRNAs from human test set B has BLASTX HITS but only1.95% (39/2000) of lncRNAs from wheat test set has BLASTXHITS. Consequently, CPC finds fewer HITS in lncRNAs in wheat,and thus avoids classifying lncRNAs as mRNA and has low sen-sitivity. Moreover, the nucleotide usages of different plants maybe less conserved than those of vertebrates. Thus, the toolsgreatly depending on nucleotide composition features, such asCNCI and PLEK, displayed poor results when the species of testset largely differs from the species of their pre-built models’training set. Gene structure is closely related to evolutionarychanges and protein functionality. The differences in genestructure may also affect the classifier’s performance.Compared with the popular alignment-free tools, LncFinder dis-played the best accuracy and F-measure. LncFinder and CPATavoid using every nucleotide composition frequency to con-struct the model, which helps cushion the effect of various in-trinsic compositions of different species. Unlike hexamer scorethat only uses the hexamer frequencies of reference data set,LncFinder also considers the frequencies of unevaluatedsequences, thus showing more stable performances than CPAT.According to our evaluation, we can find that it is essential tobuild new models for different species, but only CPAT, PLEK andLncFinder support model re-training. For some poorly exploredspecies, limited lncRNA may be not sufficient to train a newmodel, and we need to employ models trained on other species.In that case, CPAT may present inadequate results owing to itswide range of cutoffs for different species. But LncFinder’s de-fault model (trained with human data set) showed more reliablecross-species performances than the default models of CPC,CPAT, CNCI, PLEK and CPC2.

The computational time of each tool was also evaluated.LncFinder is more efficient than CPC, CNCI and PLEK. CPC is theslowest owing to the process of alignment. CNCI is less efficientthan other alignment-free tools mainly because it takes moretime to find the MLCDS region. PLEK employs 1364 features thatslow the prediction and make the process of model re-trainingextremely time-consuming. CPAT and CPC2 are faster thanLncFinder mainly because (1) the source codes of CPAT andCPC2 were implemented in C and Python, which are faster thanR, and (2) CPAT used logistic regression to build machine learn-ing model, which is faster than SVM. Nonetheless, LncFinder isqualified to predict lncRNA at a large-scale level with an accept-able time cost.

In this study, we comprehensively reviewed and evaluateddifferent lncRNA identification features and tools. And we alsodeveloped a valuable and user-friendly package LncFinder.However, there remain some tough challenges. The sequencecompositions of different species showed varying degrees of dif-ferences, which entails intrinsic composition-based tools sup-porting model re-training. Nonetheless, the further question isthat we cannot build models for all species. Hence, more criticalfeatures that can be applied to multiple species needed to beexplored, especially for plants. Furthermore, the performancesof each tool vary from species to species, and it is practically im-possible to know in advance which tool can achieve the highestaccuracy on a specific species. Thus, a tool with stable and sat-isfactory results on multiple species is highly essential forlncRNA research.

In this study, 19 critical features are obtained from featureselection and 10-fold CV, which could reveal some valuabledistinctions between lncRNAs and mRNAs from the perspec-tive of sequence-intrinsic composition, secondary structure

14 | Han et al.


Deleted Text: lots

Deleted Text: only

Deleted Text: has

Deleted Text: to be tuned

Deleted Text: ,

Deleted Text: ,

Deleted Text: -

Deleted Text: ,

Deleted Text: -

Deleted Text: -

Deleted Text: -


Deleted Text: big

Deleted Text: efficiency

Deleted Text: datasets of

Deleted Text: an


Deleted Text: ,

Deleted Text: dataset

Deleted Text: dataset

Deleted Text: ;

Deleted Text: wheat&hx2019;s

Deleted Text: which

Deleted Text: having

Deleted Text: a

Deleted Text: to

Deleted Text: and

Deleted Text: ,

Deleted Text: -

Deleted Text: due

Deleted Text: -

Deleted Text: ,

Deleted Text: due

Deleted Text: ,

Deleted Text: which

Deleted Text: ;


Deleted Text: still

Deleted Text: remain

Deleted Text: that


Deleted Text: ,

and EIIP-based physicochemical property. These features areexpected to play positive roles in other lncRNA-related re-search, such as interaction, annotation and evolution. As anintegrated lncRNA identification platform, LncFinder can fa-cilitate relevant research and provide scientists with usefulinformation.

Application of LncFinder

Functions of LncFinder are not limited to lncRNA identification.The stand-alone version of LncFinder is a one-stop package forfeature extraction, feature selection, model validation, classifierconstruction and performance evaluation. LncFinder’s lncRNAidentification algorithm is developed based on the optimal fea-ture combination and the most appropriate classifiers. A webserver is provided for lncRNA identification to make LncFinder ahighly flexible and remarkably user-friendly tool.

R package of LncFinder

R package of LncFinder has been included in CRAN. Users cansimply install LncFinder by entering the command‘install.packages(“LncFinder”)’ in R, and an appropriate versionwill be installed automatically. Package and reference manualcan also be downloaded from CRAN (stable version): https://CRAN.R-project.org/package¼LncFinder or GitHub (Dev version):https://github.com/HAN-Siyu/LncFinder.

The stand-alone version of LncFinder provides a batch of prac-tical functions to facilitate lncRNA identification and analysis. (1)LncFinder provides a novel lncRNA identification method. Modelsfor multiple species are provided. Two modes can be selected toidentify lncRNA with or without using secondary structure-derived features. Secondary structure sequences can be loadedfrom external files, in case users should have structural dataobtained from experiments or other reliable sources. (2) LncFindercan be used to build new machine learning classifiers. Featuresand classifiers can all be customized, which helps users constructmodels with various feature groups or machine learning algo-rithms. LncFinder can extract various alignment-free featuressuch as GC content, k-mer frequencies, hexamer score, FickettTESTCODE score, length and coverage of ORF, Euclidean-/Logarithm-distance of k-mer frequencies, EIIP-derived features,multi-scale secondary structural features and pI value. Machinelearning algorithms such as logistic regression, SVM, random for-est can be employed to construct models with parameter tuning.(3) Machine learning-related functions such as feature selection,k-fold CV and parameter tuning are also included in LncFinder tohelp users select the optimal feature combination and machinelearning algorithm. The functions, descriptions and options ofLncFinder R package have been briefly summarized in Table 5.Please refer to Supplementary File 2—R Package andSupplementary File 4—R Package Manual for examples anddetailed information. The Documentation of LncFinder is gener-ated with R package” roxygen2” [73].

Table 5. Functions and descriptions of LncFinder R package

Function Description Option

Functions for classic features extractioncompute_EIIP() Compute EIIP-derived features 1. spectrum.percent: set the percentage of the sorted power

spectrum;2. quantile.probs: set the quantile interval

compute_EucDist() Compute Euclidean-distance features 1. k: set the sliding window size;2. step: set the sliding window step;3. on.ORF: calculate features on ORF region

compute_FickettScore() Compute Fickett TESTCODE Score on.ORF: calculate Fickett TESTCODE Score on ORF regioncompute_GC() Compute GC content on.ORF: calculate GC content on ORF regioncompute_hexamerScore() Compute hexamer score see compute_EucDist()compute_kmer() Compure k-mer features improved.mode: use the improved method proposed by PLEK;

other options see compute_EucDist()compute_LogDist() Compure Logarithm-distance see compute_EucDist()compute_pI() Compure isoelectric point 1. on.ORF: calculate isoelectric point on ORF region;

2. ambiguous.base: take ambiguous bases into accountfind_orfs() Find ORFs reverse.strand: find ORFs on the reverse strand

Functions for LncRNA identification and new classifier constructionlnc_finder() Identify lncRNAs using LncFinder 1. svm.model: select species, such as human, mouse and

wheat;2. SS.features: use multi-scale secondary structure features

build_model() Build new model using LncFinder see lnc_finder()extract_features() Extract features proposed by LncFinder SS.features: extract multi-scale secondary structure featuresread_SS() Load external secondary structure informationrun_RNAfold() Run RNAfold and capture the resultssvm_cv() Perform cross validation for SVM model 1. folds.num: set the number of folds for cross-validation;

2. seed: set the seed for random number generation; otherparameters for SVM model training

svm_tune() Tune SVM model see svm_cv()

This table briefly summaries the main functions of LncFinder R package. All functions and descriptions are based on LncFinder R Package (version 1.1.2). The package

will be updated regarlarly. Refer to Supplementary File 4 - R package Manual for detailed descriptions and examples.

LncFinder | 15


Deleted Text: ,

Deleted Text: ,



Deleted Text: &hx201D;)&hx201D;

Deleted Text: Stable




https://github.com/HAN-Siyu/LncFinder

Deleted Text: contend


Deleted Text: ,

Deleted Text: , and so forth

Deleted Text: ,

Deleted Text: ,




Web server of LncFinder

The web interface of LncFinder is developed to suit the conveni-ence of the users. This web server is available at http://bmbl.sdstate.edu/lncfinder/. A backup server is also established,which can be accessed via http://csbl.bmb.uga.edu/mirrors/JLU/lncfinder/. Figure 7 is a screenshot of LncFinder’s web server.

The web server provides the following three functional mod-ules: (1) lncRNA identification for multiple species; (2) down-loads of multi-species models, data sets and secondarystructural sequences; and (3) an instructive summary oflncRNA-related tools, databases and news.

The web server of LncFinder supports sequences in FASTA for-mat as input. Users can input sequences in the text area or just

Figure 7. Screenshot of LncFinder’s web server.

16 | Han et al.




http://csbl.bmb.uga.edu/mirrors/JLU/lncfinder/

http://csbl.bmb.uga.edu/mirrors/JLU/lncfinder/

Deleted Text: ,

Deleted Text: ,

upload a FASTA file. Users can also select to identify lncRNAs withor without multi-scale secondary structural features. Now five spe-cies, namely, human, mouse, chicken, zebrafish and wheat, areavailable on our web server. The results will be displayed after theidentification is complete. The original results and structure-related sequences can be exported and downloaded. Each predic-tion task will be assigned a Job ID, and users can use the Job ID todownload previous results and secondary structure-derivedsequences. Moreover, additional models for other species can bedownloaded for local use. The web server also provides an inform-ative summary for users’ convenience, which includes the updatedinformation on lncRNA prediction tools, various kinds of databasesand lncRNA research progress. The summaries will be updatedregularly. See Supplementary File 3 - Web Server for detailedinformation.

Supplementary Data

Supplementary data are available online at https://academic.oup.com/bib.

Acknowledgements

The authors are grateful to Li YL. (Hokkaido University), SunY. (Beijing Normal University) and Wang RY. (NihonUniversity) for assisting with data collection; to Cheng M.,Fan LR., Guo Y. and Tao MX. from Jilin University for the testsof R package and web server. The authors also thank the edi-tor and anonymous reviewers for handling this manuscript.

Funding

This work was supported by the National Natural ScienceFoundation of China (61472158, 61402194, and 71774154),Natural Science Foundation of Jilin Province (20180101331JC,20170520063JH and 20180101050JC), Zhuhai Premier-Discipline Enhancement Scheme, Guangdong Premier Key-Discipline Enhancement Scheme and Graduate InnovationFund of Jilin University (2017124).

Key Points

• Many classic features were reviewed and discussed.Features from the following three categories:Euclidean-/Logarithm-distance of hexamer, multi-scalesecondary structural information and EIIP-based physi-cochemical property were also explored to enhance theaccuracy of lncRNA prediction.

• Based on these three feature categories, a novel lncRNAidentification algorithm was developed with the com-prehensive processes of feature selection and modelvalidation. Benchmarked against several state-of-the-art methods, our algorithm presented the most robustand satisfactory performance on multiple species.

• An integrated platform LncFinder was developed to fa-cilitate lncRNA identification and analysis. LncFindercan effectively perform feature extraction, feature selec-tion, classifier construction and performance evalu-ation. Our lncRNA identification algorithm was includedin LncFinder as well.

• Released as R package and web server, LncFinder canbe run on multiple OS platforms. LncFinder is a flexible

and useful tool for coding/non-coding sequence predic-tion, coding potential assessment, lncRNA propertyanalysis, machine learning model construction and per-formance evaluation.

• It is necessary for tools to support model retraining,which could significantly improve their performance.More critical features need to be designed to developrobust and species-neutral tools.

References1. Harrow J, Frankish A, Gonzalez JM, et al. GENCODE: the refer-

ence human genome annotation for the ENCODE project.Genome Res 2012;22(9):1760–74.

2. Derrien T, Johnson R, Bussotti G, et al. The GENCODE v7 cata-logue of human long non-coding RNAs: analysis of theirstructure, evolution and expression. Genome Res 2012;22(9):1775–89.

3. Guttman M, Russell P, Ingolia NT, et al. Ribosome profilingprovides evidence that large noncoding RNAs do not encodeproteins. Cell 2013;154(1):240–51.

4. Djebali S, Davis CA, Merkel A, et al. Landscape of transcriptionin human cells. Nature 2012;489(7414):101–8.

5. Pennisi E. Genomics. Encode project writes eulogy for junkDNA. Science 2012;337(6099):1159, 1161.

6. Yang Q, Zhang S, Liu H, et al. Oncogenic role of long noncod-ing RNA AF118081 in anti-benzo[a]pyrene-trans-7, 8-dihydro-diol-9, 10-epoxide-transformed 16HBE cells. Toxicol Lett 2014;229(3):430–9.

7. Bhartiya D, Kapoor S, Jalali S, et al. Conceptual approaches forlncRNA drug discovery and future strategies. Expert Opin DrugDiscov 2012;7(6):503–13.

8. Rinn JL, Chang HY. Genome regulation by long noncodingRNAs. Ann Rev Biochem 2012;81:145–66.

9. Lu Q, Ren S, Lu M, et al. Computational prediction of associa-tions between long non-coding RNAs and proteins. BMCGenomics 2013;14:651.

10.da Rocha ST, Boeva V, Escamilla-Del-Arenal M, et al. Jarid2 isimplicated in the initial xist-induced targeting of PRC2 to theinactive X chromosome. Mol Cell 2014;53(2):301–16.

11.O’Leary VB, Ovsepian SV, Carrascosa LG, et al. PARTICLE, atriplex-forming long ncRNA, regulates locus-specific methy-lation in response to low-dose irradiation. Cell Rep 2015;11(3):474–85.

12.Zhang Y, Tao Y, Liao Q. Long noncoding RNA: a crosslink inbiological regulatory network. Brief Bioinform 2017, in press.doi: 10.1093/bib/bbx042.

13.Chen X, Yan CC, Zhang X, You ZH. Long non-coding RNAsand complex diseases: from experimental results to compu-tational models. Brief Bioinform 2017;18(4):558–76.

14.Shi X, Sun M, Liu H, et al. A critical role for the long non-codingRNA GAS5 in proliferation and apoptosis in non-small-celllung cancer. Mol Carcinog 2015;54(Suppl 1):E1–12.

15.Ng SY, Lin L, Soh BS, et al. Long noncoding RNAs in develop-ment and disease of the central nervous system. Trends Genet2013;29(8):461–8.

16.Congrains A, Kamide K, Oguro R, et al. Genetic variants at the9p21 locus contribute to atherosclerosis through modulationof ANRIL and CDKN2A/B. Atherosclerosis 2012;220(2):449–55.

17.Chen G, Wang Z, Wang D, et al. LncRNADisease: a databasefor long-non-coding RNA-associated diseases. Nucleic AcidsRes 2013;41(Database issue):D983–6.

LncFinder | 17


Deleted Text: ,

Deleted Text: d

Deleted Text: ,



https://academic.oup.com/bib

https://academic.oup.com/bib


Deleted Text: ,

Deleted Text: ,

Deleted Text: ,

18.Ning S, Zhang J, Wang P, et al. Lnc2Cancer: a manually curateddatabase of experimentally supported lncRNAs associatedwith various human cancers. Nucleic Acids Res 2016;44(D1):D980–5.

19.Xu J, Bai J, Zhang X, et al. A comprehensive overview of lncRNAannotation resources. Brief Bioinform 2016;18(2):236–49.

20.Yotsukura S, duVerle D, Hancock T, et al. Computational rec-ognition for long non-coding RNA (lncRNA): software anddatabases. Brief Bioinform 2017;18(1):9–27.

21.Kong L, Zhang Y, Ye ZQ, et al. CPC: assess the protein-codingpotential of transcripts using sequence features and sup-port vector machine. Nucleic Acids Res 2007;35(Suppl 2):W345–9.

22.Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST andPSI-BLAST: a new generation of protein database search pro-grams. Nucleic Acids Res 1997;25(17):3389–402.

23.Ulitsky I, Bartel DP. lincRNAs: genomics, evolution, andmechanisms. Cell 2013;154(1):26–46.

24.Wang L, Park HJ, Dasari S, et al. CPAT: coding-potential as-sessment tool using an alignment-free logistic regressionmodel. Nucleic Acids Res 2013;41(6):e74.

25.Lin MF, Jungreis I, Kellis M, et al. PhyloCSF: a comparative gen-omics method to distinguish protein coding and non-codingregions. Bioinformatics 2011;27(13):i275–82.

26.Hu L, Xu Z, Hu B, Lu ZJ. COME: a robust coding potential calcu-lation tool for lncRNA identification and characterizationbased on multiple features. Nucleic Acids Res 2017;45(1):e2.

27.Siepel A, Bejerano G, Pedersen JS, et al. Evolutionarily con-served elements in vertebrate, insect, worm, and yeastgenomes. Genome Res 2005;15(8):1034–50.

28.Achawanantakun R, Chen J, Sun Y, et al. LncRNA-ID: longnon-coding RNA IDentification using balanced random for-ests. Bioinformatics 2015;31(24):3897–905.

29.Sun L, Liu H, Zhang L, et al. lncRScan-SVM: a tool for predict-ing long non-coding RNAs using support vector machine.PLoS One 2015;10(10):e0139654.

30.Finn RD, Clements J, Eddy SR, et al. HMMER web server: inter-active sequence similarity searching. Nucleic Acids Res 2011;39:W29–37.

31.Sun L, Luo H, Bu D, et al. Utilizing sequence intrinsic compos-ition to classify protein-coding and long non-coding tran-scripts. Nucleic Acids Res 2013;41(17):e166.

32.Li A, Zhang J, Zhou Z, et al. PLEK: a tool for predicting longnon-coding RNAs and messenger RNAs based on animproved k-mer scheme. BMC Bioinformatics 2014;15:311.

33.Fickett JW. Recognition of protein coding regions in DNAsequences. Nucleic Acids Res 1982;10(17):5303–18.

34.Fickett JW, Tung CSS. Assessment of protein coding meas-ures. Nucleic Acids Res 1992;20(24):6441–50.

35.Kang YJ, Yang DC, Kong L, et al. CPC2: a fast and accurate cod-ing potential calculator based on sequence intrinsic features.Nucleic Acids Res 2017;45(W1):W12–6.

36.Bjellqvist B, Hughes GJ, Pasquali C, et al. The focusing posi-tions of polypeptides in immobilized pH gradients can be pre-dicted from their amino acid sequences. Electrophoresis 1993;14(10):1023–31.

37.Bjellqvist B, Basse B, Olsen E, et al. Reference points for com-parisons of two-dimensional maps of proteins from differenthuman cell types defined in a pH scale where isoelectricpoints correlate with polypeptide compositions.Electrophoresis 1994;15(1):529–39.

38.Tripathi R, Patel S, Kumari V, et al. DeepLNC, a long non-coding RNA prediction tool using deep neural network. NetwModel Anal Health Inform Bioinforma 2016;5(1):21.

39.Zhao J, Song X, Wang K, et al. lncScore: alignment-free identi-fication of long noncoding RNA from assembled novel tran-scripts. Sci Rep 2016;6(1):34838.

40.Wucher V, Legeai F, Hedan B, et al. FEELnc: a tool for long non-coding RNAs annotation and its application to the dog tran-scriptome. Nucleic Acids Res 2017;45(8):e57.

41.Han S, Liang Y, Li Y, et al. Long noncoding RNA identification:comparing machine learning based tools for long noncodingtranscripts discrimination. Biomed Res Int 2016;2016:8496165.

42.Kozak M. Recognition of AUG and alternative initiator codonsis augmented by G in position þ4 but is not generally affectedby the nucleotides in positions þ5 and þ6. EMBO J 1997;16(9):2482–92.

43.Kozak M. Initiation of translation in prokaryotes and eukar-yotes. Gene 1999;234(2):187–208.

44. Ingolia NT, Lareau LF, Weissman JS, et al. Ribosome profilingof mouse embryonic stem cells reveals the complexity anddynamics of mammalian proteomes. Cell 2011;147(4):789–802.

45.Kent WJ, Sugnet CW, Furey TS, et al. The human genomebrowser at UCSC. Genome Res 2002;12(6):996–1006.

46.Hu L, Di C, Kai M, et al. A common set of distinct features thatcharacterize noncoding RNAs across multiple species. NucleicAcids Res 2015;43(1):104–14.

47.Chen C, Liaw A, Breiman L. Using Random Forest to LearnImbalanced Data. Technical Report 1999, Department ofStatistics, UC Berkeley Andy, 2004.

48.Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA hom-ology searches. Bioinformatics 2013;29(22):2933–5.

49.Wang L, Feng Z, Wang X, et al. DEGseq: an R package for iden-tifying differentially expressed genes from RNA-seq data.Bioinformatics 2010;26(1):136–8.

50.Necsulea A, Soumillon M, Warnefors M, et al. The evolution oflncRNA repertoires and expression patterns in tetrapods.Nature 2014;505(7485):635–40.

51.Nair AS, Sreenadhan SP. A coding measure scheme employ-ing electron-ion interaction pseudopotential (EIIP).Bioinformation 2006;1(6):197–202.

52.Harrow J, Denoeud F, Frankish A, et al. GENCODE: producing areference annotation for ENCODE. Genome Biol 2006;7(Suppl1):S4.1–9.

53.Yates A, Akanni W, Amode MR, et al. Ensembl 2016. NucleicAcids Res 2016;44:D710–16.

54.Burge SW, Daub J, Eberhardt R, et al. Rfam 11.0: 10 years ofRNA families. Nucleic Acids Res 2013;41(D1):D226–32.

55.Mattei E, Ausiello G, Ferre F, et al. A novel approach to repre-sent and compare RNA secondary structures. Nucleic Acids Res2014;42(10):6146–57.

56.Lorenz R, Bernhart SH, Honer zu Siederdissen C, et al.ViennaRNA Package 2.0. Algorithms Mol Biol 2011;6(1):26.

57.Clark MB, Johnston RL, Inostroza-Ponta M, et al. Genome-wide analysis of long noncoding RNA stability. Genome Res2012;22(5):885–98.

58.Charif D, Lobry JR. SeqinR 1.0-2: a contributed package tothe R Project for statistical computing devoted to biologicalsequences retrieval and analysis. In: Bastolla U, Porto M,Roman HE, Vendruscolo M (eds), Structural Approaches toSequence Evolution: Molecules, Networks, Populations.Biological and Medical Physics, Biomedical Engineering.New York: Springer Verlag, 2007, 207–232.

59.Silverman BD, Linsker R. A measure of DNA periodicity. JTheor Biol 1986;118(3):295–300.

60.Tsonis AA, Elsner JB, Tsonis PA, et al. Periodicity in DNA cod-ing sequences: implications in gene evolution. J Theor Biol1991;151(3):323–31.

18 | Han et al.


61.Tiwari S, Ramachandran S, Bhattacharya A, et al. Prediction ofprobable genes by fourier analysis of genomic sequences.Comput Appl Biosci 1997;13(3):263–70.

62.Cohen J. A coefficient of agreement for nominal scales. EducPsychol Meas 1960;20(1):37–46.

63.Kuhn M. Building predictive models in R using the caret pack-age. J Stat Soft 2008;28(5):1–26.

64.Chang CC, Lin CJ. LIBSVM. A library for support vectormachines. ACM Trans Intell Syst Technol 2011;2(3):27.

65.Meyer D, Dimitriadou E, Hornik K et al. e1071: misc Functions of

the Department of Statistics, Probability Theory Group (Formerly: e

1071), TU Wien. R package version 1.6–8. 2017. https://CRAN.R-

project.org/package=e1071.66.Liaw A, Wiener M. Classification and Regression by

randomForest. R News 2002;2(3):18–22.67.Huang GB, Wang DH, Lan Y, et al. Extreme learning machines:

a survey. Int J Mach Learn Cybern 2011;2(2):107–22.

68.Gosso A. elmNN: Implementation of ELM (Extreme Learning

Machine) algorithm for SLFN (Single Hidden Layer

Feedforward Neural Networks). R package version 1.0. 2012.

https://CRAN.R-project.org/package=elmNN.69.H2O.ai. R Interface for H2O, R package version 3.10.0.8. 2016.

https://github.com/h2oai/h2o-3.70.Robin X, Turck N, Hainard A, et al. pROC: an open-source

package for R and Sþ to analyze and compare ROC curves.BMC Bioinformatics 2011;12:77.

71.Wickham H. ggplot2: Elegant Graphics for Data Analysis. NewYork: Springer-Verlag, 2009.

72.Suzek BE, Huang H, McGarvey P, et al. UniRef: comprehensiveand non-redundant UniProt reference clusters. Bioinformatics2007;23(10):1282–8.

73.Wickham H, Danenberg P, Eugster M; RStudio. roxygen2: In-Line Documentation for R. R package version 6.0.1. 2017. https://CRAN.R-project.org/package=roxygen2.

LncFinder | 19


https://CRAN.R-project.org/package=e1071

https://CRAN.R-project.org/package=e1071

https://CRAN.R-project.org/package=elmNN

https://github.com/h2oai/h2o-3

https://CRAN.R-project.org/package=roxygen2

https://CRAN.R-project.org/package=roxygen2

Documents

LncFinder: an integrated platform for long non-coding RNA