
Research Article

A Novel Feature Selection Strategy for Enhanced Biomedical Event Extraction Using the Turku System

Jingbo Xia,1,2 Alex Chengyu Fang,2,3 and Xing Zhang2,3

1 College of Science, Huazhong Agricultural University, Wuhan, Hubei 430070, China
2 Department of Chinese, Translation and Linguistics, City University of Hong Kong, Kowloon, Hong Kong
3 The Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong, Kowloon, Hong Kong

Correspondence should be addressed to Alex Chengyu Fang; acfang@cityu.edu.hk

Received 14 January 2014; Revised 22 February 2014; Accepted 3 March 2014; Published 6 April 2014

Academic Editor: Shigehiko Kanaya

Copyright © 2014 Jingbo Xia et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Feature selection is of paramount importance for text-mining classifiers with high-dimensional features. The Turku Event Extraction System (TEES) is the best performing tool in the GENIA BioNLP 2009/2011 shared tasks and relies heavily on high-dimensional features. This paper describes research which, based on an implementation of an accumulated effect evaluation (AEE) algorithm applying the greedy search strategy, analyses the contribution of every single feature class in TEES with a view to identifying important features and modifying the feature set accordingly. With the updated feature set, a new system is acquired with enhanced performance, achieving an increased F-score of 53.27%, up from 51.21%, for Task 1 under the strict evaluation criterion, and 57.24% according to the approximate span and recursive criterion.

1. Introduction

Knowledge discovery based on text mining technology has long been a challenging issue for both linguists and knowledge engineering scientists. The application of text mining technologies to large collections of known texts, such as the MEDLINE database, has become especially popular in the area of biological and medical information processing [1, 2]. However, data are accumulating at an ever increasing, astonishing speed [3]. The published literature grows exponentially, and huge amounts of scientific information, such as protein properties and gene functions, lie hidden in prohibitively large collections of text. For example, the literature in PubMed grows at a speed of two printed pages per second. As a result, it is practically impossible to manually curate experimental results from texts. In response to this information explosion, text mining and linguistic methods have been used to perform automatic named entity recognition (NER), including biomedical NER [4], the extraction of clinical narratives [5] and clinical trials [6], analysis of the similarity in gene ontology [7], and the prioritization of vital function genes [8].

While entity recognition has been exploited as a powerful approach to automatic NER retrieval, there has recently been an increased interest in finding more complex structural information and more abundant knowledge in documents [9]. Hence, as a more recent development moving beyond the purpose of entity recognition, the GENIA task in the BioNLP 2009/2011 shared tasks [10, 11] was set to identify and extract nine biomedical events from GENIA-based corpus texts [12]: gene expression, transcription, protein catabolism, localization, binding, phosphorylation, regulation, positive regulation, and negative regulation. The GENIA task is a publicly accepted task within the BioNLP communities and a leading pioneer in the area of structured event extraction.

Designed by Björne et al. [13], the Turku Event Extraction System (TEES) was the leading participant tool in both the BioNLP 2009 and 2011 shared tasks. Among the twenty-four participants of the BioNLP shared tasks, TEES ranked first place in the GENIA task in 2009. While Riedel et al. performed best (with an F-score of 56.0%) for the GENIA task in the 2011 ST [14], TEES was third for the GENIA task and ranked number one for four of the other eight subtasks. In 2011 and 2012, Björne et al. [15,

Hindawi Publishing Corporation, BioMed Research International, Volume 2014, Article ID 205239, 12 pages, http://dx.doi.org/10.1155/2014/205239


16] enhanced TEES, and the F-score was further increased from 52.86% to 53.15%. The system was kept updated, and its 1.01 version achieved an F-score of 55.65%. Meanwhile, TEES was also utilized for practical database searching: the system was applied to all PubMed citations, and a number of analyses of the extracted information were illustrated [17-19]. TEES is now available at http://bionlp.utu.fi. In 2011, the BioNLP shared task expanded from the GENIA task to eight generalized tasks, including the GENIA task, the epigenetics and posttranslational modification task, the infectious disease task, the bacteria biotope task, and the bacteria interaction task [11]. As the only system that participated in all the tasks in 2011, TEES was again ranked first place for four out of the eight tasks. Afterwards, Björne and Salakoski updated TEES 1.01 to TEES 2.0 in August 2012. The updated system is capable of handling the DDI'11 (drug-drug interaction extraction) challenge (http://jbjorne.github.io/TEES) in addition to all the BioNLP 2011 shared tasks. At the moment, the most up-to-date version is TEES 2.1, released in 2013 [20]; a major change is wider coverage to serve more XML-format corpora, while the core algorithm remains unchanged.

As a sophisticated text mining classifier, TEES uses enormous feature classes. For the BioNLP 2009 shared tasks, it produced 405,165 features for trigger detection and 477,840 features for edge detection out of the training data, which not only consume large amounts of processing time but also create undesirable noise that affects system performance. To address this issue and achieve even better system performance in terms of processing time and recognition accuracy, a natural starting point is to perform an in-depth examination of the system processes for feature selection (FS) and feature reduction (FR).

The integration of FS and FR has been demonstrated to hold good potential to enhance an existing system [21-24]. Feature reduction can delete redundant features, avoid overfitting, provide more efficient modelling, and help gain a deeper insight into the importance of features. It aims to obtain a subset through an optimization between the feature size and the performance [25]. Amongst the feature reduction algorithms, the greedy search algorithm, also called the hill climbing algorithm, is an optimization technique for the identification and selection of important features [26]. Van Landeghem et al. [27] introduced a feature selection strategy in the Turku system for a different testing data set and illustrated how feature selection could be applied to better understand the extraction framework. In their scheme, candidate events were trained to create a feature selector for each event type. The classifier was then built with the filtered training samples. By doing this, different trigger words for different event types were discriminated, and individual features with a higher occurrence frequency were assigned higher weights. Tag clouds of lexical vertex walks showed certain properties that are interesting from a linguistic point of view.

The purpose of this paper is to focus on the feature selection rules of TEES and to improve its performance for the GENIA task. By designing an objective function in the greedy search algorithm, we propose an accumulated effect evaluation (AEE) algorithm which is simple and effective and can be used to numerically evaluate the contribution of each feature separately with respect to its performance in the combined test. Moreover, we make further changes to the core feature classes by incorporating new features and merging redundant features in the system. The updated system was evaluated and found to produce a better performance in both Tasks 1 and 2. In Task 1, our system achieves a higher F-score of 53.27% according to the "strict evaluation" criterion and 57.24% according to the "approximate span and recursive" criterion. In Task 2, the new strategy achieved an F-score of 51.77% under the "strict evaluation" criterion and 55.79% under the "approximate span and recursive" criterion. These represent the best performance scores to date for event recognition and extraction.

2. Materials and Method

2.1. GENIA Task and Scheme of the Turku TEES System

2.1.1. GENIA Task and the Evaluation Criteria. Different from the previous NER task, the GENIA task aims to recognize both the entities and the event relationships between such entities. Extended from the idea of semantic networks, the recognition task includes the classification of entities (nodes) and their associated events (edges). The participants of the shared task were expected to identify nine events concerning given proteins, that is, gene expression, transcription, protein catabolism, localization, binding, phosphorylation, regulation, positive regulation, and negative regulation. The mandatory core task, Task 1, involves event trigger detection, event typing, and primary argument recognition [10]. An additional optional task, Task 2, involves the recognition of entities and the assignment of these entities. Finally, Task 3 targets the recognition of negation and speculation. Because of their fundamental importance and vast potential for future developments, researchers mainly focus on Tasks 1 and 2.

Like other text mining systems, the performance of TEES is evaluated by precision, recall, and F-score. The first measure equals the fraction of the retrieved documents that are relevant and represents the correctness of the extraction system; that is,

Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|. (1)

The second measure, recall, is defined as

Recall = |relevant documents ∩ retrieved documents| / |relevant documents|. (2)

Recall assesses the fraction of the documents relevant to the query that are successfully retrieved. Precision and recall are well-known performance measures in text mining, while the F-score is a third measure that combines


Figure 1: The graphical representation of a complex biological event. In the sentence "IL-4 gene regulation involves NFAT1 and NFAT2", Protein and Regulation nodes are connected by Theme and Cause edges (refer to [5]).

precision and recall and is their harmonic mean:

F-score = 2 · precision · recall / (precision + recall). (3)

The F-score evenly weighs precision and recall and forms a reliable measure for evaluating the performance of a text mining system. For more information, refer to [28].
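For concreteness, Equations (1)-(3) can be computed directly from sets of document identifiers. The function below is an illustrative sketch, not part of TEES itself:

```python
def precision_recall_fscore(relevant, retrieved):
    """Compute precision (1), recall (2), and F-score (3) from sets of
    relevant and retrieved document identifiers."""
    hits = len(relevant & retrieved)  # |relevant documents ∩ retrieved documents|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F-score: the harmonic mean of precision and recall.
    fscore = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, fscore
```

For example, with four relevant documents of which two appear among three retrieved ones, precision is 2/3, recall is 1/2, and the F-score is their harmonic mean, 4/7.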

2.1.2. Scheme of TEES. The core of TEES consists of two components, that is, classification-style trigger/event detection and rich features in a graph structure. By mapping tokenized words and entities to nodes and mapping event relations to edges between entities, TEES regards event extraction as a task of recognizing graph nodes and edges, as shown in Figure 1.

Generally, TEES first converts node recognition into a problem of 10-class classification, corresponding to the 9 event types defined in the shared task plus another class for the negative case. This procedure is defined as trigger detection. Thereafter, edge detection is defined as the recognition of the concrete relationships between entities, including the semantic direction and the theme/cause relation.

As in Figure 1, the word "IL-4" is assigned to class "protein", while "involves" is assigned to class "regulation". The edge between "IL-4" and "regulation" is labelled as "theme". Hence, we obtain a simple regulation event, namely, "IL-4 regulation", which means "IL-4" is the theme of "regulation". Similarly, the representation in Figure 1 indicates that the simple event "IL-4 regulation" is the theme of "involves" and that the protein "NFAT1" is the cause of "involves", from which a complex regulation event can be extracted; that is, "IL-4 regulation" regulates protein "NFAT1". For more detailed information, refer to Björne's work [5] and the website of TEES: http://bionlp.utu.fi/eventextractionsoftware.html.
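The graph view of the Figure 1 example can be made concrete with a small sketch. The dictionary encoding below is ours, not TEES's internal representation; it merely transcribes the node classes and theme/cause edges described in the text:

```python
# Nodes: word tokens typed as proteins or event triggers (Figure 1 example).
nodes = {
    "IL-4": "Protein",
    "regulation": "Regulation",  # trigger of the simple event "IL-4 regulation"
    "involves": "Regulation",    # trigger of the complex event
    "NFAT1": "Protein",
}

# Edges: (trigger, argument, role) triples. The nested theme shows that an
# event (here "regulation") can itself be an argument of another event.
edges = [
    ("regulation", "IL-4", "Theme"),
    ("involves", "regulation", "Theme"),
    ("involves", "NFAT1", "Cause"),
]

def arguments_of(trigger):
    """List the (role, argument) pairs attached to a trigger node."""
    return [(role, arg) for (t, arg, role) in edges if t == trigger]
```

Walking the edges from "involves" recovers the complex event: its theme is the nested "regulation" event and its cause is "NFAT1".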

The data set used in TEES consists of four files in the GENIA corpus [3]. Each file is given in a stand-off XML format with split sentences, annotation of known proteins, parts of speech (POS), and syntactic dependency tree information. One is train123.xml, which contains 800 abstracts with 7,482 sentences; another is devel123.xml, which contains 150 abstracts with 1,450 sentences. The file everything123.xml is the sum of the two files above, and test.xml is the test data set, which contains 950 abstracts with 8,932 sentences.

The scheme of the TEES system consists of three phases. First, linguistic features are generated in the feature generation phase. Second, in the training phase, train123.xml is used as the training data set, devel123.xml is used as the development set, and the optimum parameters are obtained. Then, in the third phase, using everything123.xml (the sum of train123.xml and devel123.xml) as the training set, test.xml is used as the unknown data set for event prediction. Events are extracted from this unknown set, and accuracy is subsequently evaluated. See Figure 2.

There are mainly two parameters in the grid search at the training stage. The first parameter is C in the polynomial kernel function of the support vector machine. Second, in order to set a proper precision-recall trade-off, a parameter β (β > 0) is introduced in trigger detection. For the "no trigger" class, the given classification score is multiplied by β so as to increase the possibility of tokens falling into a "trigger" class. Initially, β is set to 0.6, and it is then put into the grid search for obtaining the optimum value.
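The effect of β can be sketched as follows. The class names and scores below are hypothetical; this only illustrates the described trade-off, not TEES's actual classifier code:

```python
def pick_class(scores, beta=0.6):
    """Pick the highest-scoring class after multiplying the 'no trigger'
    (negative) class score by beta. With beta < 1, a positive negative-class
    score is shrunk, nudging borderline tokens into one of the nine trigger
    classes (higher recall at the cost of precision)."""
    adjusted = {cls: score * beta if cls == "no_trigger" else score
                for cls, score in scores.items()}
    return max(adjusted, key=adjusted.get)

# Hypothetical per-class scores for one token:
scores = {"no_trigger": 1.0, "Gene_expression": 0.8, "Binding": 0.2}
```

With β = 0.6 the token above is labelled Gene_expression; with β = 1.0 it stays no_trigger. The grid search then tunes β (together with C) against the development set.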

2.2. Definition of the Feature Generation Rule of the Turku System. Features used for trigger detection are designed in a rational way, and abundant features are generated from the training data set.

In terms of the training data set with the GENIA format, the dependency relations of each sentence are output by the Stanford parser [29], which addresses the syntactic relations between word tokens and thus converts the sentence to a graph, where a node denotes a word token and an edge corresponds to the grammatical relation between two tokens. By doing this, a directed acyclic graph (DAG) is constructed based on the dependency parse tree, and the shortest dependency path (SDP) is located.

For a targeted word in a sentence, which represents a node in the graph, the purpose of trigger detection is to recognize the event type belonging to the word. Meanwhile, edge detection is to recognize the theme/cause type between two entities. Therefore, both detections can be considered as multiclass classification.

The features produced in trigger detection are categorized into six classes, as listed in Figure 3, and the whole feature generation rule is listed in Supplementary Appendix A in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/205239. First, for a sentence in which the target word occurs, token information is counted for the whole sentence as a bag of words (BOW); this feature class is defined as the "sentence feature". The second feature class is the "main feature" of the target word, including part of speech (POS) and stem information output by the Porter Stemmer [29]. The third class, the "linear order feature", focuses on information about the neighbouring word tokens in the natural order of the sentence. TEES also maintains a fourth class about the microlexical information of the word token, for example, the upper or lower case of the word and the existence of a digit, double letters, or three letters in the word; these features constitute the "content feature" class. The fifth, "attached edge feature", class focuses on information about the neighbouring word tokens of the target word in the SDP. Finally, there is a sixth, "chain feature", class, which focuses on the information of the whole SDP instead.
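As an illustration of the fourth class, the microlexical "content features" might be generated along the following lines. This is our sketch, not TEES code; the feature names are illustrative, and we read "double letters"/"three letters" as character bigrams/trigrams:

```python
def content_features(token):
    """Sketch of 'content feature' generation for one word token:
    letter case, digit presence, and character bigram/trigram features."""
    feats = {"upper": token[:1].isupper(),
             "has_digit": any(ch.isdigit() for ch in token)}
    low = token.lower()
    for i in range(len(low) - 1):
        feats["dt_" + low[i:i + 2]] = True  # double-letter features
    for i in range(len(low) - 2):
        feats["tt_" + low[i:i + 3]] = True  # three-letter features
    return feats
```

For the token "IL-4", for instance, this yields the case and digit indicators plus bigrams such as "il" and trigrams such as "il-".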

Here, the features are well structured from the macro- and microperspectives of the sentence. Basically, the feature


Figure 2: Data set and pipeline of TEES, cited from the TEES Toolkit (TurkuEventExtractionSystem readme.pdf, p. 7, in the zip file http://bionlp.utu.fi/static/event-extractor/TurkuEventExtractionSystem-1.0.zip). The pipeline runs in three phases: linguistic feature generation; training on train123.xml and devel123.xml, yielding the SVM model and parameters; and testing on everything123.xml and test.xml, yielding the Tasks 1 and 2 predictions.

Figure 3: Feature classes in trigger detection and edge detection. Trigger feature classes: sentence feature (18,998), main feature (24,944), linear order feature (73,744), content feature (8,573), attached edge feature (100,561), and chain feature (178,345). Edge feature classes: entity feature (4,313), path length feature (25), terminus token feature (4,144), single element feature (2,593), path grams feature (441,684), path edge feature (25,038), sentence feature (33), and GENIA feature (4).

generation rules of the "main feature" and "content feature" mostly rely on microlevel information about the word token itself, while the "sentence feature" and "linear order feature" rely on macrolevel information about the sentence. Differently, the "attached edge" and "chain feature" classes rely on the dependency tree and graph information, especially the SDP.

Similarly, the features used for edge detection can be classified into 8 classes, namely, entity feature, path length feature, terminus token feature, single element feature, path grams feature, path edge feature, sentence feature, and GENIA feature. Among the features above, the third feature is omitted since it is considered part of the first feature.

2.3. Feature Evaluation Method: The Accumulated Effect Evaluation (AEE) Algorithm. A quantitative method is designed to evaluate the importance of feature classes. Here, the feature combination methods are ranked by F-score, and the


Step 1. Categorize the features into n feature classes; there are 2^n − 1 possible feature combinations. Denote N = 2^n − 1, run all of the classification tests, and record the F-score of each experiment.

Step 2. Sort the N feature combinations by F-score in descending order.

Step 3. FOR i = 1 to n, j = 1 to N: O_ij = 0; c_ij = 0. END FOR
(O_ij counts the occurrence of the ith feature among the top j experiments; c_ij evaluates the partial importance of the ith feature after j experiments.)

Step 4. FOR j = 1 to N (for each feature combination in the jth experiment):
    FOR i = 1 to n (for each feature):
        IF the ith feature occurs in the jth feature combination: O_ij++
        (for each experiment, the occurrence of the ith feature is accumulated) END IF
        c_ij = O_ij / j
        (dividing O_ij by j gives a numerical value evaluating the importance of the ith feature; since the combinations are ranked by F-score in descending order, a top-ranked feature combination corresponds to a smaller j and hence a bigger c_ij)
    END FOR
END FOR

Step 5. FOR i = 1 to n: AEE1(i) = Σ_{j=1}^{N} c_ij. END FOR
(Summing up c_ij over j for a fixed i gives the accumulated effect of the ith feature.)

Step 6. AEE1(i) is the objective function of the greedy search algorithm; sort AEE1(i) to find the most important feature.

Algorithm 1: The AEE1 algorithm.

occurrence of the ith feature class in the top j combinations (j runs from the top to the last) is counted, the rate of occurrence is computed, and finally the sum of the occurrences is calculated. The total calculation is shown in Supplementary Material Appendix C. The algorithm is denoted as the AEE1 algorithm, as shown in Algorithm 1.
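Algorithm 1 is short enough to transcribe directly. The sketch below (our code, not the authors') takes the feature combinations already sorted by F-score and accumulates c_ij = O_ij / j:

```python
def aee1(ranked_combos, n):
    """AEE1 (Algorithm 1): ranked_combos lists the N = 2^n - 1 feature
    combinations (as sets of feature IDs 1..n), sorted by F-score in
    descending order. Returns AEE1(i) = sum over j of c_ij = O_ij / j."""
    aee = {i: 0.0 for i in range(1, n + 1)}
    occ = {i: 0 for i in range(1, n + 1)}  # O_ij, accumulated over j
    for j, combo in enumerate(ranked_combos, start=1):
        for i in range(1, n + 1):
            if i in combo:
                occ[i] += 1
            aee[i] += occ[i] / j  # c_ij
    return aee

# Rank Result A from Table 1: combinations ranked 1&2, then 1, then 2.
scores = aee1([{1, 2}, {1}, {2}], n=2)  # AEE1(1) = 1/1 + 2/2 + 2/3 ≈ 2.667
```

Applied to the two-feature example of Table 1, this reproduces AEE1(1) ≈ 2.667 and AEE1(2) ≈ 2.167.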

Here, AEE1(i) is the contribution value of the ith feature and also the objective function of the greedy search algorithm, which reflects the accumulated effect of the ith feature among the top classifiers. Since AEE1(i) makes sense in a comparative way, the theoretical maximum and minimum of AEE1(i) are calculated by

Max AEE1 = 1 × 2^(n−1) + 2^(n−1)/(2^(n−1) + 1) + 2^(n−1)/(2^(n−1) + 2) + ⋯ + 2^(n−1)/(2^n − 1),

Min AEE1 = 0 × (2^(n−1) − 1) + 1/2^(n−1) + 2/(2^(n−1) + 1) + ⋯ + 2^(n−1)/(2^n − 1). (4)

The idea of AEE1 comes from the understanding that the top classifiers with higher F-scores include more efficient and important features. However, this consideration disregards the feature size of those top classifiers.

For better understanding, a simple case is considered. Assume there are two feature classes that could be used in the classifier, so there are three possible feature combinations, namely, 1, 2, and 1&2. Without loss of generality, assume that the best classifier uses the feature combination 1&2, the second one uses feature class 1, and the worst classifier uses feature class 2. We denote the ranked list 1&2, 1, 2 as Rank Result A. According to the third column in Table 1(a), the AEE value is calculated as AEE1(1) = 1/1 + 2/2 + 2/3 = 2.667.

As another example, assume a ranked list 1, 1&2, 2, which is denoted as Rank Result B and shown in Table 1(b). As can be seen from Table 1(b), AEE1(1) equals 2.667, the same as in Table 1(a). However, this is not a reasonable result, since feature 1 is more significant in Rank Result B, where it tops the ranking on its own, than in Rank Result A, once the feature class size is considered.

Therefore, an alternative algorithm, AEE2, is proposed by updating c_ij = O_ij / j in Step 4 of AEE1 to c_ij = O_ij / Σ_{i=1}^{n} O_ij. Here, the size of the feature combination is considered in the computation of c_ij, so a classifier with higher performance and a smaller feature size ensures a higher score for the features it owns. As an example, column 4 in Tables 1(a) and 1(b) shows that AEE2(1) = 1.667 in Rank Result A and AEE2(1) = 2.167 in Rank Result B. This result ensures the comparative advantage for feature class 1 in Rank Result B. Similarly, AEE2(i) mainly makes sense in a comparative way, with theoretical maximum and minimum values that are computed analogously.


Table 1: Examples for the AEEi algorithms.

(a) AEEi result for Rank Result A

Rank Result A | O_ij (i = 1, i = 2) | AEE1 c_ij (i = 1, i = 2) | AEE2 c_ij (i = 1, i = 2)
j = 1: 1&2    | 1, 1                | 1/1, 1/1                 | 1/2, 1/2
j = 2: 1      | 2, 1                | 2/2, 1/2                 | 2/3, 1/3
j = 3: 2      | 2, 2                | 2/3, 2/3                 | 2/4, 2/4
AEE1(i): 2.667, 2.167; AEE2(i): 1.667, 1.333

(b) AEEi result for Rank Result B

Rank Result B | O_ij (i = 1, i = 2) | AEE1 c_ij (i = 1, i = 2) | AEE2 c_ij (i = 1, i = 2)
j = 1: 1      | 1, 0                | 1/1, 0/1                 | 1/1, 0/1
j = 2: 1&2    | 2, 1                | 2/2, 1/2                 | 2/3, 1/3
j = 3: 2      | 2, 2                | 2/3, 2/3                 | 2/4, 2/4
AEE1(i): 2.667, 1.167; AEE2(i): 2.167, 0.833
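The AEE2 variant changes only the normalization of c_ij. A sketch (our code, not the authors') that reproduces the Table 1 values:

```python
def aee2(ranked_combos, n):
    """AEE2: as AEE1 but with c_ij = O_ij / (sum over i of O_ij), so a
    high-ranked classifier with a smaller feature combination contributes
    more to the features it contains."""
    aee = {i: 0.0 for i in range(1, n + 1)}
    occ = {i: 0 for i in range(1, n + 1)}  # O_ij, accumulated over j
    for combo in ranked_combos:
        for i in combo:
            occ[i] += 1
        total = sum(occ.values())  # sum of O_ij over all features i
        for i in range(1, n + 1):
            aee[i] += occ[i] / total
    return aee

rank_a = aee2([{1, 2}, {1}, {2}], n=2)  # Table 1(a)
rank_b = aee2([{1}, {1, 2}, {2}], n=2)  # Table 1(b)
```

Here rank_a[1] evaluates to 1/2 + 2/3 + 2/4 ≈ 1.667 and rank_b[1] to 1/1 + 2/3 + 2/4 ≈ 2.167, matching Table 1 and separating the two rankings that AEE1 ties.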

Figure 4: Flowchart of the research. Input data with the chosen feature classes under all combinations is fed to the Turku system, alternately fixing the edge-detection features while trying trigger feature combinations and fixing the trigger-detection features while trying edge feature combinations. Classifier performance is tested and the contribution of the feature classes is evaluated; if the result is not yet the best, important features are modified and replaced, and otherwise the feature set is output and the new system is built.

Considering AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table 1(a), the importance of feature class 1 prevails over that of feature class 2 in both cases. This ensures a reasonable consistency. In fact, both the AEE1 and AEE2 algorithms ensure a correct weighting rank among different feature classes, and a hybrid of the two methods is used in the experiments.

2.4. Flowchart of the Research. Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection, so that important feature classes are identified by evaluating the contribution of the individual features. Accordingly, code is written to enhance vital features. Thereafter, the classifiers with the newly updated features are tested, and better performance is expected. Compared with the previous TEES system, our main contribution is to import a feature selection strategy into Phase 1, that is, "linguistic feature selection", as shown in Figure 2. The flowchart of our research is shown in Figure 4.

3. Results

3.1. Evaluation of Individual Feature Classes by the Quantitative Method. Based on the AEE algorithm, all of the combinations of the feature classes are used in classifiers, and their


Table 2: Top 10 best classifiers with corresponding feature classes in the combination experiment.

Rank | F-score (%) | Trigger feature combination (edge features fixed) | F-score (%) | Edge feature combination (trigger features fixed)
1st  | 51.34 | 1&2&3&4&5     | 52.16 | 1&2&4&5&6&7&8
2nd  | 51.21 | 1&2&3&4&5&6   | 51.81 | 1&2&4&5&6&8
3rd  | 50.71 | 1&2&4&5       | 51.29 | 1&2&5&6&7&8
4th  | 50.19 | 1&2&3&4&6     | 51.18 | 1&2&5&6&8
5th  | 49.90 | 1&2&4&5&6     | 50.33 | 1&2&4&6&7&8
6th  | 49.74 | 1&3&4&5       | 50.23 | 1&2&4&6&8
7th  | 49.44 | 1&4&5         | 49.26 | 2&4&5&6&8
8th  | 49.16 | 1&2&3&4       | 49.02 | 1&2&4&5&6
9th  | 48.82 | 1&2&4&6       | 48.32 | 2&4&5&6&7&8
10th | 47.82 | 1&2&4         | 47.42 | 2&5&6&8

Table 3: Scores of features in trigger detection.

Feature ID   | 1                | 2            | 3                    | 4               | 5                     | 6             | Theoretical maximum | Theoretical minimum
Feature name | Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature |                     |
AEE1 score   | 40.83            | 39.94        | 32.99                | 52.23           | 36.12                 | 30.09         | 53.43               | 10.27
AEE2 score   | 10.80            | 10.82        | 8.99                 | 14.16           | 9.80                  | 8.43          | 20.52               | 3.28

Table 4: Scores of features in edge detection.

Feature ID   | 1              | 2                   | 4                      | 5                  | 6                 | 7                | 8             | Theoretical maximum | Theoretical minimum
Feature name | Entity feature | Path length feature | Single element feature | Path grams feature | Path edge feature | Sentence feature | GENIA feature |                     |
AEE1 score   | 77.60          | 83.84               | 68.45                  | 74.88              | 77.76             | 56.09            | 66.35         | 107.61              | 20.08
AEE2 score   | 18.63          | 19.57               | 16.57                  | 17.51              | 17.88             | 13.40            | 15.45         | 35.80               | 5.54

corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of the feature classes is addressed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the total calculation is shown in Supplementary Material Appendix B. The final results are collected in Tables 3 and 4.

During trigger detection, the results show that the 4th feature class performs best and the 6th performs worst. The comparison between the two shows how much better the 4th feature performs than the 6th, and the scores 52.23 and 30.09 also correspond to the areas below their respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

3.2. Modification of Features and New Experiment Results. Combinations of features for trigger detection show that the "content feature" is the most important one and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the feature generation rule of the 4th feature class into consideration, the "content feature" class contains four features: "upper", "has", "dt", and "tt". Specifically, "upper" identifies the upper or lower case of letters, "has" addresses the existence of a digit or hyphen, "dt" records two continuous letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, the feature generation rule can be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which four consecutive letters in the word are considered. Moreover, the 6th feature class is modified by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
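As an illustration of these generation rules, the content-feature class can be sketched as character-level flags plus character n-grams. The function and feature names below are assumptions for illustration, not TEES code; "ft" is the new four-letter feature proposed here.

```python
def content_features(token):
    """Illustrative sketch of the 'content feature' rules: case flag,
    digit/hyphen flags, and character 2-/3-/4-grams (dt, tt, ft)."""
    feats = {}
    if token[:1].isupper():
        feats["upper"] = 1                       # upper-case initial letter
    if any(ch.isdigit() for ch in token):
        feats["has_digit"] = 1                   # contains a digit
    if "-" in token:
        feats["has_hyphen"] = 1                  # contains a hyphen
    low = token.lower()
    for n, name in ((2, "dt"), (3, "tt"), (4, "ft")):
        for k in range(len(low) - n + 1):
            feats[f"{name}_{low[k:k + n]}"] = 1  # character n-grams
    return feats

print(sorted(content_features("IL-4")))
```

For the token "IL-4" this yields the case flag, both "has" flags, three bigrams, two trigrams, and one four-gram, showing how "ft" simply extends the existing "dt"/"tt" pattern by one more character.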

Furthermore, we compare the best performance between classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features

8 BioMed Research International

Figure 5: AEE1 and AEE2 plots of the best/worst feature in trigger detection. [Four panels plot the C4j and C6j values (from 0 to 1) against the experiment round (up to about 60) for AEE1(4), AEE1(6), AEE2(4), and AEE2(6).]

(all include the original edge features). The best combinations of trigger features are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, with F-scores of 51.34, 52.21, 51.93, and 51.99. The new combination results are listed in Supplementary Material Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21. Generally, the best feature combination covers the major part of the feature set. An interesting phenomenon in the best combination is the absence of the 6th or 6′ feature class, which indicates that this feature class is redundant and can be dispensed with.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best feature and the worst feature (the 2nd and the 7th) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and

7′. With the fixed trigger features 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features, and obtain the best classifier in each combination experiment. The best classifiers own the feature sets 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, and 1&2′&4&5&6&7′&8, and the achieved F-scores are 52.16, 52.37, 52.47, and 52.68, respectively.

In the above experiments, we test the performance of the trigger feature classes by fixing the edge features. Likewise, we test the edge features by fixing the trigger features. We observe that the feature modifications in this phase are indeed capable of achieving improvement, where all of the best combinations perform better than the result of the best trigger. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and eventually the best combination of feature sets achieves the highest score of 53.27, which is better than the best performance of 51.21 previously


Figure 6: AEE1 and AEE2 plots of the best/worst feature in edge detection. [Four panels plot the C2j and C7j values (from 0 to 1) against the experiment round (up to about 120) for AEE1(2), AEE1(7), AEE2(2), and AEE2(7).]

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with trigger features 1&2&3&4′&5 and edge features 1&2′&4&5&6&7′&8, where trigger-c = 250000, edge-c = 28000, and β = 0.7. The final result is listed in Table 5.

Compared with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contribution is shown, which yields the same finding and supports the same conclusion: the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we can get separate F-scores based on these features. Dividing the F-score by the feature size, we roughly calculate the average contribution. This value partly reflects the contribution of a feature class relative to its size. The result in Table 7 shows that the average contribution of the 4th feature is 0.003, which is the greatest score achieved by an individual feature class.
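This per-feature average can be computed directly. The numbers below are rounded illustrations, not the paper's exact figures (the reported 0.003 for the 4th class comes from its single-class F-score of about 27 over its 8573 features).

```python
def average_contribution(f_score, feature_size):
    """Average contribution of a feature class: its single-class
    F-score divided by the number of features in the class."""
    return f_score / feature_size

# Toy check with round numbers close to the 4th (content) feature class.
print(average_contribution(27.0, 9000))  # -> 0.003
```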

Second, we observe all of the double combinations involving the ith feature, and note that in most cases the best performance is reached when the ith feature is combined with the 4th feature. Even the worst performing double combination involving the 4th feature performs much better than the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8).

Task 1                                           | Recall | Precision | F-score
Strict evaluation mode                           | 46.23  | 62.84     | 53.27
Approximate span and recursive mode              | 49.69  | 67.48     | 57.24
Event decomposition in the approximate span mode | 51.19  | 73.21     | 60.25

Task 2                                           | Recall | Precision | F-score
Strict evaluation mode                           | 44.90  | 61.11     | 51.77
Approximate span and recursive mode              | 48.41  | 65.81     | 55.79
Event decomposition in the approximate span mode | 50.52  | 73.15     | 59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2(Recall × Precision)/(Recall + Precision).

Figure 7: F-score contour performance of participants in the GENIA task. This F-score is evaluated under the approximate span and recursive mode. Our current system is marked with a full triangular label. [Recall is plotted against precision for TEES in Bjorne 2009, TEES in Bjorne 2011, the other participants in the GENIA task in BioNLP 2009, the TEES 1.01 baseline, and our system.]

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, when the ith feature is combined with the 4th feature, the best performance is reached. As before, even the worst performance of the combinations involving the 4th feature is much better than that of the other combinations. See Table 7.

Finally, in yet another analysis, we observe the cases where the ith feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. Compared with the other classes, the numerical contribution value of the 4th feature is greater, which confirms the judgment. Furthermore, we can sort these features according to their contribution values. These results further support our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis shows a substantially positive evaluation of the 4th trigger feature, which is confirmed by the results of the quantitative analysis of the AEE algorithm. This shows a consistent tendency of feature importance, which in turn demonstrates the reliability of the AEE algorithm. Since it is clumsy to use routine analysis to analyze all of the features, we expect the AEEi algorithm to be useful in generalized circumstances.

4.2. Linguistic Analysis of Chosen Trigger Features and Importance of the Microlexical Feature. The effectiveness of the 4th trigger feature motivated the feature class modification by inserting "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which is similar to stems but contains more abundant information than stems. Besides, we can also analyze and ascertain the importance of other features, such as POS information, through feature combinations. Here the edge features are fixed, and only a smaller section of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" are the stem part and the left-over part of the token after applying the Porter Stemmer [29]. They are also generated at the microlexical level, similar to "dt" and "tt". The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.

4.3. Strategy Discussion in Feature Selection under a Machine Learning Strategy. As a matter of fact, trigger features can be analyzed according to their generation rules, namely, sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature. This is a state-of-the-art strategy in feature selection. TEES is a fine machine-learning-based system which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research mainly stems from two reasons. The first is that the core idea of machine learning is simply to put enough features into the classifier as a black box; the second is that a classifier with a huge feature set usually performs better in accuracy and F-score. Previously, the features used in TEES were mostly chosen by trial and error: by adding some code to produce additional features and observing how they affect the performance of the system, TEES always achieved a higher F-score, but with unsure directions. This strategy introduces useful but uncertain features and produces a large number of features. By introducing a feature selection and evaluation method, the importance of different feature classes can be ranked in a proper way, which helps to identify the important features to be modified effectively. Therefore, in terms of the current research regarding the development of the Turku Event Extraction System, we believe that our research


Table 6: Comparison of F-score performance in Task 1 with other systems (under the primary criterion, approximate span and recursive).

System      | Bjorne et al. 2009 [13] | Bjorne et al. 2011 [15] | Bjorne et al. 2012 [16] | TEES 1.01 | Riedel et al. 2011 [14] | Ours
F-score (%) | 51.95                   | 52.86                   | 53.15                   | 55.65     | 56.00                   | 57.24

Table 7: The best feature combination after choosing features dynamically in trigger and edge detection.

Feature ID         | 1                | 2            | 3                    | 4               | 5                     | 6
Feature in trigger | Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature
Feature size       | 18998            | 24944        | 73744                | 8573            | 100561                | 178345

(Merely one feature analysis)
F-score under one class | 0 | 42.05 | 3.50 | 27.11 | 7.33 | 5.48
Average contribution    | 0 | 0.001 | 4e−5 | 0.003 | 7e−5 | 3e−5

(Double feature combination analysis)
Best performance involving ith feature  | (+4) 45.39 | (+4) 36.65 | (+4) 22.29 | (+1) 45.39 | (+2) 28.27 | (+4) 23.48
Worst performance involving ith feature | (+2) 0     | (+1) 0     | (+1) 0     | (+3) 22.29 | (+1) 0.91  | (+1) 3.10

(Three-feature combination analysis)
Best performance involving ith feature  | (+4,5) 49.44 | (+1,4) 47.82 | (+1,4) 45.13 | (+1,5) 49.44 | (+1,4) 49.44 | (+1,4) 45.51
Worst performance involving ith feature | (+2,3) 0.11  | (+1,3) 0.11  | (+1,2) 0.11  | (+3,6) 20.76 | (+1,3) 1.98  | (+1,3) 2.40

(Four-feature combination analysis)
Best performance involving ith feature  | (+2,4,5) 50.71 | (+1,4,5) 50.71 | (+1,4,5) 49.74 | (+1,2,5) 50.71 | (+1,2,4) 50.71 | (+1,2,4) 48.82
Worst performance involving ith feature | (+2,3,6) 5.77  | (+1,3,6) 5.77  | (+1,2,6) 5.77  | (+3,5,6) 21.13 | (+1,2,3) 7.02  | (+1,2,3) 5.77

(Five-feature combination analysis)
Performance without ith feature | 34.22 | 4.71 | 49.90 | 16.37 | 50.19 | 51.34

Table 8: Analysis of average contribution of lexical features.

Rank | Feature | Feature class   | F-score | Feature size | Average contribution
1    | Nonstem | Main feature    | 6.43    | 154          | 0.04175
2    | POS     | Main feature    | 1.52    | 47           | 0.03234
3    | dt      | Content feature | 20.13   | 1172         | 0.01718
4    | tt      | Content feature | 27.63   | 7395         | 0.00374
5    | Stem    | Main feature    | 35.45   | 11016        | 0.00322

reported in this paper is helpful for further improving this system. We also believe that our strategy will serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by a grant received from the General Research Fund of the University Grants Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711); City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188); the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120); and the National Natural Science Foundation


of China (Grant no. 61202305). The authors would also like to acknowledge support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug–drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, article S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77–94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the NAACL 2009 Workshop on Natural Language Processing in Biomedicine (BioNLP), pp. 1–9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1–6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10–18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP," in Proceedings of the BioNLP Shared Task Workshop, pp. 51–55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541–557, 2011.

[16] J. Bjorne, G. Filip, and T. Salakoski, "University of Turku in the BioNLP '11 shared task," BMC Bioinformatics, vol. 13, supplement 11, article S4, 2012.

[17] J. Bjorne, S. Van Landeghem, S. Pyysalo et al., "Scale event extraction for post-translational modifications, epigenetics and protein structural relations," Proceedings of BioNLP, pp. 82–90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28–36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183–191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP shared task," in Association for Computational Linguistics (ACL '13), pp. 16–25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309–311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684–689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. Carlos Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306–313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. van Landeghem, T. Abeel, Y. Saeys, and Y. van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554–i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.


16] enhanced TEES, and the F-score was increased further from 52.86 to 53.15. The system was kept updated, and its 1.01 version achieved 55.65 in F-score. Meanwhile, TEES was also utilized for practical database searching. The system was applied to all PubMed citations, and a number of analyses of the extracted information were illustrated [17–19]. TEES is now available at http://bionlp.utu.fi. In 2011, the BioNLP shared task expanded from the GENIA task to eight generalized tasks, including the GENIA task, the epigenetics and posttranslational modification task, the infectious disease task, the bacteria biotope task, and the bacteria interaction task [11]. As the only system that participated in all the tasks in 2011, TEES was again ranked in first place for four out of the eight tasks. Afterwards, Bjorne and Salakoski updated TEES 1.01 to TEES 2.0 in August 2012. The updated system is now capable of handling the DDI'11 (drug-drug interaction extraction) challenge (http://jbjorne.github.io/TEES) in addition to all the BioNLP 2011 shared tasks. At the moment, the most up-to-date version is TEES 2.1, released in 2013 [20]; a major change is the wider coverage for serving more XML-format corpora, while the core algorithm remains unchanged.

As a sophisticated text mining classifier, TEES uses enormous feature classes. For the BioNLP 2009 shared tasks, it produced 405165 features for trigger detection and 477840 features for edge detection out of the training data, which not only consume large amounts of processing time but also create undesirable noise that affects system performance. To address this issue and achieve even better system performance in terms of processing time and recognition accuracy, a natural starting point is to perform an in-depth examination of the system processes for feature selection (FS) and feature reduction (FR).

The integration of FS and FR has been demonstrated to hold good potential to enhance an existing system [21–24]. Feature reduction can delete redundant features, avoid overfitting, provide more efficient modelling, and help gain a deeper insight into the importance of features. It aims to obtain a subset through an optimization between the feature size and the performance [25]. Amongst the feature reduction algorithms, the greedy search algorithm, also called the hill climbing algorithm, is an optimal technique for the identification and selection of important features [26]. van Landeghem et al. [27] introduced a feature selection strategy in the Turku system for a different testing data set and illustrated how feature selection could be applied to better understand the extraction framework. In their scheme, candidate events were trained to create a feature selector for each event type. The classifier was then built with the filtered training samples. By doing this, different trigger words for different event types were discriminated, and individual features with a higher occurrence frequency were assigned higher weights. Tag clouds of lexical vertex walks showed certain properties that are interesting from a linguistic point of view.
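A minimal sketch of such a greedy (hill climbing) forward selection, with a toy additive evaluator standing in for a full classifier run; the function names and the weights are invented for illustration, not taken from TEES.

```python
def greedy_forward_selection(features, evaluate):
    """Hill-climbing feature selection: repeatedly add the feature
    that most improves the evaluation score, and stop when no
    addition helps. `evaluate` maps a feature set to a score."""
    selected, best = set(), evaluate(set())
    improved = True
    while improved:
        improved = False
        for f in features - selected:
            score = evaluate(selected | {f})
            if score > best:
                best, choice, improved = score, f, True
        if improved:
            selected.add(choice)
    return selected, best

# Toy evaluator: features 1 and 2 are useful, 3 is noise.
weights = {1: 0.4, 2: 0.3, 3: -0.1}
sel, score = greedy_forward_selection({1, 2, 3},
                                      lambda s: sum(weights[f] for f in s))
print(sel, round(score, 2))  # -> {1, 2} 0.7
```

In the real setting, `evaluate` would be a full train-and-score cycle over a candidate feature-class combination, which is why reducing the number of evaluations matters.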

The purpose of this paper is to focus on the feature selection rules of TEES with the aim of improving its performance on the GENIA task. By designing an objective function in the greedy search algorithm, we propose an accumulated effect evaluation (AEE) algorithm which is simple and effective

and can be used to numerically evaluate the contribution of each feature separately with respect to its performance in the combined test. Moreover, we make further changes to the core feature classes by incorporating new features and merging redundant features in the system. The updated system was evaluated and found to produce a better performance in both Tasks 1 and 2. In Task 1, our system achieves a higher F-score of 53.27 according to the "strict evaluation" criterion and 57.24 according to the "approximate span and recursive" criterion. In Task 2, the new strategy achieved an F-score of 51.77 under the "strict evaluation" criterion and 55.79 under the "approximate span and recursive" criterion. These represent the best performance scores to date for event recognition and extraction.

2. Materials and Method

2.1. GENIA Task and Scheme of the Turku TEES System

2.1.1. GENIA Task and the Evaluation Criteria. Different from the previous NER task, the GENIA task aims to recognize both the entities and the event relationships between such entities. Extended from the idea of semantic networks, the recognition task includes the classification of entities (nodes) and their associated events (edges). The participants of the shared task were expected to identify nine events concerning given proteins, that is, gene expression, transcription, protein catabolism, localization, binding, phosphorylation, regulation, positive regulation, and negative regulation. The mandatory core task, Task 1, involves event trigger detection, event typing, and primary argument recognition [10]. An additional optional task, Task 2, involves the recognition of entities and the assignment of these entities. Finally, Task 3 targets the recognition of negation and speculation. Because of their fundamental importance and vast potential for future developments, researchers mainly focus on Tasks 1 and 2.

Like other text mining systems, TEES is evaluated by precision, recall, and F-score. The first measure equals the fraction of the retrieved documents that are relevant and represents the correctness of the extraction system, that is,

Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|.  (1)

The second measure, recall, is defined as

Recall = |relevant documents ∩ retrieved documents| / |relevant documents|.  (2)

Recall is used to assess the fraction of the documents relevant to the query that are successfully retrieved. Precision and recall are well-known performance measures in text mining, while the F-score is a third measure that combines


Figure 1: The graphical representation of a complex biological event (refer to [13]). [The sentence "IL-4 gene regulation involves NFAT1 and NFAT2" is drawn as a graph: Protein and Regulation nodes are linked by Theme and Cause edges.]

precision and recall and is their harmonic mean:

F-score = 2 · (precision · recall) / (precision + recall).  (3)

The F-score evenly weighs precision and recall and forms a reliable measure for evaluating the performance of a text mining system. For more information, refer to [28].
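The three measures can be computed together; a small sketch rewriting Eqs. (1)-(3) over true positive, false positive, and false negative counts rather than document sets:

```python
def f_score(tp, fp, fn):
    """Precision, recall, and their harmonic mean (the F-score)."""
    precision = tp / (tp + fp)          # correct fraction of retrieved
    recall = tp / (tp + fn)             # retrieved fraction of relevant
    return precision, recall, 2 * precision * recall / (precision + recall)

p, r, f = f_score(tp=60, fp=40, fn=40)
print(round(p, 2), round(r, 2), round(f, 2))  # -> 0.6 0.6 0.6
```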

2.1.2. Scheme of TEES. The core of TEES consists of two components, that is, classification-style trigger/event detection and rich features in a graph structure. By mapping tokenized words and entities to nodes and mapping event relations to edges between entities, TEES regards event extraction as a task of recognizing graph nodes and edges, as shown in Figure 1.

Generally, TEES first converts node recognition to a 10-class classification problem, corresponding to the 9 events defined in the shared task plus another class for negative cases. This procedure is defined as trigger detection. Thereafter, edge detection is defined as the recognition of the concrete relationships between entities, including the semantic direction and the theme/cause relation.

As in Figure 1, the word "IL-4" is assigned to the class "protein", while "involves" is assigned to the class "regulation". The edge between "IL-4" and "regulation" is labelled "theme". Hence we obtain a simple regulation event, namely "IL-4 regulation", which means "IL-4" is the theme of "regulation". Similarly, the representation in Figure 1 indicates that the simple event "IL-4 regulation" is the theme of "involves" and the protein "NFAT1" is the cause of "involves", from which a complex regulation event can be extracted; that is, "IL-4 regulation" regulates protein "NFAT1". For more detailed information, refer to Bjorne's work [13] and the TEES website: http://bionlp.utu.fi/eventextractionsoftware.html.

The data set used in TEES consists of four files in the GENIA corpus [12]. Each file is given in a stand-off XML format with split sentences, annotation of known proteins, parts of speech (POS), and syntactic dependency tree information. One is train123.xml, which contains 800 abstracts with 7482 sentences; another is devel123.xml, which contains 150 abstracts with 1450 sentences. The file everything123.xml is the sum of the two files above, and test.xml is the test data set, which contains 950 abstracts with 8932 sentences.

The scheme of the TEES system consists of three phases. First, linguistic features are generated in the feature generation phase. Second, in the training phase, train123.xml is used as the training data set, devel123.xml is used as the development set, and the optimum parameters are obtained. Then, in the third phase, using everything123.xml (the sum of train123.xml and devel123.xml) as the training set, test.xml is used as the unknown data set for event prediction. Events are extracted from this unknown set, and the accuracy is subsequently evaluated. See Figure 2.

Mainly, there are two parameters in the grid search at the training stage. The first parameter is C in the polynomial kernel function of the support vector machine. Second, in order to set a proper precision-recall trade-off, a parameter β (β > 0) is introduced in trigger detection. For the "no trigger" class, the given classification score is multiplied by β so as to increase the possibility of tokens falling into a "trigger" class. Optionally, β is set to 0.6, and it is put into the grid search for obtaining the optimum value.
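A sketch of this β trade-off, assuming a per-token score for each class; the class names and score values below are invented for illustration, not TEES output.

```python
def classify_trigger(scores, beta=0.6):
    """Recall/precision trade-off sketch: multiply the 'no trigger'
    (negative) class score by beta before taking the argmax, so that
    beta < 1 lets trigger classes win more often."""
    adjusted = dict(scores)
    adjusted["neg"] = adjusted["neg"] * beta
    return max(adjusted, key=adjusted.get)

scores = {"neg": 1.0, "gene_expression": 0.8, "binding": 0.5}
print(classify_trigger(scores))            # beta = 0.6 -> "gene_expression"
print(classify_trigger(scores, beta=1.0))  # no adjustment -> "neg"
```

Smaller β raises recall at the cost of precision, which is why β itself is included in the grid search.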

2.2. Definition of the Feature Generation Rule of the Turku System. Features used for trigger detection are designed in a rational way, and abundant features are generated from the training data set.

For the training data set in the GENIA format, the dependency relations of each sentence are output by the Stanford parser [29], which addresses the syntactic relations between word tokens and thus converts the sentence into a graph, where a node denotes a word token and an edge corresponds to the grammatical relation between two tokens. By doing this, a directed acyclic graph (DAG) is constructed based on the dependency parse tree, and the shortest dependency path (SDP) is located.
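The SDP computation can be sketched with a plain breadth-first search over the dependency graph; the toy edges below stand in for real Stanford-parser output (the token pairs are our own illustration, echoing the Figure 1 example):

```python
from collections import deque

def shortest_dependency_path(edges, start, goal):
    """Breadth-first search over an undirected view of the dependency
    graph; returns the token sequence of the shortest dependency path."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path: the tokens are disconnected

# Toy dependency edges (head, dependent) for the Figure 1 sentence:
edges = [("involves", "regulation"), ("regulation", "IL-4"),
         ("involves", "NFAT1")]
print(shortest_dependency_path(edges, "IL-4", "NFAT1"))
# ['IL-4', 'regulation', 'involves', 'NFAT1']
```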

For a targeted word in a sentence, which represents a node in the graph, the purpose of trigger detection is to recognize the event type of the word, while edge detection is to recognize the theme/cause type between two entities. Therefore, both detections can be considered as multiclass classification.

The features produced in trigger detection are categorized into six classes, as listed in Figure 3, and the whole feature generation rule is listed in Supplementary Appendix A in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/205239. First, for the sentence in which the target word occurs, token information is counted over the whole sentence as a bag of words (BOW); this feature class is defined as the "sentence feature". The second feature class is the "main feature" of the target word, including part-of-speech (POS) and stem information output by the Porter Stemmer [29]. The third class, the "linear order feature", focuses on information about the neighboring word tokens in the natural order of the sentence. TEES also maintains a fourth class covering micro-lexical information about the word token, for example, the upper or lower case of the word, the existence of a digit or hyphen, and double-letter and three-letter sequences in the word; these features constitute the "content feature" class. The fifth, "attached edge feature" class focuses on information about the neighboring word tokens of the target word in the SDP. Finally, there is a sixth, "chain feature" class, which focuses on information about the whole SDP instead.

Here the features are well structured from both the macro- and microperspectives of the sentence. Basically, the feature generation rules of the "main feature" and "content feature" classes mostly rely on microlevel information about the word token itself, while the "sentence feature" and "linear order feature" classes rely on macrolevel information about the sentence. By contrast, the "attached edge feature" and "chain feature" classes rely on the dependency tree and graph information, especially the SDP.

4 BioMed Research International

[Figure 2: Data set and pipeline of TEES, cited from the TEES toolkit (TurkuEventExtractionSystem/readme.pdf, p. 7, in the zip file http://bionlp.utu.fi/static/event-extractor/TurkuEventExtractionSystem-1.0.zip). The pipeline comprises a first phase of linguistic feature generation; a second, training phase in which BioNLP09Train.py is run on train123.xml and devel123.xml to obtain the SVM model and parameters; and a third, testing phase on everything123.xml and test.xml for Tasks 1 and 2.]

[Figure 3: Feature classes in trigger detection and edge detection, with feature sizes. Trigger feature classes: sentence feature (18998), main feature (24944), linear order feature (73744), content feature (8573), attached edge feature (100561), chain feature (178345). Edge feature classes: entity feature (4313), path length feature (25), terminus token feature (4144), single element feature (2593), path grams feature (441684), path edge feature (25038), sentence feature (33), GENIA feature (4).]

Similarly, features used for edge detection can be classified into eight classes, namely, the entity feature, path length feature, terminus token feature, single element feature, path grams feature, path edge feature, sentence feature, and GENIA feature. Among these, the third feature (the terminus token feature) is omitted in what follows, since it is considered part of the first feature.

2.3. Feature Evaluation Method: Accumulated Effect Evaluation (AEE) Algorithm. A quantitative method is designed to evaluate the importance of feature classes. Here, the feature combination methods are ranked by F-score, and the


Step 1. Categorize the features into n feature classes; there are then 2^n - 1 possible feature combinations. Denote N = 2^n - 1, run the classification test for every combination, and record the F-score of each experiment.

Step 2. Sort the N feature combinations by F-score in descending order.

Step 3. FOR i = 1 to n, j = 1 to N: O_ij = 0, c_ij = 0. END FOR
(O_ij counts the occurrences of the ith feature among the top j experiments; c_ij evaluates the partial importance of the ith feature after j experiments.)

Step 4. FOR j = 1 to N (each feature combination, in rank order):
    FOR i = 1 to n (each feature):
        IF the ith feature occurs in the jth feature combination THEN O_ij++ END IF
        c_ij = O_ij / j
    END FOR
END FOR
(Dividing O_ij by j gives a numerical value for the importance of the ith feature; since the combinations are ranked by F-score in descending order, a top-ranked combination corresponds to a smaller j and hence a bigger c_ij.)

Step 5. FOR i = 1 to n: AEE1(i) = sum_{j=1}^{N} c_ij. END FOR
(Summing c_ij over j for a fixed i gives the accumulated effect of the ith feature.)

Step 6. AEE1(i) is the objective function of the greedy search algorithm; sort the AEE1(i) values to find the most important feature.

Algorithm 1: The AEE1 algorithm.

occurrence of the ith feature class among the top j combinations (with j running from the top rank to the last) is counted, the rate of occurrence is computed, and finally the sum of these rates is calculated. The total calculation is shown in Supplementary Material Appendix C. The algorithm is denoted as the AEE1 algorithm, as shown in Algorithm 1.
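Algorithm 1 can be condensed into a few lines. The sketch below is our reading of the AEE1 pseudocode, not the authors' code; it reproduces the Rank Result A values of Table 1(a):

```python
def aee1(ranked, n):
    """ranked: feature combinations (sets of feature IDs 1..n) sorted by
    F-score in descending order; returns AEE1(i) for each feature i."""
    score = {i: 0.0 for i in range(1, n + 1)}
    occ = {i: 0 for i in range(1, n + 1)}  # O_ij, accumulated over ranks j
    for j, combo in enumerate(ranked, start=1):
        for i in combo:
            occ[i] += 1
        for i in score:
            score[i] += occ[i] / j  # c_ij = O_ij / j
    return score

# Rank Result A from Table 1(a): 1&2 ranks first, then 1, then 2.
result = aee1([{1, 2}, {1}, {2}], n=2)
# AEE1(1) = 2.667 and AEE1(2) = 2.167, to three decimals.
```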

Here, AEE1(i) is the contribution value of the ith feature and also the objective function of the greedy search algorithm; it reflects the accumulated effect of the ith feature among the top classifiers. Since AEE1(i) makes sense in a comparative way, the theoretical maximum and minimum of AEE1(i) are calculated by

Max AEE1 = 1 × 2^(n-1) + 2^(n-1)/(2^(n-1) + 1) + 2^(n-1)/(2^(n-1) + 2) + ... + 2^(n-1)/(2^n - 1),

Min AEE1 = 0 × (2^(n-1) - 1) + 1/2^(n-1) + 2/(2^(n-1) + 1) + ... + 2^(n-1)/(2^n - 1).    (4)
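The bounds in equation (4) can be verified numerically. The sketch below is ours, using the fact that each of the n feature classes occurs in 2^(n-1) of the N = 2^n - 1 combinations; for the six trigger feature classes (n = 6) it reproduces the theoretical maximum and minimum reported later in Table 3 (53.43 and 10.27):

```python
def aee1_bounds(n):
    """Theoretical max/min of AEE1 from equation (4).  The max assumes the
    2**(n-1) combinations containing the feature occupy the top ranks (so
    c_ij = 1 there); the min assumes they occupy the bottom ranks."""
    half, N = 2 ** (n - 1), 2 ** n - 1
    max_val = half + sum(half / j for j in range(half + 1, N + 1))
    min_val = sum((j - half + 1) / j for j in range(half, N + 1))
    return max_val, min_val

mx, mn = aee1_bounds(6)
print(round(mx, 2), round(mn, 2))  # 53.43 10.27
```

For n = 2 the maximum is 2 + 2/3 = 2.667, matching AEE1(1) in Rank Result A, where both combinations containing feature 1 occupy the top two ranks.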

The idea of AEE1 comes from the understanding that the top classifiers with higher F-scores include the more efficient and important features. However, this consideration disregards the feature size of those top classifiers.

For better understanding, a simple case is considered. Assume there are two feature classes that could be used in the classifier, so there are three possible feature combinations, namely, 1, 2, and 1&2. Without loss of generality, we assume that the best classifier uses the feature classes 1&2, the second one uses feature class 1, and the worst classifier uses feature class 2. Here we denote the ranked list 1&2, 1, 2 as Rank Result A. According to the third column in Table 1(a), the AEE value is calculated as AEE1(1) = 1/1 + 2/2 + 2/3 = 2.667.

As another example, we assume a ranked list 1, 1&2, 2, which is denoted as Rank Result B and shown in Table 1(b). As can be seen from Table 1(b), AEE1(1) again equals 2.667, the same as in Table 1(a). However, this is not a reasonable result, since feature 1 is more significant in one ranking than in the other once the feature class size is considered.

Therefore, an alternative algorithm, AEE2, is proposed by updating c_ij = O_ij / j in Step 4 of AEE1 to c_ij = O_ij / sum_{i=1}^{n} O_ij. Here the size of the feature class is taken into account in the computation of c_ij, and a classifier with higher performance and a smaller feature set gives a higher score to the features it owns. As an example, column 4 in Tables 1(a) and 1(b) shows that AEE2(1) = 1.667 in Rank Result A and AEE2(1) = 2.167 in Rank Result B; this restores the comparative advantage of feature class 1 in Rank Result B, where it tops the ranking on its own. Like AEE1(i), AEE2(i) mainly makes sense in a comparative way, and its theoretical maximum and minimum values are computed similarly.
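The AEE2 variant differs from AEE1 only in the normalization of c_ij. A minimal sketch (ours, not the authors' code) reproduces all four AEE2 values of Table 1 and shows how it separates Rank Results A and B:

```python
def aee2(ranked, n):
    """AEE1 with Step 4 updated to c_ij = O_ij / sum_i O_ij, so that
    top-ranked classifiers with smaller feature sets give their features
    more credit."""
    score = {i: 0.0 for i in range(1, n + 1)}
    occ = {i: 0 for i in range(1, n + 1)}
    for combo in ranked:
        for i in combo:
            occ[i] += 1
        total = sum(occ.values())
        for i in score:
            score[i] += occ[i] / total
    return score

a = aee2([{1, 2}, {1}, {2}], n=2)  # Rank Result A: 1.667 vs 1.333
b = aee2([{1}, {1, 2}, {2}], n=2)  # Rank Result B: 2.167 vs 0.833
print(b[1] > a[1])  # True: feature 1 earns more credit in Rank Result B
```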


Table 1: Examples for the AEEi algorithms.

(a) AEEi results for Rank Result A

Rank Result A       | O_1j | O_2j | AEE1 c_1j | AEE1 c_2j | AEE2 c_1j | AEE2 c_2j
j = 1: 1&2          | 1    | 1    | 1/1       | 1/1       | 1/2       | 1/2
j = 2: 1            | 2    | 1    | 2/2       | 1/2       | 2/3       | 1/3
j = 3: 2            | 2    | 2    | 2/3       | 2/3       | 2/4       | 2/4

AEE1(1) = 2.667, AEE1(2) = 2.167; AEE2(1) = 1.667, AEE2(2) = 1.333.

(b) AEEi results for Rank Result B

Rank Result B       | O_1j | O_2j | AEE1 c_1j | AEE1 c_2j | AEE2 c_1j | AEE2 c_2j
j = 1: 1            | 1    | 0    | 1/1       | 0/1       | 1/1       | 0/1
j = 2: 1&2          | 2    | 1    | 2/2       | 1/2       | 2/3       | 1/3
j = 3: 2            | 2    | 2    | 2/3       | 2/3       | 2/4       | 2/4

AEE1(1) = 2.667, AEE1(2) = 1.167; AEE2(1) = 2.167, AEE2(2) = 0.833.

[Figure 4: Flowchart of the research. Input data with the chosen feature classes under all combinations; with the edge features fixed, try the trigger feature combinations, and with the trigger features fixed, try the edge feature combinations; test the classifier performance in the Turku system and evaluate the contribution of the feature classes; if the result is not yet the best, modify and replace the important features; otherwise, output the feature set and build the new system.]

Considering that AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table 1(a), the importance of feature class 1 prevails over that of feature class 2 in both cases, which ensures a reasonable consistency. Both the AEE1 and AEE2 algorithms thus produce a sound importance ranking among the feature classes, and a hybrid of the two methods is used in the experiments.

2.4. Flowchart of the Research. Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection, so that important feature classes are identified by evaluating the contribution of the individual features.

Accordingly, code is written to enhance the vital features; thereafter, the classifiers with the updated features are tested, and better performance is expected. Compared with the previous TEES system, our main contribution is to introduce a feature selection strategy into Phase 1, that is, "linguistic feature selection" as shown in Figure 2. The flowchart of our research is shown in Figure 4.

3. Results

3.1. Evaluation of Individual Feature Classes by the Quantitative Method. Based on the AEE algorithm, all combinations of the feature classes are used in classifiers, and their


Table 2: Top 10 best classifiers with corresponding feature classes in the combination experiments.

Rank | F-score (%) | Trigger feature combination (edge features fixed) | F-score (%) | Edge feature combination (trigger features fixed)
1st  | 51.34 | 1&2&3&4&5     | 52.16 | 1&2&4&5&6&7&8
2nd  | 51.21 | 1&2&3&4&5&6   | 51.81 | 1&2&4&5&6&8
3rd  | 50.71 | 1&2&4&5       | 51.29 | 1&2&5&6&7&8
4th  | 50.19 | 1&2&3&4&6     | 51.18 | 1&2&5&6&8
5th  | 49.90 | 1&2&4&5&6     | 50.33 | 1&2&4&6&7&8
6th  | 49.74 | 1&3&4&5       | 50.23 | 1&2&4&6&8
7th  | 49.44 | 1&4&5         | 49.26 | 2&4&5&6&8
8th  | 49.16 | 1&2&3&4       | 49.02 | 1&2&4&5&6
9th  | 48.82 | 1&2&4&6       | 48.32 | 2&4&5&6&7&8
10th | 47.82 | 1&2&4         | 47.42 | 2&5&6&8

Table 3: Scores of features in trigger detection.

Feature ID   | 1                | 2            | 3                    | 4               | 5                     | 6             | Theoretical maximum | Theoretical minimum
Feature name | Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature |                     |
AEE1 score   | 40.83            | 39.94        | 32.99                | 52.23           | 36.12                 | 30.09         | 53.43               | 10.27
AEE2 score   | 10.80            | 10.82        | 8.99                 | 14.16           | 9.80                  | 8.43          | 20.52               | 3.28

Table 4: Scores of features in edge detection.

Feature ID   | 1              | 2                   | 4                      | 5                  | 6                 | 7                | 8             | Theoretical maximum | Theoretical minimum
Feature name | Entity feature | Path length feature | Single element feature | Path grams feature | Path edge feature | Sentence feature | GENIA feature |                     |
AEE1 score   | 77.60          | 83.84               | 68.45                  | 74.88              | 77.76             | 56.09            | 66.35         | 107.61              | 20.08
AEE2 score   | 18.63          | 19.57               | 16.57                  | 17.51              | 17.88             | 13.40            | 15.45         | 35.80               | 5.54

corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of the feature classes is addressed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the total calculation is shown in Supplementary Material Appendix B. The final results are collected in Tables 3 and 4.

During trigger detection, the results show that the 4th feature performs best and the 6th feature performs worst. By calculating the AEEi values of the features, the best and worst feature classes in trigger detection can be plotted, as shown in Figure 5.

The comparison between the two features shows how much better the 4th feature performs than the 6th; the values 52.23 and 30.09 also correspond to the areas below the respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

3.2. Modification of Features and New Experiment Results. The combinations of features for trigger detection show that the "content feature" is the most important class in trigger detection and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the feature generation rule of the 4th feature into consideration, the "content feature" class contains four features: "upper", "has", "dt", and "tt". Specifically, "upper" identifies the upper or lower case of letters, "has" addresses the existence of a digit or hyphen, "dt" records continuous double letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, its feature generation rule can be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which consecutive four-letter sequences in the word are considered. Moreover, a modification is performed on the 6th feature class by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
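The content-feature rules above, together with the new "ft" rule, can be sketched as follows. This is an illustrative reimplementation rather than the TEES source, and the feature-name strings are our own labels:

```python
def content_features(token):
    """Micro-lexical 'content features' of a word token: letter case,
    presence of a digit or hyphen, and consecutive 2-, 3-, and 4-letter
    sequences (the 'dt', 'tt', and newly added 'ft' rules)."""
    feats = set()
    if token[:1].isupper():
        feats.add("upper")
    if any(ch.isdigit() for ch in token):
        feats.add("has_digit")
    if "-" in token:
        feats.add("has_hyphen")
    low = token.lower()
    for size, name in ((2, "dt"), (3, "tt"), (4, "ft")):
        for k in range(len(low) - size + 1):
            feats.add(name + "_" + low[k:k + size])
    return feats

print(sorted(content_features("IL-4")))
# includes 'upper', 'has_digit', 'has_hyphen', 'dt_il', 'tt_il-', 'ft_il-4'
```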

Furthermore, we compare the best performance between classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features

[Figure 5: AEE1 and AEE2 plots of the best/worst feature in trigger detection. Four panels plot the c_4j and c_6j values (y-axis, 0 to 1) against the experiment round (x-axis) for AEE1(4), AEE1(6), AEE2(4), and AEE2(6).]

(all combinations include the original edge features). We find that the best trigger feature combinations are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, with F-scores of 51.34, 52.21, 51.93, and 51.99. The new combination results are listed in Supplementary Material Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21. Generally, the best feature combination covers the major part of the feature set. An interesting phenomenon in the best combination is the absence of the 6th or 6′ feature class, which indicates that this feature class is redundant and can be dispensed with.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best and the worst features (the 2nd and the 7th) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and 7′. With the trigger features fixed at 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features. The best classifier in each combination experiment owns the feature set 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, or 1&2′&4&5&6&7′&8, respectively, and the achieved F-scores are 52.16, 52.37, 52.47, and 52.68.

In the above experiments, we tested the performance of the trigger feature classes by fixing the edge features; likewise, we tested the edge features by fixing the trigger features. We observe that the feature modifications in this phase are indeed capable of achieving improvement, with all of the best combinations performing better than the best result with the original trigger features. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and the resulting feature set achieves the highest F-score of 53.27, which is better than the best performance of 51.21 previously


[Figure 6: AEE1 and AEE2 plots of the best/worst feature in edge detection. Four panels plot the c_2j and c_7j values (y-axis, 0 to 1) against the experiment round (x-axis) for AEE1(2), AEE1(7), AEE2(2), and AEE2(7).]

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with the trigger features 1&2&3&4′&5 and the edge features 1&2′&4&5&6&7′&8, where trigger-c = 250000, edge-c = 28000, and β = 0.7. The final results are listed in Table 5.

Compared with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contributions is shown, which yields the same finding and supports the same conclusion: the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we obtain a separate F-score for each feature class. Dividing the F-score by the feature size gives a rough measure of the average contribution, which partly reflects the contribution of a feature class relative to its size. The results in Table 7 show that the average contribution of the 4th feature is 0.003, the greatest score achieved by any individual feature class.

Second, we observe all of the double combinations involving the ith feature and find that, in most cases, the best performance is reached when the ith feature is combined with the 4th feature. Even the worst performance among the double feature combinations involving the 4th feature is much better than in the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8).

Task 1                                           | Recall | Precision | F-score
Strict evaluation mode                           | 46.23  | 62.84     | 53.27
Approximate span and recursive mode              | 49.69  | 67.48     | 57.24
Event decomposition in the approximate span mode | 51.19  | 73.21     | 60.25

Task 2                                           | Recall | Precision | F-score
Strict evaluation mode                           | 44.90  | 61.11     | 51.77
Approximate span and recursive mode              | 48.41  | 65.81     | 55.79
Event decomposition in the approximate span mode | 50.52  | 73.15     | 59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2 × (Recall × Precision)/(Recall + Precision).

[Figure 7: F-score contour performance of participants in the GENIA task, evaluated under the approximate span and recursive mode. The plot shows recall against precision (both 0 to 1) for TEES in Bjorne 2009, TEES in Bjorne 2011, the other participants in the GENIA task of BioNLP 2009, the TEES 1.01 baseline, and our current system (marked with a filled triangle).]

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, the best performance is reached when the ith feature is combined with the 4th feature. As before, even the worst performance of the combinations involving the 4th feature is much better than that of the other combinations. See Table 7.

Finally, in yet another analysis, we observe the cases where the ith feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. In the numerical scores, too, the contribution value of the 4th feature is greater than that of the others, which confirms the judgement; furthermore, the features can be sorted according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis yields the same substantially positive evaluation of the 4th trigger feature as the quantitative AEE analysis. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is clumsy to use the routine analysis to examine all of the features, we expect the AEEi algorithms to be useful in more general circumstances.

4.2. Linguistic Analysis of the Chosen Trigger Features and the Importance of Micro-Lexical Features. The effectiveness of the 4th trigger feature motivated the feature class modification of inserting the "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which is similar to stems but contains more abundant information than stems. Besides, we can also analyze and ascertain the importance of other features, such as POS information, through feature combinations. Here the edge features are fixed, and only a smaller section of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" are the stem part and the remaining part of a word after applying the Porter Stemmer [29]; they are likewise generated at the micro-lexical level, similar to "dt" and "tt". The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.

4.3. Strategy Discussion: Feature Selection under a Machine Learning Strategy. Trigger features can in fact be analyzed according to their generation rules, namely, the sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature; this is a state-of-the-art strategy in feature selection. TEES is a fine machine-learning-based system which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research mainly stems from two reasons: first, the natural core idea of machine learning is simply to put enough features into the classifier as a black box; second, the performance of a classifier with a huge feature set is usually better in accuracy and F-score. Previously, the features used in TEES were mostly chosen by trial and error: by adding code to produce additional features and observing how they impact the performance of the system, TEES could always achieve a higher F-score, but without a clear direction. Such a strategy introduces useful but uncertain features and produces a very large feature set. By introducing a feature selection and evaluation method, the importance of the different feature classes can be ranked in a proper way, which helps to identify the important features to be modified effectively. Therefore, in terms of the current research regarding the development of the Turku Event Extraction System, we believe that our research


Table 6: Comparison of F-score performance in Task 1 with other systems (under the primary criterion: approximate span and recursive).

System      | Bjorne et al. 2009 [13] | Bjorne et al. 2011 [15] | Bjorne et al. 2012 [16] | TEES 1.01 | Riedel et al. 2011 [14] | Ours
F-score (%) | 51.95                   | 52.86                   | 53.15                   | 55.65     | 56.00                   | 57.24

Table 7: The best feature combinations after choosing features dynamically in trigger and edge detection.

Feature ID                   | 1                | 2            | 3                    | 4               | 5                     | 6
Feature in trigger detection | Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature
Feature size                 | 18998            | 24944        | 73744                | 8573            | 100561                | 178345

(Single-feature analysis)
F-score under one class only | 0     | 42.05 | 3.50 | 27.11 | 7.33 | 5.48
Average contribution         | 0     | 0.001 | 4e-5 | 0.003 | 7e-5 | 3e-5

(Double-feature combination analysis)
Best performance involving the ith feature  | 45.39 (+4) | 36.65 (+4) | 22.29 (+4) | 45.39 (+1) | 28.27 (+2) | 23.48 (+4)
Worst performance involving the ith feature | 0 (+2)     | 0 (+1)     | 0 (+1)     | 22.29 (+3) | 0.91 (+1)  | 3.10 (+1)

(Three-feature combination analysis)
Best performance involving the ith feature  | 49.44 (+4,5) | 47.82 (+1,4) | 45.13 (+1,4) | 49.44 (+1,5) | 49.44 (+1,4) | 45.51 (+1,4)
Worst performance involving the ith feature | 0.11 (+2,3)  | 0.11 (+1,3)  | 0.11 (+1,2)  | 20.76 (+3,6) | 1.98 (+1,3)  | 2.40 (+1,3)

(Four-feature combination analysis)
Best performance involving the ith feature  | 50.71 (+2,4,5) | 50.71 (+1,4,5) | 49.74 (+1,4,5) | 50.71 (+1,2,5) | 50.71 (+1,2,4) | 48.82 (+1,2,4)
Worst performance involving the ith feature | 5.77 (+2,3,6)  | 5.77 (+1,3,6)  | 5.77 (+1,2,6)  | 21.13 (+3,5,6) | 7.02 (+1,2,3)  | 5.77 (+1,2,3)

(Five-feature combination analysis)
Performance without the ith feature | 34.22 | 4.71 | 49.90 | 16.37 | 50.19 | 51.34

Table 8: Analysis of the average contribution of lexical features.

Feature    | Feature class   | F-score | Feature size | Average contribution
1. Nonstem | Main feature    | 6.43    | 154          | 0.04175
2. POS     | Main feature    | 1.52    | 47           | 0.03234
3. dt      | Content feature | 20.13   | 1172         | 0.01718
4. tt      | Content feature | 27.63   | 7395         | 0.00374
5. Stem    | Main feature    | 35.45   | 11016        | 0.00322

reported in this paper is helpful for further improving the system. We also believe that our strategy will serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by grants received from the General Research Fund of the University Grant Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711), City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188), the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120), and the National Natural Science Foundation


of China (Grant no. 61202305). The authors would also like to acknowledge the support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, article S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77–94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP'09 shared task on event extraction," in Proceedings of the BioNLP 2009 Workshop (NAACL), pp. 1–9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1–6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP'09 Workshop, Companion Volume for Shared Task, pp. 10–18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 51–55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541–557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP'11 shared task," BMC Bioinformatics, vol. 13, supplement 11, article S4, 2012.

[17] J. Bjorne, S. Van Landeghem, S. Pyysalo et al., "PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP 2012, pp. 82–90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL'10), pp. 28–36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 183–191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP shared task," in Association for Computational Linguistics (ACL'13), pp. 16–25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309–311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684–689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. Carlos Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM'02), pp. 306–313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. Van Landeghem, T. Abeel, Y. Saeys, and Y. Van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554–i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.


BioMed Research International 3

[Figure 1: The graphical representation of a complex biological event (refer to [5]). In the sentence "IL-4 gene regulation involves NFAT1 and NFAT2," the proteins IL-4, NFAT1, and NFAT2 are connected to Regulation nodes by Theme and Cause edges.]

precision and recall and is the harmonic mean of precision and recall:

F-score = (2 x precision x recall) / (precision + recall). (3)

The F-score evenly weighs precision and recall and forms a reliable measure to evaluate the performance of a text mining system. For more information, refer to [28].
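As an illustration, equation (3) can be checked numerically; the minimal sketch below (not part of TEES itself) uses the Task 1 strict-evaluation precision and recall reported later in Table 5:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall, as in equation (3)."""
    return 2 * precision * recall / (precision + recall)

# Task 1, strict evaluation mode (Table 5): P = 62.84%, R = 46.23%
print(round(f_score(0.6284, 0.4623), 4))  # prints 0.5327
```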

2.1.2. Scheme of TEES. The core of TEES consists of two components, that is, classification-style trigger/event detection and rich features in a graph structure. By mapping tokenized words and entities to nodes and mapping event relations to edges between entities, TEES regards event extraction as a task of recognizing graph nodes and edges, as shown in Figure 1.

Generally, TEES first converts node recognition into a problem of 10-class classification, which corresponds to the 9 events defined in the shared task plus another class for negative cases. This procedure is defined as trigger detection. Thereafter, edge detection is defined as the recognition of the concrete relationships between entities, including the semantic direction and the theme/cause relation.

As in Figure 1, the word "IL-4" is assigned to the class "protein," while "involves" is assigned to the class "regulation." The edge between "IL-4" and "regulation" is labelled as "theme." Hence we obtain a simple regulation event, namely, "IL-4 regulation," which means that "IL-4" is the theme of "regulation." Similarly, the representation in Figure 1 indicates that the simple event "IL-4 regulation" is the theme of "involves" and the protein "NFAT1" is the cause of "involves," from which a complex regulation event can be extracted; that is, "IL-4 regulation" regulates protein "NFAT1." For more detailed information, refer to Bjorne's work [5] and the TEES website: http://bionlp.utu.fi/eventextractionsoftware.html.

The data set used in TEES consists of four files in the GENIA corpus [3]. Each file is given in a stand-off XML format with split sentences, annotations of known proteins, parts of speech (POS), and syntactic dependency tree information. One is train123.xml, which contains 800 abstracts with 7,482 sentences; another is devel123.xml, which contains 150 abstracts with 1,450 sentences. The file everything123.xml is the sum of the two files above, and test.xml is the test data set, which contains 950 abstracts with 8,932 sentences.

The scheme of the TEES system consists of three phases. First, linguistic features are generated in the feature generation phase. Second, in the training phase, train123.xml is used as the training data set, devel123.xml is used as the development set, and the optimum parameters are obtained. Then, in the third phase, using everything123.xml (the sum of train123.xml and devel123.xml) as the training set, test.xml is used as the unknown data set for event prediction. Events are extracted from this unknown set and the accuracy is subsequently evaluated. See Figure 2.

Mainly, there are two parameters in the grid search at the training stage. The first parameter is C in the polynomial kernel function of the support vector machine. Second, in order to set a proper precision-recall trade-off, a parameter β (β > 0) is introduced in trigger detection. For the "no trigger" class, the given classification score is multiplied by β, so as to increase the possibility of tokens falling into a "trigger" class. By default, β is set to 0.6, and it is also included in the grid search to obtain an optimum value.
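The role of β can be sketched as follows; the class names and scores below are hypothetical, and only the scaling of the negative ("no trigger") class mirrors the description above:

```python
def predict_trigger_class(scores, beta=0.6):
    """Pick the highest-scoring class after multiplying the negative
    ('no trigger') class score by beta; beta < 1 favours trigger classes."""
    adjusted = dict(scores)
    adjusted["neg"] *= beta
    return max(adjusted, key=adjusted.get)

# Hypothetical SVM scores for a single token
scores = {"neg": 1.0, "Gene_expression": 0.7, "Binding": 0.2}
print(predict_trigger_class(scores, beta=0.6))  # the token becomes a trigger
```

With β = 1.0 the token would stay in the negative class; β = 0.6 tips borderline tokens towards trigger classes, trading precision for recall.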

2.2. Definition of the Feature Generation Rule of the Turku System. The features used for trigger detection are designed in a rational way, and abundant features are generated from the training data set.

In terms of the training data set in the GENIA format, the dependency relations of each sentence are output by the Stanford parser [29], which addresses the syntactic relations between word tokens and thus converts the sentence to a graph, where a node denotes a word token and an edge corresponds to the grammatical relation between two tokens. By doing this, a directed acyclic graph (DAG) is constructed based on the dependency parse tree, and the shortest dependency path (SDP) is located.
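The SDP can be located with a breadth-first search over the dependency graph; the toy graph below for the sentence of Figure 1 is illustrative only (the actual edges come from the Stanford parser):

```python
from collections import deque

# Toy (undirected) dependency graph for "IL-4 gene regulation involves NFAT1";
# the edges are illustrative, not real parser output.
graph = {
    "involves": ["regulation", "NFAT1"],
    "regulation": ["involves", "gene", "IL-4"],
    "gene": ["regulation"],
    "IL-4": ["regulation"],
    "NFAT1": ["involves"],
}

def shortest_dependency_path(graph, start, goal):
    """Breadth-first search for the shortest path between two tokens."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_dependency_path(graph, "IL-4", "NFAT1"))
```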

For a targeted word in a sentence, which represents a node in the graph, the purpose of trigger detection is to recognize the event type of the word. Meanwhile, edge detection recognizes the theme/cause type between two entities. Therefore, both detections can be considered as multiclass classification.

The features produced in trigger detection are categorized into six classes, as listed in Figure 3, and the whole feature generation rule is listed in Supplementary Appendix A in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/205239. First, for a sentence in which the target word occurs, token information is counted for the whole sentence as a bag of words (BOW). This feature class is defined as the "sentence feature." The second feature class is the "main feature" of the target word, including part of speech (POS) and stem information output by the Porter Stemmer [29]. The third class, the "linear order feature," focuses on information about the neighboring word tokens in the natural order of the sentence. TEES also maintains a fourth class with the microlexical information of the word token, for example, the upper or lower case of the word, the existence of a digit, and double letters or three letters in the word. These features constitute the "content feature" class, while the "attached edge feature" class focuses on information about the neighboring word tokens of the target word in the SDP. Finally, there is a sixth "chain feature" class, which focuses on information about the whole SDP instead.

Here the features are well structured from the macro- and microperspectives of the sentence. Basically, the feature

[Figure 2: Data set and pipeline of TEES, cited from the TEES toolkit (TurkuEventExtractionSystem readme.pdf, p. 7, in the zip file http://bionlp.utu.fi/static/event-extractor/TurkuEventExtractionSystem-1.0.zip). First phase: linguistic feature generation. Second phase: training (BioNLP09Train.py) on train123.xml, with devel123.xml used to select the parameters. Third phase: testing, where the SVM model trained on everything123.xml predicts events on test.xml for Tasks 1 and 2.]

[Figure 3: Feature classes in trigger detection and edge detection (feature counts in parentheses). Trigger feature classes: sentence feature (18998), main feature (24944), linear order feature (73744), content feature (8573), attached edge feature (100561), and chain feature (178345). Edge feature classes: entity feature (4313), path length feature (25), terminus token feature (4144), single element feature (2593), path grams feature (441684), path edge feature (25038), sentence feature (33), and GENIA feature (4).]

generation rules of the "main feature" and "content feature" mostly rely on microlevel information about the word token itself, while the "sentence feature" and "linear order feature" rely on macrolevel information about the sentence. Differently, the "attached edge" and "chain feature" classes rely on the dependency tree and graph information, especially the SDP.

Similarly, the features used for edge detection can be classified into 8 classes, namely, the entity feature, path length feature, terminus token feature, single element feature, path grams feature, path edge feature, sentence feature, and GENIA feature. Among the features above, the third feature is omitted, since it is considered part of the first feature.

2.3. Feature Evaluation Method: Accumulated Effect Evaluation (AEE) Algorithm. A quantitative method is designed to evaluate the importance of feature classes. Here the feature combination methods are ranked by F-score, and the


Step 1. Categorize the features into n feature classes; there are then 2^n - 1 possible feature combinations. Denote N = 2^n - 1 and run all of the classification tests; the F-score is recorded for each experiment.

Step 2. Sort the N feature combinations by F-score in descending order.

Step 3. FOR i = 1 to n, j = 1 to N: O_ij = 0, c_ij = 0. END FOR
(O_ij counts the occurrences of the ith feature among the top j experiments; c_ij evaluates the partial importance of the ith feature after j experiments.)

Step 4. FOR j = 1 to N (for each feature combination in the jth experiment)
    FOR i = 1 to n (for each feature)
        IF the ith feature occurs in the jth feature combination: O_ij++
        (for each experiment, the occurrences of the ith feature are accumulated)
        END IF
        c_ij = O_ij / j
        (Dividing O_ij by j gives a numerical value for the importance of the ith feature. Since the combinations are ranked by F-score in descending order, a top-ranked combination corresponds to a smaller j and hence a bigger c_ij.)
    END FOR
END FOR

Step 5. FOR i = 1 to n: AEE1(i) = sum_{j=1}^{N} c_ij. END FOR
(Summing c_ij over j for a fixed i gives the accumulated effect of the ith feature.)

Step 6. AEE1(i) is the objective function of the greedy search algorithm; sort AEE1(i) to find the most important features.

Algorithm 1: The AEE1 algorithm.

occurrence of the ith feature class in the top j combinations (where j runs from the top to the last) is counted, the rate of occurrence is computed, and finally the sum of the occurrences is calculated. The total calculation is shown in Supplementary Material Appendix C. The algorithm is denoted as the AEE1 algorithm, as shown in Algorithm 1.
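Algorithm 1 can be sketched in a few lines of Python; feature classes are numbered 1..n, the input list is assumed to be already sorted by descending F-score, and this is a simplified re-implementation rather than the authors' code:

```python
def aee1(ranked_combinations, n):
    """Accumulated effect evaluation (AEE1).

    ranked_combinations: sets of feature-class IDs (1..n), sorted by
    descending F-score. Returns one AEE1 score per feature class.
    """
    scores = [0.0] * n
    occurrences = [0] * n            # O_ij, carried along the ranked list
    for j, combo in enumerate(ranked_combinations, start=1):
        for i in range(n):
            if i + 1 in combo:
                occurrences[i] += 1  # occurrences of feature i among top j
            scores[i] += occurrences[i] / j  # c_ij = O_ij / j, accumulated
    return scores

print(aee1([{1, 2}, {1}, {2}], 2))  # the two-class worked example below
```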

Here AEE1(i) is the contribution value of the ith feature and also the objective function of the greedy search algorithm, which reflects the accumulated effect of the ith feature among the top classifiers. Since AEE1(i) makes sense in a comparative way, the theoretical maximum and minimum of AEE1(i) are calculated by

Max AEE1 = 1 x 2^(n-1) + 2^(n-1)/(2^(n-1) + 1) + 2^(n-1)/(2^(n-1) + 2) + ... + 2^(n-1)/(2^n - 1),

Min AEE1 = 0 x (2^(n-1) - 1) + 1/2^(n-1) + 2/(2^(n-1) + 1) + ... + 2^(n-1)/(2^n - 1). (4)
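As a sanity check, the bounds in equation (4) can be evaluated numerically; for the n = 6 trigger feature classes they reproduce the theoretical maximum and minimum reported in Table 3 (about 53.43 and 10.27):

```python
def aee1_bounds(n):
    """Theoretical max/min of AEE1 for n feature classes (equation (4)).
    Each feature occurs in 2^(n-1) of the N = 2^n - 1 combinations."""
    half, N = 2 ** (n - 1), 2 ** n - 1
    # Max: the feature's combinations occupy the top half ranks (c_ij = 1),
    # then O_ij stays at 2^(n-1) while the rank j keeps growing.
    mx = half + sum(half / j for j in range(half + 1, N + 1))
    # Min: the feature's combinations occupy the bottom ranks
    # (c_ij = 0 for the first 2^(n-1) - 1 ranks, then O_ij grows by 1 per rank).
    mn = sum(k / (half - 1 + k) for k in range(1, half + 1))
    return mx, mn

print(aee1_bounds(6))
```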

The idea of AEE1 comes from the understanding that the top classifiers with higher F-scores include more efficient and important features. However, this consideration disregards the feature size of those top classifiers.

For better understanding, a simple case is considered. Assume there are two feature classes that could be used in the classifier, so there are three feature combinations for the classifier, namely, 1, 2, and 1&2. Without loss of generality, we assume that the best classifier uses the feature classes 1&2, the second one uses feature class 1, and the worst classifier uses feature class 2. Here we denote the rank list 1&2, 1, and 2 as Rank Result A. According to the third column in Table 1(a), the AEE value is calculated as AEE1(1) = 1/1 + 2/2 + 2/3 = 2.667.

As another example, we assume a rank list 1, 1&2, and 2, which is denoted as Rank Result B and shown in Table 1(b). As can be seen from Table 1(b), AEE1(1) equals 2.667, the same as in Table 1(a). However, this is not a reasonable result, since feature 1 is more significant in Rank Result A than in Rank Result B if the feature class size is considered.

Therefore, an alternative algorithm, AEE2, is proposed by updating c_ij = O_ij / j in Step 4 of AEE1 as c_ij = O_ij / sum_{i=1}^{n} O_ij. Here the size of the feature class is considered in the computation of c_ij, and a classifier with higher performance and a smaller feature size will assign a higher score to the features it owns. As an example, column 4 in Tables 1(a) and 1(b) shows that AEE2(1) = 1.667 in Rank Result A and AEE2(1) = 2.167 in Rank Result B. The result ensures the comparative advantage for feature class 1 in Rank Result A. Similarly, AEE2(i) mainly makes sense in a comparative way, with theoretical maximum and minimum values that are computed similarly.
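The AEE2 variant changes only the normalisation; the sketch below (again a simplified re-implementation) reproduces the Table 1 values for both rank results:

```python
def aee2(ranked_combinations, n):
    """AEE2: c_ij = O_ij / (sum over all features of O_ij), so a
    high-ranked combination with fewer features earns larger credit."""
    scores = [0.0] * n
    occurrences = [0] * n
    for combo in ranked_combinations:
        for i in range(n):
            if i + 1 in combo:
                occurrences[i] += 1
        total = sum(occurrences)     # denominator replaces AEE1's rank j
        for i in range(n):
            scores[i] += occurrences[i] / total
    return scores

print(aee2([{1, 2}, {1}, {2}], 2))  # Rank Result A
print(aee2([{1}, {1, 2}, {2}], 2))  # Rank Result B
```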


Table 1: Examples for the AEEi algorithms.

(a) AEEi result for Rank Result A

Rank Result A        O_1j  O_2j    AEE1 c_1j  c_2j    AEE2 c_1j  c_2j
j = 1   1&2          1     1       1/1        1/1     1/2        1/2
j = 2   1            2     1       2/2        1/2     2/3        1/3
j = 3   2            2     2       2/3        2/3     2/4        2/4
AEE1(i): 2.667, 2.167                  AEE2(i): 1.667, 1.333

(b) AEEi result for Rank Result B

Rank Result B        O_1j  O_2j    AEE1 c_1j  c_2j    AEE2 c_1j  c_2j
j = 1   1            1     0       1/1        0/1     1/1        0/1
j = 2   1&2          2     1       2/2        1/2     2/3        1/3
j = 3   2            2     2       2/3        2/3     2/4        2/4
AEE1(i): 2.667, 1.167                  AEE2(i): 2.167, 0.833

[Figure 4: Flowchart of the research. Input data with the chosen feature classes under all combinations are fed into two loops: fixing the features in edge detection while trying trigger feature combinations, and fixing the features in trigger detection while trying edge feature combinations. The classifier performance is tested in the Turku system and the contribution of the feature classes is evaluated. If the result is not yet the best, important features are modified and replaced; otherwise the feature set is output and the new system is built.]

Considering that AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table 1(a), the importance of feature class 1 prevails over that of feature class 2 in both cases. This ensures a reasonable consistency. Actually, both the AEE1 and AEE2 algorithms ensure a correct importance ranking among different feature classes, and a hybrid of the two methods is used in the experiments.

2.4. Flowchart of the Research. Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection, so that important feature classes are identified by evaluating the contribution of the individual features.

Accordingly, code is written to enhance vital features. Thereafter, the classifiers with the newly updated features are tested, and better performance is presumed. Compared with the previous TEES system, our main contribution is to import a feature selection strategy in Phase 1, that is, "linguistic feature selection" as shown in Figure 2. The flowchart of our research is shown in Figure 4.

3. Results

3.1. Evaluation of Individual Feature Classes by the Quantitative Method. Based on the AEE algorithm, all of the combinations of the feature classes are used in classifiers, and their


Table 2: Top 10 best classifiers with the corresponding feature classes in the combination experiments.

Rank   F-score (%)   Trigger feature combination     F-score (%)   Edge feature combination
                     (edge features fixed)                         (trigger features fixed)
1st    51.34         1&2&3&4&5                       52.16         1&2&4&5&6&7&8
2nd    51.21         1&2&3&4&5&6                     51.81         1&2&4&5&6&8
3rd    50.71         1&2&4&5                         51.29         1&2&5&6&7&8
4th    50.19         1&2&3&4&6                       51.18         1&2&5&6&8
5th    49.90         1&2&4&5&6                       50.33         1&2&4&6&7&8
6th    49.74         1&3&4&5                         50.23         1&2&4&6&8
7th    49.44         1&4&5                           49.26         2&4&5&6&8
8th    49.16         1&2&3&4                         49.02         1&2&4&5&6
9th    48.82         1&2&4&6                         48.32         2&4&5&6&7&8
10th   47.82         1&2&4                           47.42         2&5&6&8

Table 3: Scores of the feature classes in trigger detection.

Feature ID     1                 2             3                     4                5                      6
Feature name   Sentence feature  Main feature  Linear order feature  Content feature  Attached edge feature  Chain feature
AEE1 score     40.83             39.94         32.99                 52.23            36.12                  30.09   (theoretical maximum 53.43, minimum 10.27)
AEE2 score     10.80             10.82         8.99                  14.16            9.80                   8.43    (theoretical maximum 20.52, minimum 3.28)

Table 4: Scores of the feature classes in edge detection.

Feature ID     1               2                    4                       5                   6                  7                 8
Feature name   Entity feature  Path length feature  Single element feature  Path grams feature  Path edge feature  Sentence feature  GENIA feature
AEE1 score     77.60           83.84                68.45                   74.88               77.76              56.09             66.35   (theoretical maximum 107.61, minimum 20.08)
AEE2 score     18.63           19.57                16.57                   17.51               17.88              13.40             15.45   (theoretical maximum 35.80, minimum 5.54)

corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of the feature classes is addressed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the total calculation is shown in Supplementary Material Appendix B. The final results are collected in Tables 3 and 4.

During trigger detection, the results show that the 4th feature class performs best and the 6th performs worst. The comparison between the two features shows how much better the 4th feature performs than the 6th, and the AEE1 scores 52.23 and 30.09 also correspond to the areas below their curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

3.2. Modification of Features and New Experiment Results. The combinations of features for trigger detection show that the "content feature" is the most important one and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the feature generation rule of the 4th feature class into consideration, the "content feature" class contains four features: "upper," "has," "dt," and "tt." Specifically, "upper" identifies the upper or lower case of letters, "has" addresses the existence of a digit or hyphen, "dt" records continuous double letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, the feature generation rule could be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which consecutive four-letter substrings of the word are considered. Moreover, a modification is performed on the 6th feature class by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
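The strengthened content-feature rule can be sketched as follows; the feature identifiers (e.g. "ft_...") are illustrative rather than TEES's actual names:

```python
def content_features(token):
    """Microlexical 'content features' of a token: case, digit/hyphen,
    and character 2-grams ('dt'), 3-grams ('tt'), plus the new 4-grams ('ft')."""
    low = token.lower()
    feats = {
        "upper": token[:1].isupper(),                              # upper case?
        "has": any(ch.isdigit() for ch in token) or "-" in token,  # digit/hyphen?
    }
    for name, size in (("dt", 2), ("tt", 3), ("ft", 4)):
        for i in range(len(low) - size + 1):
            feats["%s_%s" % (name, low[i:i + size])] = 1
    return feats

print(sorted(k for k in content_features("IL-4") if k.startswith("ft")))
```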

Furthermore, if we compare the best performance between classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features

[Figure 5: AEE1 and AEE2 plots of the best (4th) and worst (6th) features in trigger detection: the c_4j and c_6j values are plotted against the experiment round.]

(all including the original edge features), we find that the best trigger feature combinations are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, with F-scores of 51.34, 52.21, 51.93, and 51.99. The new combination results are listed in Supplementary Material Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21. Generally, the best feature combination covers the major part of the feature sets. An interesting phenomenon in the best combination is the absence of the 6th or 6′ feature class, which indicates that this feature class is redundant and can be dispensed with.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best and the worst feature classes (the 2nd and the 7th) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and 7′. With the fixed trigger features 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features. We obtain the best classifier in each combination experiment. The best classifiers own the feature sets 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, and 1&2′&4&5&6&7′&8, respectively, and the achieved F-scores are 52.16, 52.37, 52.47, and 52.68.

In the above experiments, we test the performance of the trigger feature classes by fixing the edge features. Likewise, we test edge features by fixing trigger features. We observe that the feature modifications in this phase are indeed capable of achieving improvement, where all of the best combinations perform better than the result of the best trigger. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and eventually the best combination of feature sets achieves the highest score of 53.27, which is better than the best performance of 51.21 previously

[Figure 6: AEE1 and AEE2 plots of the best (2nd) and worst (7th) features in edge detection: the c_2j and c_7j values are plotted against the experiment round.]

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with trigger features 1&2&3&4′&5 and edge features 1&2′&4&5&6&7′&8, where trigger-c = 250000, edge-c = 28000, and β = 0.7. The final result is listed in Table 5.

Compared with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contribution is shown, which yields the same finding and supports the same conclusion that the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class in the classifier, we can get separate F-scores based on these features. Dividing the F-score by the feature size, we calculate the average contribution roughly. This value partly reflects the contribution of the feature classes in terms of class size. The results in Table 7 show that the average contribution of the 4th feature class is 0.003, which is the greatest score achieved by an individual feature class.

Second, we observe all of the double combinations involving the ith feature and find that, in most cases, when the ith feature is combined with the 4th feature, it reaches the best performance score. Even the worst performance of the double feature combinations involving the 4th feature is much better than the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8).

Task 1                                             Recall   Precision   F-score
Strict evaluation mode                             46.23    62.84       53.27
Approximate span and recursive mode                49.69    67.48       57.24
Event decomposition in the approximate span mode   51.19    73.21       60.25

Task 2                                             Recall   Precision   F-score
Strict evaluation mode                             44.90    61.11       51.77
Approximate span and recursive mode                48.41    65.81       55.79
Event decomposition in the approximate span mode   50.52    73.15       59.76

Recall = TP/(TP + FN); Precision = TP/(TP + FP); F-score = 2 x (Recall x Precision)/(Recall + Precision).

[Figure 7: F-score contour performance of participants in the GENIA task (precision versus recall), evaluated under the approximate span and recursive mode. The plot shows our current system (marked with a full triangular label), the TEES 1.01 baseline, TEES in Bjorne 2009 and Bjorne 2011, and the other participants in the GENIA task of BioNLP 2009.]

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, when the ith feature is combined with the 4th feature, it reaches the best performance. As before, even the worst performance of the combinations involving the 4th feature is much better than the other combinations. See Table 7.

Finally, in yet another analysis, we observe the cases where the ith feature is cancelled out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence arguing in support of the importance of the 4th feature in trigger detection. Compared with the results in numerical scores, the contribution value of the 4th feature is greater than the others, which confirms the judgment. Furthermore, we can sort these features according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis shows a substantial positive evaluation of the 4th trigger feature, which is confirmed by the results of the quantitative analysis of the AEE algorithm. This shows a consistent tendency of feature importance, which in turn proves the reliability of the AEE algorithm. Since it is clumsy to use routine analysis to analyze all of the features, we expect that the AEEi algorithms make sense in generalized circumstances.

4.2. Linguistic Analysis of Chosen Trigger Features and the Importance of Microlexical Features. The effectiveness of the 4th trigger feature motivated the feature class modification by inserting "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which is similar to stems but contains more abundant information than stems. Besides, we can also analyze and ascertain the importance of other features, like POS information, through feature combinations. Here the edge features are fixed and only a smaller section of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" are the stem part and the remainder left after applying the Porter Stemmer [28]. They are also generated at the microlexical level, similar to "dt" and "tt." The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.

4.3. Strategy Discussion in Feature Selection under a Machine Learning Strategy. As a matter of fact, trigger features can be analyzed according to their generation rules, namely, the sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature. This is a state-of-the-art strategy in feature selection. TEES is a fine machine-learning-based system, which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research mainly stems from two reasons. The first is that the natural core idea of machine learning is simply to put enough features into the classifier as a black box; the second is that the performance of a classifier with a huge set of features is usually better in accuracy and F-score. Previously, the features used in TEES were mostly chosen by trial and error. By adding some code to produce additional features and seeing how they impact the performance of the system, TEES always achieves a higher F-score but without a clear direction. This strategy introduces useful but uncertain features and produces a large number of features. By introducing a feature selection and evaluation method, the importance of different feature classes can be ranked in a proper way, which helps to identify the important features to be modified effectively. Therefore, in terms of the current research regarding the development of the Turku Event Extraction System, we believe that our research

BioMed Research International 11

Table 6: Comparison of F-score performance in Task 1 with other systems (under the primary criterion: approximate span and recursive).

System: Bjorne et al. 2009 [13] | Bjorne et al. 2011 [15] | Bjorne et al. 2012 [16] | TEES101 | Riedel et al. 2011 [14] | Ours
F-score (%): 51.95 | 52.86 | 53.15 | 55.65 | 56.00 | 57.24

Table 7: The best feature combination after choosing features dynamically in trigger and edge detection.

Feature in trigger: 1 Sentence feature | 2 Main feature | 3 Linear order feature | 4 Content feature | 5 Attached edge feature | 6 Chain feature
Feature size: 18998 | 24944 | 73744 | 8573 | 100561 | 178345

(Merely one feature analysis)
F-score under one class: 0 | 42.05 | 3.50 | 27.11 | 7.33 | 5.48
Average contribution: 0 | 0.001 | 4e−5 | 0.003 | 7e−5 | 3e−5

(Double feature combination analysis)
Best performance: (+4) | (+4) | (+4) | (+1) | (+2) | (+4)
F-score involving ith feature: 45.39 | 36.65 | 22.29 | 45.39 | 28.27 | 23.48
Worst performance: (+2) | (+1) | (+1) | (+3) | (+1) | (+1)
F-score involving ith feature: 0 | 0 | 0 | 22.29 | 0.91 | 3.10

(Three-feature combination analysis)
Best performance: (+4,5) | (+1,4) | (+1,4) | (+1,5) | (+1,4) | (+1,4)
F-score involving ith feature: 49.44 | 47.82 | 45.13 | 49.44 | 49.44 | 45.51
Worst performance: (+2,3) | (+1,3) | (+1,2) | (+3,6) | (+1,3) | (+1,3)
F-score involving ith feature: 0.11 | 0.11 | 0.11 | 20.76 | 1.98 | 2.40

(Four-feature combination analysis)
Best performance: (+2,4,5) | (+1,4,5) | (+1,4,5) | (+1,2,5) | (+1,2,4) | (+1,2,4)
F-score involving ith feature: 50.71 | 50.71 | 49.74 | 50.71 | 50.71 | 48.82
Worst performance: (+2,3,6) | (+1,3,6) | (+1,2,6) | (+3,5,6) | (+1,2,3) | (+1,2,3)
F-score involving ith feature: 5.77 | 5.77 | 5.77 | 21.13 | 7.02 | 5.77

(Five-feature combination analysis)
Performance without ith feature: 34.22 | 47.1 | 49.90 | 16.37 | 50.19 | 51.34

Table 8: Analysis of the average contribution of lexical features.

No. | Feature | Feature class | F-score | Feature size | Average contribution
1 | Nonstem | Main feature | 6.43 | 154 | 0.04175
2 | POS | Main feature | 1.52 | 47 | 0.03234
3 | dt | Content feature | 20.13 | 1172 | 0.01718
4 | tt | Content feature | 27.63 | 7395 | 0.00374
5 | Stem | Main feature | 35.45 | 11016 | 0.00322

reported in this paper is helpful for further improving this system. We also believe that our strategy can serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by a grant received from the General Research Fund of the University Grant Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711), City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188), the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120), and the National Natural Science Foundation of China (Grant no. 61202305). The authors would also like to acknowledge support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug–drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, suppl. 2, p. S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77–94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the NAACL 2009 Workshop on Natural Language Processing in Biomedicine (BioNLP), pp. 1–9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1–6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10–18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP," in Proceedings of the BioNLP Shared Task Workshop, pp. 51–55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541–557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 Shared Task," BMC Bioinformatics, vol. 13, suppl. 11, p. S4, 2012.

[17] J. Bjorne, S. V. Landeghem, S. Pyysalo et al., "Scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82–90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28–36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183–191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP Shared Task," in Association for Computational Linguistics (ACL '13), pp. 16–25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309–311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684–689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. C. Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306–313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. van Landeghem, T. Abeel, Y. Saeys, and Y. van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554–i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.




[Figure 2 shows the data set and three-phase pipeline of TEES: first phase, linguistic feature generation (from train123.xml, devel123.xml, everything123.xml, and test.xml); second phase, training (BioNLP09Train.py, producing parameters and an SVM model); third phase, testing (Tasks 1 and 2).]

Figure 2: Data set and pipeline of TEES, cited from the TEES toolkit (TurkuEventExtractionSystem readme.pdf, p. 7, in the zip file http://bionlp.utu.fi/static/event-extractor/TurkuEventExtractionSystem-1.0.zip).

[Figure 3 lists the feature classes with their sizes. Trigger features class: sentence feature (18998), main feature (24944), linear order feature (73744), content feature (8573), attached edge feature (100561), and chain feature (178345). Edge features class: entity feature (4313), path length feature (25), terminus token feature (4144), single element feature (2593), path grams feature (441684), path edge feature (25038), sentence feature (33), and GENIA feature (4).]

Figure 3: Feature classes in trigger detection and edge detection.

The generation rules of the "main feature" and "content feature" mostly rely on microlevel information about the word token itself, while the "sentence feature" and "linear order feature" rely on macrolevel information about the sentence. Differently, the "attached edge" and "chain" features rely on the dependency tree and graph information, especially the SDP.

Similarly, features used for edge detection can be classified into 8 classes, namely, entity feature, path length feature, terminus token feature, single element feature, path grams feature, path edge feature, sentence feature, and GENIA feature. Among these, the third (terminus token) feature is omitted, since it is considered part of the first (entity) feature.

2.3. Feature Evaluation Method: Accumulated Effect Evaluation (AEE) Algorithm. A quantitative method is designed to evaluate the importance of feature classes. Here the feature combination methods are ranked by F-score, and the


Step 1. Categorize the features into n feature classes; there are then 2^n − 1 possibilities for all feature combinations. Denote N = 2^n − 1, run all of the classification tests, and record the F-score of each experiment.
Step 2. Sort the N feature combinations by F-score in descending order.
Step 3. FOR i = 1 to n, j = 1 to N: O_ij = 0; c_ij = 0; END FOR
    (O_ij counts the occurrences of the ith feature among the top j experiments; c_ij evaluates the partial importance of the ith feature after j experiments.)
Step 4. FOR j = 1 to N (for each feature combination, in ranked order)
            FOR i = 1 to n (for each feature)
                IF the ith feature occurs in the jth feature combination THEN O_ij++
                (for each experiment, the occurrences of the ith feature are accumulated)
                c_ij = O_ij / j
                (dividing O_ij by j gives a numerical value evaluating the importance of the ith feature; since the combinations are ranked by F-score in descending order, a top-ranked combination corresponds to a smaller j and hence a bigger c_ij)
            END FOR
        END FOR
Step 5. FOR i = 1 to n: AEE1(i) = sum_{j=1}^{N} c_ij; END FOR
    (summing c_ij over j for a fixed i gives the accumulated effect of the ith feature)
Step 6. AEE1(i) is the objective function of the greedy search algorithm; sort the AEE1(i) values to find the most important feature.

Algorithm 1: AEE1 algorithm.

occurrence of the ith feature class in the top j combinations (j runs from the top to the last) is counted, the rate of occurrence is computed, and finally the sum of the occurrences is calculated. The total calculation is shown in Supplementary Material Appendix C. The algorithm is denoted as the AEE1 algorithm, as shown in Algorithm 1.
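A minimal Python sketch of the AEE1 computation, assuming each experiment is given as a set of feature-class indices together with its F-score (the function name and input format are our own):

```python
def aee1(results, n):
    """AEE1: accumulated effect of each of n feature classes.

    results: list of (feature_class_index_set, f_score) pairs,
    one per combination experiment.
    """
    # Step 2: sort the N combinations by F-score, descending.
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    scores = {}
    for i in range(1, n + 1):
        occurrences, acc = 0, 0.0          # O_ij and the running sum of c_ij
        for j, (combo, _) in enumerate(ranked, start=1):
            if i in combo:                 # Step 4: occurrences in the top j
                occurrences += 1
            acc += occurrences / j         # c_ij = O_ij / j
        scores[i] = acc                    # Step 5: AEE1(i) = sum_j c_ij
    return scores
```

For the Rank Result A example of Table 1 (combinations {1,2}, {1}, {2} ranked in that order), this returns AEE1(1) ≈ 2.667 and AEE1(2) ≈ 2.167, matching Table 1(a).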

Here AEE1(i) is the contribution value of the ith feature and also the objective function of the greedy search algorithm, which reflects the accumulated effect of the ith feature among the top classifiers. Since AEE1(i) makes sense in a comparative way, the theoretical maximum and minimum of AEE1(i) are calculated by

Max AEE1 = 1 × 2^(n−1) + 2^(n−1)/(2^(n−1) + 1) + 2^(n−1)/(2^(n−1) + 2) + ⋯ + 2^(n−1)/(2^n − 1),

Min AEE1 = 0 × (2^(n−1) − 1) + 1/2^(n−1) + 2/(2^(n−1) + 1) + ⋯ + 2^(n−1)/(2^n − 1).    (4)

The idea of AEE1 comes from the understanding that the top classifiers with higher F-scores include more efficient and important features. However, this consideration disregards the feature size of those top classifiers.

For better understanding, a simple case is considered. Assume there are two feature classes that could be used in the classifier, so there are three feature combinations for the classifier, namely, 1, 2, and 1&2. Without loss of generality, we assume that the best classifier uses feature classes 1&2, the second one uses feature class 1, and the worst classifier uses feature class 2. We denote the rank list 1&2, 1, 2 as Rank Result A. According to the third column in Table 1(a), the AEE value is calculated as AEE1(1) = 1/1 + 2/2 + 2/3 = 2.667.

As another example, we assume a rank list 1, 1&2, 2, which is denoted as Rank Result B and shown in Table 1(b). As can be seen from Table 1(b), AEE1(1) also equals 2.667, the same as in Table 1(a). However, this is not a reasonable result, since feature 1 is more significant in Rank Result B than in Rank Result A if the feature class size is considered.

Therefore, an alternative algorithm, AEE2, is proposed by updating c_ij = O_ij/j in Step 4 of AEE1 to c_ij = O_ij / sum_{i=1}^{n} O_ij. Here the size of the feature combination is considered in the computation of c_ij, and a classifier with higher performance and a smaller feature set assigns a higher score to the features it owns. As an example, column 4 in Tables 1(a) and 1(b) shows that AEE2(1) = 1.667 in Rank Result A and AEE2(1) = 2.167 in Rank Result B. This result ensures the comparative advantage of feature class 1 in Rank Result B. Similarly, AEE2(i) mainly makes sense in a comparative way, with theoretical maximum and minimum values that are computed analogously.


Table 1: Examples for the AEEi algorithms.

(a) AEEi result for Rank Result A

Rank Result A | O_ij (i = 1, i = 2) | AEE1 c_ij (i = 1, i = 2) | AEE2 c_ij (i = 1, i = 2)
j = 1: 1&2 | 1, 1 | 1/1, 1/1 | 1/2, 1/2
j = 2: 1 | 2, 1 | 2/2, 1/2 | 2/3, 1/3
j = 3: 2 | 2, 2 | 2/3, 2/3 | 2/4, 2/4
AEE1(i): 2.667, 2.167 | AEE2(i): 1.667, 1.333

(b) AEEi result for Rank Result B

Rank Result B | O_ij (i = 1, i = 2) | AEE1 c_ij (i = 1, i = 2) | AEE2 c_ij (i = 1, i = 2)
j = 1: 1 | 1, 0 | 1/1, 0/1 | 1/1, 0/1
j = 2: 1&2 | 2, 1 | 2/2, 1/2 | 2/3, 1/3
j = 3: 2 | 2, 2 | 2/3, 2/3 | 2/4, 2/4
AEE1(i): 2.667, 1.167 | AEE2(i): 2.167, 0.833

[Figure 4 outlines the research flow: input data with chosen feature classes under all combinations → fix the features in edge detection and try trigger feature combinations; fix the features in trigger detection and try edge feature combinations → test the classifier performance in the Turku system → evaluate the contribution of the feature classes → if not yet best, modify and replace the important features and repeat; if best, output the feature set and build the new system.]

Figure 4: Flowchart of the research.

Considering that AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table 1(a), the importance of feature class 1 prevails over that of feature class 2 in both cases. This ensures a reasonable consistency. Actually, both the AEE1 and AEE2 algorithms ensure a correct importance ranking among different feature classes, and a hybrid of the two methods is used in the experiments.

2.4. Flowchart of the Research. Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection, so that important feature classes are identified by evaluating the contribution of the individual features.

Accordingly, code is written to enhance the vital features. Thereafter, the classifiers with the newly updated features are tested, and better performance is expected. Compared with the previous TEES system, our main contribution is to introduce a feature selection strategy in Phase 1, that is, "linguistic feature selection" as shown in Figure 2. The flowchart of our research is shown in Figure 4.
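The "try all feature combinations" step of Figure 4 can be sketched as follows, where `evaluate` is an assumed callback that wraps TEES training and testing for one feature combination and returns its F-score (names and format ours):

```python
from itertools import combinations

def run_phase(evaluate, n):
    """One selection phase of the flowchart: with the other detector's features
    fixed, score every nonempty combination of the n feature classes and rank
    the 2^n - 1 results by F-score, descending."""
    results = []
    for r in range(1, n + 1):
        for combo in combinations(range(1, n + 1), r):
            fs = frozenset(combo)
            results.append((fs, evaluate(fs)))
    return sorted(results, key=lambda x: x[1], reverse=True)
```

Run with the six trigger feature classes, this produces the 63 ranked combinations summarized in Table 2.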

3. Results

3.1. Evaluation of Individual Feature Classes by a Quantitative Method. Based on the AEE algorithm, all of the combinations of the feature classes are used in classifiers, and their


Table 2: Top 10 best classifiers with corresponding feature classes in the combination experiment.

Rank | F-score (%), trigger feature combination (edge features fixed) | F-score (%), edge feature combination (trigger features fixed)
1st | 51.34, 1&2&3&4&5 | 52.16, 1&2&4&5&6&7&8
2nd | 51.21, 1&2&3&4&5&6 | 51.81, 1&2&4&5&6&8
3rd | 50.71, 1&2&4&5 | 51.29, 1&2&5&6&7&8
4th | 50.19, 1&2&3&4&6 | 51.18, 1&2&5&6&8
5th | 49.90, 1&2&4&5&6 | 50.33, 1&2&4&6&7&8
6th | 49.74, 1&3&4&5 | 50.23, 1&2&4&6&8
7th | 49.44, 1&4&5 | 49.26, 2&4&5&6&8
8th | 49.16, 1&2&3&4 | 49.02, 1&2&4&5&6
9th | 48.82, 1&2&4&6 | 48.32, 2&4&5&6&7&8
10th | 47.82, 1&2&4 | 47.42, 2&5&6&8

Table 3: Score of features in trigger detection.

Feature ID | 1 Sentence feature | 2 Main feature | 3 Linear order feature | 4 Content feature | 5 Attached edge feature | 6 Chain feature | Theoretical maximum | Theoretical minimum
AEE1 score | 40.83 | 39.94 | 32.99 | 52.23 | 36.12 | 30.09 | 53.43 | 10.27
AEE2 score | 10.80 | 10.82 | 8.99 | 14.16 | 9.80 | 8.43 | 20.52 | 3.28

Table 4: Score of features in edge detection.

Feature ID | 1 Entity feature | 2 Path length feature | 4 Single element feature | 5 Path grams feature | 6 Path edge feature | 7 Sentence feature | 8 GENIA feature | Theoretical maximum | Theoretical minimum
AEE1 score | 77.60 | 83.84 | 68.45 | 74.88 | 77.76 | 56.09 | 66.35 | 107.61 | 20.08
AEE2 score | 18.63 | 19.57 | 16.57 | 17.51 | 17.88 | 13.40 | 15.45 | 35.80 | 5.54

corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of the feature classes is assessed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the total calculation is shown in Supplementary Material Appendix B. The final results are collected in Tables 3 and 4.

During trigger detection, the results show that the 4th feature performs best and the 6th feature performs worst. Based on the calculated AEEi values, Figures 5 and 6 show plots of the best and the worst feature classes in trigger detection.

The comparison between the two features shows how much better the 4th feature performs than the 6th feature; the values 52.23 and 30.09 also correspond to the areas under the respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

3.2. Modification of Features and New Experiment Results. Combinations of features for trigger detection show that the "content feature" in trigger detection is the most important one and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the generation rule of the 4th feature into consideration, the "content feature" class contains four features: "upper", "has", "dt", and "tt". Specifically, "upper" identifies the upper or lower case of letters, "has" addresses the existence of a digit or hyphen, "dt" records continuous double letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, the feature generation rule can be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which four consecutive letters in the word are considered. Moreover, modification is performed on the 6th feature class by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
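A sketch of how the "content feature" class and the added "ft" feature can be generated from a token; the feature names and key format here are our own illustration, not the exact TEES identifiers:

```python
import re

def content_features(token):
    """Character-level content features: case ("upper"), digit/hyphen ("has"),
    and letter n-grams -- pairs ("dt"), triples ("tt"), and the newly added
    four-letter grams ("ft")."""
    feats = {}
    if token[:1].isupper():
        feats["upper"] = 1
    if any(ch.isdigit() for ch in token):
        feats["has_digit"] = 1
    if "-" in token:
        feats["has_hyphen"] = 1
    letters = re.sub(r"[^a-z]", "", token.lower())
    for size, tag in ((2, "dt"), (3, "tt"), (4, "ft")):
        for k in range(len(letters) - size + 1):
            feats[f"{tag}_{letters[k:k + size]}"] = 1
    return feats
```

For example, `content_features("Binds")` includes "upper", "dt_bi", "tt_bin", and "ft_bind".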

Furthermore, we compare the best performance between classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features


Figure 5: AEE1 and AEE2 plots of the best and worst features in trigger detection. [The four panels plot the c4j and c6j values (0 to 1) against the experiment round for AEE1(4), AEE1(6), AEE2(4), and AEE2(6).]

(all including the original edge features). The best trigger feature sets are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, with F-scores of 51.34, 52.21, 51.93, and 51.99. The new combination results are listed in Supplementary Material Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21. Generally, the best feature combination covers the major part of the feature set. An interesting phenomenon in the best combination is the absence of the 6th or 6′ feature class, which indicates that this feature class is redundant and can be dispensed with.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best and the worst features (the 2nd and 7th features) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and 7′. With the fixed trigger features 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features. We obtain the best classifier in each combination experiment. The best classifiers own the feature sets 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, and 1&2′&4&5&6&7′&8, respectively, and the achieved F-scores are 52.16, 52.37, 52.47, and 52.68.

In the above experiments, we test the performance of the trigger feature classes by fixing the edge features; likewise, we test the edge features by fixing the trigger features. We observe that the feature modifications in this phase are indeed capable of achieving improvement, where all of the best combinations perform better than the result of the best trigger. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and the best combination of feature sets eventually achieves the highest score of 53.27, which is better than the best performance of 51.21 previously


Figure 6: AEE1 and AEE2 plots of the best and worst features in edge detection. [The four panels plot the c2j and c7j values (0 to 1) against the experiment round, up to about 120 rounds, for AEE1(2), AEE1(7), AEE2(2), and AEE2(7).]

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with trigger features 1&2&3&4′&5 and edge features 1&2′&4&5&6&7′&8, where trigger-c = 250000, edge-c = 28000, and β = 0.7. The final result is listed in Table 5.

Compared with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contribution is presented, which yields the same finding and supports the same conclusion: the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we obtain separate F-scores based on these features. Dividing the F-score by the feature size, we roughly calculate the average contribution. This value partly reflects the contribution of each feature class relative to its size. The results in Table 7 show that the average contribution of the 4th feature is 0.003, which is the greatest score achieved by any individual feature class.
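This routine calculation can be reproduced directly from the single-class rows of Table 7 (values transcribed here for illustration):

```python
# (F-score %, feature size) when only that trigger feature class is retained.
single_class = {
    1: (0.00, 18998),   # sentence feature
    2: (42.05, 24944),  # main feature
    3: (3.50, 73744),   # linear order feature
    4: (27.11, 8573),   # content feature
    5: (7.33, 100561),  # attached edge feature
    6: (5.48, 178345),  # chain feature
}

avg_contribution = {i: f / size for i, (f, size) in single_class.items()}
best = max(avg_contribution, key=avg_contribution.get)
```

Here `best` is 4 (the content feature class), with an average contribution of about 0.003.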

Second, we examine all of the double combinations involving the ith feature and observe that, in most cases, combining the ith feature with the 4th feature yields the best performance score. Even the worst-performing double feature combinations involving the 4th feature perform much better than the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8).

Task 1 | Recall | Precision | F-score
Strict evaluation mode | 46.23 | 62.84 | 53.27
Approximate span and recursive mode | 49.69 | 67.48 | 57.24
Event decomposition in the approximate span mode | 51.19 | 73.21 | 60.25

Task 2 | Recall | Precision | F-score
Strict evaluation mode | 44.90 | 61.11 | 51.77
Approximate span and recursive mode | 48.41 | 65.81 | 55.79
Event decomposition in the approximate span mode | 50.52 | 73.15 | 59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2 × (Recall × Precision)/(Recall + Precision).
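The evaluation measures in the footnote of Table 5 can be written directly as:

```python
def evaluation_scores(tp, fp, fn):
    """Recall, precision, and F-score from true-positive, false-positive,
    and false-negative counts, as defined in Table 5."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score
```

For example, `evaluation_scores(1, 1, 1)` returns `(0.5, 0.5, 0.5)`.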

Figure 7: F-score contour performance of participants in the GENIA task, evaluated under the approximate span and recursive mode (precision versus recall, both axes from 0 to 1). The plot shows TEES in Bjorne 2009, TEES in Bjorne 2011, the other participants in the GENIA task of BioNLP 2009, the TEES101 baseline, and our current system, which is marked with a full triangular label.

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, combining the ith feature with the 4th feature yields the best performance. As before, even the worst performance of the combinations involving the 4th feature is much better than that of the other combinations; see Table 7.

Finally, in yet another analysis, we examine the cases where the ith feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. In the numerical scores, the contribution value of the 4th feature is greater than the others, which confirms this judgment. Furthermore, we can sort these features according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis provides substantial positive evidence for the 4th trigger feature, which is confirmed by the results of the quantitative analysis with the AEE algorithm. This shows a consistent tendency of feature importance, which in turn demonstrates the reliability of the AEE algorithm. Since it is clumsy to use the routine analysis on all of the features, we expect the AEEi algorithms to be useful in generalized circumstances.


Feature FeatureFeature size 18998 24944 73744 8573 100561 178345(Merely one feature analysis)119865-score under one class 0 4205 350 2711 733 548Average contribution 0 0001 4119890 minus 5 0003 7119890 minus 5 3119890 minus 5

(Double feature combination analysis)Best performance (+4) (+4) (+4) (+1) (+2) (+4)Involving ith feature 4539 3665 2229 4539 2827 2348Worst performance (+2) (+1) (+1) (+3) (+1) (+1)Involving ith feature 0 0 0 2229 091 310

(Three-feature combination analysis)Best performance (+4 5) (+1 4) (+1 4) (+1 5) (+1 4) (+1 4)Involving ith feature 4944 4782 4513 4944 4944 4551Worst performance (+2 3) (+1 3) (+1 2) (+3 6) (+1 3) (+1 3)Involving ith feature 011 011 011 2076 198 240

(Four-feature combination analysis)Best performance (+2 4 5) (+1 4 5) (+1 4 5) (+1 2 5) (+1 2 4) (+1 2 4)Involving ith feature 5071 5071 4974 5071 5071 4882Worst performance (+2 3 6) (+1 3 6) (+1 2 6) (+3 5 6) (+1 2 3) (+1 2 3)Involving ith feature 577 577 577 2113 702 577

(Five-feature combination analysis)PerformanceWithout ith feature 3422 471 4990 1637 5019 5134

Table 8 Analysis of average contribution of lexical features

Feature Feature class 119865-score Feature size Average contribution1 Nonstem Main feature 643 154 0041752 POS Main feature 152 47 0032343 dt Content feature 2013 1172 0017184 tt Content feature 2763 7395 0003745 Stem Main feature 3545 11016 000322

reported in this paper is helpful for further improving thissystem We also believe that our strategy will also serve as anexample for feature selection in order to achieve enhancedperformance for machine learning systems in general

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by a grant received from the General Research Fund of the University Grants Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711), City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188), the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120), and the National Natural Science Foundation of China (Grant no. 61202305). The authors would also like to acknowledge support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, article S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77–94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the NAACL 2009 Workshop on Natural Language Processing in Biomedicine (BioNLP), pp. 1–9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1–6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10–18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP," in Proceedings of the BioNLP Shared Task Workshop, pp. 51–55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541–557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 shared task," BMC Bioinformatics, vol. 13, supplement 11, article S4, 2012.

[17] J. Bjorne, S. Van Landeghem, S. Pyysalo et al., "Scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82–90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28–36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183–191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP shared task," in Proceedings of the Association for Computational Linguistics (ACL '13), pp. 16–25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309–311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684–689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. Carlos Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306–313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. Van Landeghem, T. Abeel, Y. Saeys, and Y. Van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554–i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.


Step 1. Categorize the features into n feature classes; there are then 2^n − 1 possible feature combinations. Denote N = 2^n − 1 and run all of the classification tests; the F-score is recorded for each experiment.

Step 2. Sort the N feature combinations by F-score in descending order.

Step 3. FOR i = 1 to n, j = 1 to N: set O_ij = 0 and c_ij = 0. END FOR
(O_ij records the occurrences of the i-th feature among the top j experiments; c_ij evaluates the partial importance of the i-th feature after j experiments.)

Step 4. FOR j = 1 to N (for each feature combination in the j-th experiment):
    FOR i = 1 to n (for each feature):
        IF the i-th feature occurs in the j-th feature combination THEN O_ij++
        (for each experiment, the occurrence of the i-th feature is accumulated) END IF
        c_ij = O_ij / j
        (Dividing O_ij by j gives a numerical value that evaluates the importance of the i-th feature. Since the combinations are ranked by F-score in descending order, a top-ranked combination corresponds to a smaller j and hence a bigger c_ij.)
    END FOR
END FOR

Step 5. FOR i = 1 to n: AEE1(i) = Σ_{j=1}^{N} c_ij END FOR
(Summing c_ij over j for a fixed i gives the accumulated effect of the i-th feature.)

Step 6. AEE1(i) is the objective function of the greedy search algorithm; sort the AEE1(i) values to identify the most important feature.

Algorithm 1: The AEE1 algorithm.

The occurrence of the i-th feature class in the top j combinations (j runs from the top one to the last one) is counted, the rate of occurrence is computed, and finally the sum of the occurrences is calculated. The total calculation is shown in Supplementary Material, Appendix C. The algorithm is denoted as the AEE1 algorithm, as shown in Algorithm 1.
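Algorithm 1 can be sketched compactly in code, assuming the feature combinations have already been scored and ranked; the function and variable names here are ours, not part of TEES:

```python
def aee1(ranked_combos, n):
    """Accumulated effect evaluation (AEE1) of n feature classes.

    ranked_combos: feature combinations (sets of class indices 1..n),
    sorted by classifier F-score in descending order.
    Returns the AEE1(i) score of every feature class i.
    """
    score = {i: 0.0 for i in range(1, n + 1)}
    occ = {i: 0 for i in range(1, n + 1)}   # O_ij: occurrences among the top j
    for j, combo in enumerate(ranked_combos, start=1):
        for i in range(1, n + 1):
            if i in combo:
                occ[i] += 1
            score[i] += occ[i] / j          # accumulate c_ij = O_ij / j
    return score
```

For example, with two feature classes ranked as 1&2, 1, 2 (Rank Result A of Table 1), `aee1` returns 2.667 for class 1 and 2.167 for class 2, matching the worked example below.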

Here AEE1(i) is the contribution value of the i-th feature and also the objective function of the greedy search algorithm, reflecting the accumulated effect of the i-th feature among the top classifiers. Since AEE1(i) is meaningful in a comparative way, the theoretical maximum and minimum of AEE1(i) are calculated by

Max AEE1 = 1 × 2^(n−1) + 2^(n−1)/(2^(n−1) + 1) + 2^(n−1)/(2^(n−1) + 2) + ⋯ + 2^(n−1)/(2^n − 1),

Min AEE1 = 0 × (2^(n−1) − 1) + 1/2^(n−1) + 2/(2^(n−1) + 1) + ⋯ + 2^(n−1)/(2^n − 1).
(4)

The idea of AEE1 comes from the understanding that the top classifiers with higher F-scores include more efficient and important features. However, this consideration disregards the feature size of those top classifiers.
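The bounds in (4) can be checked numerically; for the n = 6 trigger feature classes they evaluate to about 53.43 and 10.27, the theoretical maximum and minimum later listed in Table 3. A sketch (function name ours):

```python
def aee1_bounds(n):
    """Theoretical maximum/minimum of AEE1 from equation (4)."""
    half = 2 ** (n - 1)   # each feature class occurs in 2^(n-1) of the combinations
    N = 2 ** n - 1
    # Max: the feature occurs in all of the top 2^(n-1) ranked combinations.
    mx = half * 1.0 + sum(half / j for j in range(half + 1, N + 1))
    # Min: the feature occurs only in the bottom 2^(n-1) ranked combinations.
    mn = sum(k / (half - 1 + k) for k in range(1, half + 1))
    return mx, mn
```

`aee1_bounds(6)` evaluates to roughly (53.43, 10.27), matching Table 3.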

For better understanding, a simple case is considered. Assume there are two feature classes that could be used in the classifier, so there are three feature combinations for the classifier, namely, 1, 2, and 1&2. Without loss of generality, we assume that the best classifier uses the feature classes 1&2, the second one uses the feature class 1, and the worst classifier uses the feature class 2. We denote the rank list 1&2, 1, 2 as Rank Result A. According to the third column in Table 1(a), the AEE value is calculated as AEE1(1) = 1/1 + 2/2 + 2/3 = 2.667.

As another example, we assume a rank list 1, 1&2, 2, which is denoted as Rank Result B and shown in Table 1(b). As can be seen from Table 1(b), AEE1(1) again equals 2.667, the same as in Table 1(a). However, this is not a reasonable result, since feature 1 is more significant in Rank Result B than in Rank Result A if the feature class size is considered.

Therefore, an alternative algorithm, AEE2, is proposed by updating c_ij = O_ij/j in Step 4 of AEE1 to c_ij = O_ij / Σ_{i=1}^{n} O_ij. Here the size of the feature combinations is considered in the computation of c_ij, and a classifier with higher performance and a smaller feature size will ensure a high score for the features it owns. As an example, column 4 in Tables 1(a) and 1(b) shows that AEE2(1) = 1.667 in Rank Result A and AEE2(1) = 2.167 in Rank Result B. The result ensures the comparative advantage for feature class 1 in Rank Result B. Similarly, AEE2(i) mainly makes sense in a comparative way, with theoretical maximum and minimum values that are computed analogously.
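AEE2 differs from AEE1 only in the normalization of c_ij. The sketch below (names ours) reproduces the Table 1 values for the two rank results:

```python
def aee2(ranked_combos, n):
    """Like AEE1, but c_ij = O_ij / sum_i O_ij, so combination size matters."""
    score = {i: 0.0 for i in range(1, n + 1)}
    occ = {i: 0 for i in range(1, n + 1)}
    for combo in ranked_combos:
        for i in combo:
            occ[i] += 1                  # update every O_ij at this rank j
        total = sum(occ.values())        # sum of O_ij over all feature classes
        for i in range(1, n + 1):
            score[i] += occ[i] / total
    return score

rank_a = [{1, 2}, {1}, {2}]              # Rank Result A
rank_b = [{1}, {1, 2}, {2}]              # Rank Result B
```

Here `aee2(rank_a, 2)[1]` is 1.667 while `aee2(rank_b, 2)[1]` is 2.167, so AEE2 separates the two rankings that AEE1 scores identically.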


Table 1: Examples for the AEEi algorithms.

(a) AEEi result for Rank Result A

Rank Result A | O_1j | O_2j | AEE1 c_1j | AEE1 c_2j | AEE2 c_1j | AEE2 c_2j
j = 1: 1&2 | 1 | 1 | 1/1 | 1/1 | 1/2 | 1/2
j = 2: 1 | 2 | 1 | 2/2 | 1/2 | 2/3 | 1/3
j = 3: 2 | 2 | 2 | 2/3 | 2/3 | 2/4 | 2/4
AEE1(1) = 2.667, AEE1(2) = 2.167; AEE2(1) = 1.667, AEE2(2) = 1.333

(b) AEEi result for Rank Result B

Rank Result B | O_1j | O_2j | AEE1 c_1j | AEE1 c_2j | AEE2 c_1j | AEE2 c_2j
j = 1: 1 | 1 | 0 | 1/1 | 0/1 | 1/1 | 0/1
j = 2: 1&2 | 2 | 1 | 2/2 | 1/2 | 2/3 | 1/3
j = 3: 2 | 2 | 2 | 2/3 | 2/3 | 2/4 | 2/4
AEE1(1) = 2.667, AEE1(2) = 1.167; AEE2(1) = 2.167, AEE2(2) = 0.833

[Figure 4: Flowchart of the research. Input data with chosen feature classes under all combinations; with the edge features fixed, trigger feature combinations are tried, and with the trigger features fixed, edge feature combinations are tried; the classifier performance is tested in the Turku system and the contribution of the feature classes is evaluated; if the result is not yet the best, important features are modified and replaced; otherwise, the feature set is output and the new system is built.]

Considering AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table 1(a), the importance of feature class 1 prevails over that of feature class 2 in both cases. This ensures a reasonable consistency. Actually, both the AEE1 and AEE2 algorithms ensure a correct importance ranking among different feature classes, and a hybrid of the two methods is used in the experiments.

2.4. Flowchart of the Research. Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection, so that important feature classes are identified by evaluating the contribution of the individual features.

Accordingly, code is written to enhance vital features. Thereafter, the classifiers with the updated features are tested, and better performance is expected. Compared with the previous TEES system, our main contribution is to import a feature selection strategy in Phase 1, that is, "linguistic feature selection", as shown in Figure 2. The flowchart of our research is shown in Figure 4.
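The search procedure of Figure 4 can be outlined in code. This is only an illustrative sketch: `score_fn` stands in for a full train-and-evaluate run of TEES on one feature combination, and the function names are ours:

```python
from itertools import combinations

def best_combination(classes, score_fn):
    """Exhaustively score every non-empty combination of feature classes
    (2^n - 1 classifier runs) and return the best set with its F-score."""
    best, best_f = None, -1.0
    for r in range(1, len(classes) + 1):
        for combo in combinations(classes, r):
            f = score_fn(frozenset(combo))  # e.g. train/evaluate with this set
            if f > best_f:
                best, best_f = frozenset(combo), f
    return best, best_f
```

In the setting of this paper, such a loop is run twice: once over the six trigger feature classes with the edge features fixed, and once over the edge feature classes with the trigger features fixed; the AEE scores computed from the ranked results then guide which classes to modify.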

3. Results

3.1. Evaluation of Individual Feature Classes by the Quantitative Method. Based on the AEE algorithm, all of the combinations of the feature classes are used in classifiers, and their corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of the feature classes is addressed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the total calculation is shown in Supplementary Material, Appendix B. The final results are collected in Tables 3 and 4.

During trigger detection, the results show that the 4th feature performs best and the 6th feature performs worst. After calculating the AEEi values of the features, Figures 5 and 6 show plots of the best and the worst feature classes in trigger and edge detection.

The comparison between the two features shows how the 4th feature performs better than the 6th feature; the scores 52.23 and 30.09 also correspond to the areas below the respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

Table 2: Top 10 best classifiers with corresponding feature classes in the combination experiments.

Rank | F-score (%) | Trigger feature combination (edge features fixed) | F-score (%) | Edge feature combination (trigger features fixed)
1st | 51.34 | 1&2&3&4&5 | 52.16 | 1&2&4&5&6&7&8
2nd | 51.21 | 1&2&3&4&5&6 | 51.81 | 1&2&4&5&6&8
3rd | 50.71 | 1&2&4&5 | 51.29 | 1&2&5&6&7&8
4th | 50.19 | 1&2&3&4&6 | 51.18 | 1&2&5&6&8
5th | 49.90 | 1&2&4&5&6 | 50.33 | 1&2&4&6&7&8
6th | 49.74 | 1&3&4&5 | 50.23 | 1&2&4&6&8
7th | 49.44 | 1&4&5 | 49.26 | 2&4&5&6&8
8th | 49.16 | 1&2&3&4 | 49.02 | 1&2&4&5&6
9th | 48.82 | 1&2&4&6 | 48.32 | 2&4&5&6&7&8
10th | 47.82 | 1&2&4 | 47.42 | 2&5&6&8

Table 3: Scores of features in trigger detection.

Feature ID | 1 | 2 | 3 | 4 | 5 | 6 | Theoretical maximum | Theoretical minimum
Feature name | Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature | |
AEE1 score | 40.83 | 39.94 | 32.99 | 52.23 | 36.12 | 30.09 | 53.43 | 10.27
AEE2 score | 10.80 | 10.82 | 8.99 | 14.16 | 9.80 | 8.43 | 20.52 | 3.28

Table 4: Scores of features in edge detection.

Feature ID | 1 | 2 | 4 | 5 | 6 | 7 | 8 | Theoretical maximum | Theoretical minimum
Feature name | Entity feature | Path length feature | Single element feature | Path grams feature | Path edge feature | Sentence feature | GENIA feature | |
AEE1 score | 77.60 | 83.84 | 68.45 | 74.88 | 77.76 | 56.09 | 66.35 | 107.61 | 20.08
AEE2 score | 18.63 | 19.57 | 16.57 | 17.51 | 17.88 | 13.40 | 15.45 | 35.80 | 5.54

3.2. Modification of Features and New Experiment Results. Combinations of features for trigger detection show that the "content feature" is the most important one and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the generation rule of the 4th feature into consideration, the "content feature" class contains four features: "upper", "has", "dt", and "tt". Specifically, "upper" identifies the upper or lower case of letters, "has" records the existence of a digit or hyphen, "dt" records two continuous letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, the feature generation rule can be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which four consecutive letters in the word are considered. Moreover, a modification is performed on the 6th feature class by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
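The content-feature generation rules described above can be sketched as follows; the feature-string names (e.g. "dt_", "ft_", "has_digit") are illustrative, and the exact strings TEES emits differ:

```python
def content_features(token):
    """Character-level 'content' features of a trigger candidate token
    (illustrative sketch of the upper/has/dt/tt rules plus the new ft rule)."""
    feats = set()
    if token[:1].isupper():
        feats.add("upper")                   # upper/lower case of the first letter
    if any(ch.isdigit() for ch in token):
        feats.add("has_digit")               # 'has': existence of a digit...
    if "-" in token:
        feats.add("has_hyphen")              # ...or a hyphen
    low = token.lower()
    feats.update("dt_" + low[i:i + 2] for i in range(len(low) - 1))  # 2 letters
    feats.update("tt_" + low[i:i + 3] for i in range(len(low) - 2))  # 3 letters
    feats.update("ft_" + low[i:i + 4] for i in range(len(low) - 3))  # new: 4 letters
    return feats
```

For a token such as "Up-regulates" this yields "upper", "has_hyphen", and character n-grams like "tt_reg" and "ft_regu".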

[Figure 5: AEE1 and AEE2 plots of the best (4th) and worst (6th) feature classes in trigger detection; the c_4j and c_6j values are plotted against the experiment round.]

Furthermore, if we compare the best performance between classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features (all including the original edge features), the best trigger feature combinations are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, and the F-score reaches 51.34, 52.21, 51.93, and 51.99. The new combination results are listed in Supplementary Material, Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21. Generally, the best feature combination covers the major part of the feature set. An interesting phenomenon in the best combination is the absence of the 6th (or 6′) feature class, which indicates that this feature class is redundant and can be dispensed with.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best and the worst features (the 2nd and 7th features) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and 7′. With the trigger features fixed as 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features. The best classifier in each combination experiment owns the feature set 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, or 1&2′&4&5&6&7′&8, respectively, and the achieved F-scores are 52.16, 52.37, 52.47, and 52.68.

In the above experiments, we test the performance of the trigger feature classes by fixing the edge features; likewise, we test the edge features by fixing the trigger features. We observe that the feature modifications in this phase are indeed capable of achieving improvement, where all of the best combinations perform better than the result of the best trigger. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and the resulting feature set achieves the highest score of 53.27, which is better than the best performance of 51.21 previously reported for TEES 2.0.

[Figure 6: AEE1 and AEE2 plots of the best (2nd) and worst (7th) feature classes in edge detection; the c_2j and c_7j values are plotted against the experiment round.]

Therefore, it is concluded that the best classifier has a feature set with trigger features 1&2&3&4′&5 and edge features 1&2′&4&5&6&7′&8, where trigger-c = 250000, edge-c = 28000, and β = 0.7. The final result is listed in Table 5.

Compared with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contributions is shown, which yields the same finding and supports the same conclusion: the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we obtain a separate F-score for each feature class. Dividing the F-score by the feature size, we roughly calculate the average contribution. This value partly reflects the contribution of a feature class relative to its size. The results in Table 7 show that the average contribution of the 4th feature is 0.003, the greatest score achieved by any individual feature class.
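The rough per-feature contribution described here is just the single-class F-score divided by the class size; with the values from Table 7 it singles out the 4th (content) class:

```python
# Single-class F-scores (%) and feature-class sizes from Table 7 (trigger detection)
f_alone = {1: 0.0, 2: 42.05, 3: 3.50, 4: 27.11, 5: 7.33, 6: 5.48}
size    = {1: 18998, 2: 24944, 3: 73744, 4: 8573, 5: 100561, 6: 178345}

# Rough average contribution: F-score divided by the number of features in the class
avg = {i: f_alone[i] / size[i] for i in f_alone}
best = max(avg, key=avg.get)   # the content feature class (4) comes out on top
```

Here `avg[4]` rounds to 0.003, the value quoted above, despite the content class having by far the smallest feature count.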

Second, we observe all of the double combinations involving the i-th feature and find that, in most cases, combining the i-th feature with the 4th feature reaches the best performance score. Even the worst performing double combinations involving the 4th feature perform much better than the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8).

Task 1 | Recall | Precision | F-score
Strict evaluation mode | 46.23 | 62.84 | 53.27
Approximate span and recursive mode | 49.69 | 67.48 | 57.24
Event decomposition in the approximate span mode | 51.19 | 73.21 | 60.25

Task 2 | Recall | Precision | F-score
Strict evaluation mode | 44.90 | 61.11 | 51.77
Approximate span and recursive mode | 48.41 | 65.81 | 55.79
Event decomposition in the approximate span mode | 50.52 | 73.15 | 59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2 × (Recall × Precision)/(Recall + Precision).
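The footnote's formulas can be checked directly; for instance, the Task 1 strict-mode recall and precision reproduce the reported F-score:

```python
def f_score(recall, precision):
    """F-score as the harmonic mean of recall and precision (Table 5 footnote)."""
    return 2 * recall * precision / (recall + precision)

# Task 1, strict evaluation mode: recall 46.23 %, precision 62.84 %
# round(f_score(0.4623, 0.6284), 4) -> 0.5327, i.e. the 53.27 % in Table 5
```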

[Figure 7: F-score contour performance of the participants in the GENIA task, plotted as recall against precision and evaluated under the approximate span and recursive mode. The plot compares TEES in Bjorne 2009, TEES in Bjorne 2011, the other participants in the GENIA task of BioNLP 2009, the TEES 1.01 baseline, and our current system, which is marked with a full triangular label.]

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, combining the i-th feature with the 4th feature reaches the best performance. As before, even the worst performing combinations involving the 4th feature are much better than the other combinations. See Table 7.

Finally, in yet another analysis, we observe the cases where the i-th feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. In the numerical scores, the contribution value of the 4th feature is greater than that of the others, which confirms the judgment. Furthermore, we can sort the features according to their contribution values. These results further support our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis gives a substantially positive evaluation of the 4th trigger feature, which agrees with the results of the quantitative analysis by the AEE algorithm. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is cumbersome to analyze all of the features by routine analysis, we expect the AEEi algorithms to be useful in more general circumstances.

4.2. Linguistic Analysis of the Chosen Trigger Features and the Importance of Microlexical Features. The effectiveness of the 4th trigger feature motivated the feature class modification by inserting "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which is similar to stems but contains more abundant information than stems. Besides, we can also analyze and ascertain the importance of other features, such as POS information, through feature combinations. Here the edge features are fixed, and only a smaller section of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material, Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" are the stem part and the remaining part of a token after applying the Porter stemmer [29]. They are also generated at the microlexical level, similar to "dt" and "tt". The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.
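As an illustration of the stem/nonstem split, the sketch below strips one common suffix; it is a toy stand-in, not the actual Porter algorithm [29], and in practice one would use a full Porter stemmer implementation:

```python
def stem_split(token):
    """Split a token into (stem, nonstem) by stripping one common suffix.
    Toy stand-in for the Porter stemmer used in the paper."""
    low = token.lower()
    for suf in ("ational", "ation", "izes", "ing", "ed", "es", "s"):
        # only strip when a reasonably long stem remains
        if low.endswith(suf) and len(low) > len(suf) + 2:
            return low[: -len(suf)], suf
    return low, ""
```

For example, "phosphorylation" splits into the stem "phosphoryl" and the nonstem "ation"; both halves then serve as microlexical features of the token.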

4.3. Discussion of the Feature Selection Strategy under Machine Learning. As a matter of fact, trigger features can be analyzed according to their generation rules, namely, the sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature. This is a state-of-the-art strategy in feature selection. TEES is a fine machine learning based system, which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research mainly stems from two reasons: first, the natural core idea of machine learning is simply to put enough features into the classifier as a black box; second, a classifier with a huge feature set usually performs better in accuracy and F-score. Previously, the features used in TEES have mostly been chosen by trial and error: by adding code to produce additional features and observing how they impact the performance of the system, TEES always achieves a higher F-score, but without a clear direction. This strategy introduces useful but uncertain features and produces a very large feature set. By introducing a feature selection and evaluation method, the importance of different feature classes can be ranked in a proper way, which helps to identify the important features to be modified effectively. Therefore, in terms of the current research regarding the development of the Turku Event Extraction System, we believe that our research

BioMed Research International 11

Table 6: Comparison of F-score performance in Task 1 with other systems (under the primary criterion: approximate span and recursive).

System: Bjorne et al. 2009 [13] | Bjorne et al. 2011 [15] | Bjorne et al. 2012 [16] | TEES 1.01 | Riedel et al. 2011 [14] | Ours
F-score (%): 51.95 | 52.86 | 53.15 | 55.65 | 56.00 | 57.24

Table 7: The best feature combination after choosing features dynamically in trigger and edge detection.

Feature ID: 1 | 2 | 3 | 4 | 5 | 6
Feature in trigger: Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature
Feature size: 18998 | 24944 | 73744 | 8573 | 100561 | 178345

(Merely one feature analysis)
F-score under one class: 0 | 42.05 | 3.50 | 27.11 | 7.33 | 5.48
Average contribution: 0 | 0.001 | 4e-5 | 0.003 | 7e-5 | 3e-5

(Double feature combination analysis)
Best performance: (+4) | (+4) | (+4) | (+1) | (+2) | (+4)
F-score involving ith feature: 45.39 | 36.65 | 22.29 | 45.39 | 28.27 | 23.48
Worst performance: (+2) | (+1) | (+1) | (+3) | (+1) | (+1)
F-score involving ith feature: 0 | 0 | 0 | 22.29 | 0.91 | 3.10

(Three-feature combination analysis)
Best performance: (+4,5) | (+1,4) | (+1,4) | (+1,5) | (+1,4) | (+1,4)
F-score involving ith feature: 49.44 | 47.82 | 45.13 | 49.44 | 49.44 | 45.51
Worst performance: (+2,3) | (+1,3) | (+1,2) | (+3,6) | (+1,3) | (+1,3)
F-score involving ith feature: 0.11 | 0.11 | 0.11 | 20.76 | 1.98 | 2.40

(Four-feature combination analysis)
Best performance: (+2,4,5) | (+1,4,5) | (+1,4,5) | (+1,2,5) | (+1,2,4) | (+1,2,4)
F-score involving ith feature: 50.71 | 50.71 | 49.74 | 50.71 | 50.71 | 48.82
Worst performance: (+2,3,6) | (+1,3,6) | (+1,2,6) | (+3,5,6) | (+1,2,3) | (+1,2,3)
F-score involving ith feature: 5.77 | 5.77 | 5.77 | 21.13 | 7.02 | 5.77

(Five-feature combination analysis)
F-score without ith feature: 34.22 | 4.71 | 49.90 | 16.37 | 50.19 | 51.34

Table 8: Analysis of the average contribution of lexical features.

No. | Feature | Feature class | F-score | Feature size | Average contribution
1 | Nonstem | Main feature | 6.43 | 154 | 0.04175
2 | POS | Main feature | 1.52 | 47 | 0.03234
3 | dt | Content feature | 20.13 | 1172 | 0.01718
4 | tt | Content feature | 27.63 | 7395 | 0.00374
5 | Stem | Main feature | 35.45 | 11016 | 0.00322
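The "average contribution" used in Tables 7 and 8 is simply the F-score obtained when a single feature class is used alone, divided by the number of features in that class. A minimal sketch (the values are those reported in Table 8; the helper name is ours, not part of TEES):

```python
# Average contribution = single-class F-score / feature class size.
# Rows are (feature, F-score in %, feature size) as reported in Table 8.
ROWS = [
    ("nonstem", 6.43, 154),
    ("POS", 1.52, 47),
    ("dt", 20.13, 1172),
    ("tt", 27.63, 7395),
    ("stem", 35.45, 11016),
]

def average_contribution(f_score, size):
    return round(f_score / size, 5)

for name, f, size in ROWS:
    print(name, average_contribution(f, size))
```

Running the sketch reproduces the table's last column (0.04175 down to 0.00322), which also confirms the decimal placement of the reported scores.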

reported in this paper is helpful for further improving this system. We also believe that our strategy can serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by a grant received from the General Research Fund of the University Grant Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711); City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188); the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120); and the National Natural Science Foundation of China (Grant no. 61202305). The authors would also like to acknowledge support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, suppl. 2, article S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77–94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the BioNLP 2009 Workshop (NAACL), pp. 1–9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1–6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10–18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP," in Proceedings of the BioNLP Shared Task Workshop, pp. 51–55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541–557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 Shared Task," BMC Bioinformatics, vol. 13, suppl. 11, article S4, 2012.

[17] J. Bjorne, S. Van Landeghem, S. Pyysalo et al., "Scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82–90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28–36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183–191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP Shared Task," in Association for Computational Linguistics (ACL '13), pp. 16–25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309–311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684–689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. C. Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306–313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. Van Landeghem, T. Abeel, Y. Saeys, and Y. Van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554–i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.



Table 1: Examples for the AEEi algorithm.

(a) AEEi result for Ranking Result A

j | Rank Result A | O_1j | O_2j | AEE1 c_1j | AEE1 c_2j | AEE2 c_1j | AEE2 c_2j
1 | 1 & 2 | 1 | 1 | 1/1 | 1/1 | 1/2 | 1/2
2 | 1 | 2 | 1 | 2/2 | 1/2 | 2/3 | 1/3
3 | 2 | 2 | 2 | 2/3 | 2/3 | 2/4 | 2/4
AEE1(1) = 2.667, AEE1(2) = 2.167; AEE2(1) = 1.667, AEE2(2) = 1.333

(b) AEEi result for Ranking Result B

j | Rank Result B | O_1j | O_2j | AEE1 c_1j | AEE1 c_2j | AEE2 c_1j | AEE2 c_2j
1 | 1 | 1 | 0 | 1/1 | 0/1 | 1/1 | 0/1
2 | 1 & 2 | 2 | 1 | 2/2 | 1/2 | 2/3 | 1/3
3 | 2 | 2 | 2 | 2/3 | 2/3 | 2/4 | 2/4
AEE1(1) = 2.667, AEE1(2) = 1.167; AEE2(1) = 2.167, AEE2(2) = 0.833

[Figure 4: Flowchart of the research. Input data with the chosen feature classes under all combinations; fix the features in edge detection and try trigger feature combinations; fix the features in trigger detection and try edge feature combinations; test the classifier performance in the Turku system; evaluate the contribution of the feature classes; if the result is the best, output the feature set and build the new system; otherwise modify and replace the important features and repeat.]

Considering AEE1(1) = 2.667 > AEE1(2) = 2.167 and AEE2(1) = 1.667 > AEE2(2) = 1.333 in Table 1(a), the importance of feature class 1 prevails over that of feature class 2 in both cases. This ensures a reasonable consistency. In fact, both the AEE1 and AEE2 algorithms yield a correct importance ranking among different feature classes, and a hybrid of the two methods is used in the experiments.
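The scoring in Table 1 can be reproduced with a short sketch. This is our reading of the AEEi bookkeeping inferred from the table's numbers, not the authors' implementation: O_ij accumulates how often feature class i has ranked on top after round j; AEE1 divides by the round index, while AEE2 divides by the number of ranked entries so far, with ties contributing one entry each:

```python
# Inferred AEE1/AEE2 bookkeeping, checked against Table 1 only.
# Each round j reports the feature classes tied for the top rank.
def aee_scores(rounds, classes):
    o = {i: 0 for i in classes}        # cumulative top-rank occurrences O_ij
    slots = 0                          # cumulative ranked entries (ties count)
    aee1 = {i: 0.0 for i in classes}
    aee2 = {i: 0.0 for i in classes}
    for j, winners in enumerate(rounds, start=1):
        for w in winners:
            o[w] += 1
        slots += len(winners)
        for i in classes:
            aee1[i] += o[i] / j        # AEE1: divide by the round index
            aee2[i] += o[i] / slots    # AEE2: divide by entries ranked so far
    return aee1, aee2

# Ranking Result A from Table 1(a): round 1 ties classes 1 and 2,
# round 2 ranks class 1 on top, round 3 ranks class 2 on top.
a1, a2 = aee_scores([(1, 2), (1,), (2,)], classes=(1, 2))
print(round(a1[1], 3), round(a1[2], 3))   # 2.667 2.167
print(round(a2[1], 3), round(a2[2], 3))   # 1.667 1.333
```

Feeding it Ranking Result B, [(1,), (1, 2), (2,)], reproduces Table 1(b) as well: AEE1 gives 2.667 and 1.167, and AEE2 gives 2.167 and 0.833.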

2.4. Flowchart of the Research. Unlike a trial-and-error procedure, the scheme of this research is oriented towards better feature selection, so that important feature classes are identified by evaluating the contribution of the individual features.

Accordingly, code is written to enhance vital features. Thereafter, the classifiers with the newly updated features are tested, and better performance is expected. Compared with the previous TEES system, our main contribution is to introduce a feature selection strategy in Phase 1, that is, "linguistic feature selection" as shown in Figure 2. The flowchart of our research is shown in Figure 4.

3. Results

3.1. Evaluation of Individual Feature Classes by the Quantitative Method. Based on the AEE algorithm, all of the combinations of the feature classes are used in classifiers and their


Table 2: Top 10 best classifiers with corresponding feature classes in the combination experiment.

Rank | F-score (%), trigger feature combination (edge features fixed) | F-score (%), edge feature combination (trigger features fixed)
1st | 51.34, 1&2&3&4&5 | 52.16, 1&2&4&5&6&7&8
2nd | 51.21, 1&2&3&4&5&6 | 51.81, 1&2&4&5&6&8
3rd | 50.71, 1&2&4&5 | 51.29, 1&2&5&6&7&8
4th | 50.19, 1&2&3&4&6 | 51.18, 1&2&5&6&8
5th | 49.90, 1&2&4&5&6 | 50.33, 1&2&4&6&7&8
6th | 49.74, 1&3&4&5 | 50.23, 1&2&4&6&8
7th | 49.44, 1&4&5 | 49.26, 2&4&5&6&8
8th | 49.16, 1&2&3&4 | 49.02, 1&2&4&5&6
9th | 48.82, 1&2&4&6 | 48.32, 2&4&5&6&7&8
10th | 47.82, 1&2&4 | 47.42, 2&5&6&8

Table 3: Scores of features in trigger detection.

Feature ID: 1 | 2 | 3 | 4 | 5 | 6
Feature name: Sentence feature | Main feature | Linear order feature | Content feature | Attached edge feature | Chain feature
AEE1 score: 40.83 | 39.94 | 32.99 | 52.23 | 36.12 | 30.09 (theoretical maximum 53.43, minimum 10.27)
AEE2 score: 10.80 | 10.82 | 8.99 | 14.16 | 9.80 | 8.43 (theoretical maximum 20.52, minimum 3.28)

Table 4: Scores of features in edge detection.

Feature ID: 1 | 2 | 4 | 5 | 6 | 7 | 8
Feature name: Entity feature | Path length feature | Single element feature | Path grams feature | Path edge feature | Sentence feature | GENIA feature
AEE1 score: 77.60 | 83.84 | 68.45 | 74.88 | 77.76 | 56.09 | 66.35 (theoretical maximum 107.61, minimum 20.08)
AEE2 score: 18.63 | 19.57 | 16.57 | 17.51 | 17.88 | 13.40 | 15.45 (theoretical maximum 35.80, minimum 5.54)

corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of the feature classes is assessed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the complete calculation is given in Supplementary Material Appendix B. The final results are collected in Tables 3 and 4.
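The combination experiments enumerate every nonempty subset of the feature classes, retrain the classifier for each subset, and rank the resulting F-scores. A schematic version of that loop (the `evaluate` callback stands in for a full TEES train/test cycle, which is the expensive part in practice):

```python
from itertools import combinations

TRIGGER_CLASSES = (1, 2, 3, 4, 5, 6)   # the six trigger feature classes

def all_subsets(classes):
    """Yield every nonempty subset of the given feature classes."""
    for r in range(1, len(classes) + 1):
        yield from combinations(classes, r)

def rank_by_f_score(evaluate, classes=TRIGGER_CLASSES):
    """Rank all feature-class subsets by the F-score `evaluate` returns."""
    scored = [(evaluate(subset), subset) for subset in all_subsets(classes)]
    return sorted(scored, reverse=True)   # best classifier first

# 2^6 - 1 = 63 classifier runs are needed for the trigger phase alone
print(sum(1 for _ in all_subsets(TRIGGER_CLASSES)))   # 63
```

With the seven edge feature classes, the edge phase needs 2^7 - 1 = 127 runs, consistent with the roughly 60 and 120 experiment rounds visible on the horizontal axes of Figures 5 and 6.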

During trigger detection, the results show that the 4th feature performs best and the 6th feature performs worst. By calculating the AEEi values of the features, Figures 5 and 6 plot the best and the worst feature classes in trigger detection.

The comparison between the two features shows how the 4th feature outperforms the 6th feature; the scores 52.23 and 30.09 also correspond to the areas below the respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

3.2. Modification of Features and New Experiment Results. Combinations of features for trigger detection show that the "content feature" in trigger detection is the most important one and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the generation rule of the 4th feature into consideration, the "content feature" class contains four features: "upper", "has", "dt", and "tt". Specifically, "upper" identifies the upper or lower case of letters, "has" addresses the existence of a digit or hyphen, "dt" records continuous double letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, the feature generation rule can be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which four consecutive letters in the word are considered. Moreover, a modification is performed on the 6th feature class by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
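The content feature generation rule described above amounts to character-level surface features of the token. A hedged sketch (the feature names follow the paper; the actual TEES feature encoding differs in detail):

```python
def content_features(token):
    """Character-level "content" features of a token: case and
    digit/hyphen flags, letter bigrams ("dt") and trigrams ("tt"),
    plus the four-letter "ft" grams proposed in this paper."""
    t = token.lower()
    return {
        "upper": token[0].isupper(),                      # capitalised?
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "dt": {t[k:k + 2] for k in range(len(t) - 1)},    # double letters
        "tt": {t[k:k + 3] for k in range(len(t) - 2)},    # triple letters
        "ft": {t[k:k + 4] for k in range(len(t) - 3)},    # new 4-grams
    }

feats = content_features("Phosphorylation")
print("tion" in feats["ft"], feats["upper"])   # True True
```

Such substrings let the classifier recognise trigger-like morphology (for instance the "-tion"/"-lation" endings of nominalised events) even for tokens unseen in training.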

Furthermore, if we compare the best performance between classifiers with trigger features comprising the original features, the 4′ features, the 6′ features, or the 4′&6′ features


[Figure 5: AEE1 and AEE2 plots of the best feature (c_4j value) and the worst feature (c_6j value) in trigger detection, plotted against the experiment round.]

(all including the original edge features), we find that the best combinations are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, with F-scores of 51.34, 52.21, 51.93, and 51.99. The new combination results are listed in Supplementary Material Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21. Generally, the best feature combination covers the major part of the feature set. An interesting phenomenon in the best combination is the absence of the 6th or 6′ feature class, which indicates that this feature class is redundant and can be dropped.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best feature and the worst feature (the 2nd and 7th features) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and 7′. With the fixed trigger features 1&2&3&4′&5, we test classifiers with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features, and obtain the best classifier in each combination experiment. The best classifiers own the feature sets 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, and 1&2′&4&5&6&7′&8, respectively, and the achieved F-scores are 52.16, 52.37, 52.47, and 52.68.

In the above experiments, we tested the performance of the trigger feature classes by fixing the edge features; likewise, we tested the edge features by fixing the trigger features. We observe that the feature modifications in this phase are indeed capable of achieving improvement, as all of the best combinations perform better than the result of the best trigger setting. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and the best combination of the feature set eventually achieves the highest F-score of 53.27, better than the best performance of 51.21 previously


[Figure 6: AEE1 and AEE2 plots of the best feature (c_2j value) and the worst feature (c_7j value) in edge detection, plotted against the experiment round.]

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with trigger features 1&2&3&4′&5 and edge features 1&2′&4&5&6&7′&8, where trigger-c = 250000, edge-c = 28000, and β = 0.7. The final result is listed in Table 5.

Comparing with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contributions is presented, which yields the same finding and supports the same conclusion, that the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we obtain separate F-scores based on these features. Dividing the F-score by the feature size, we roughly calculate the average contribution. This value partly reflects the contribution of the feature classes in terms of class size. The results in Table 7 show that the average contribution of the 4th feature is 0.003, which is the greatest score achieved by an individual feature class.

Second, we observe all of the double combinations involving the ith feature and find that, in most cases, when the ith feature is combined with the 4th feature, it reaches its best performance score. Even the worst performing double feature combinations involving the 4th feature perform much better than the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8).

Task 1: Recall | Precision | F-score
Strict evaluation mode: 46.23 | 62.84 | 53.27
Approximate span and recursive mode: 49.69 | 67.48 | 57.24
Event decomposition in the approximate span mode: 51.19 | 73.21 | 60.25

Task 2: Recall | Precision | F-score
Strict evaluation mode: 44.90 | 61.11 | 51.77
Approximate span and recursive mode: 48.41 | 65.81 | 55.79
Event decomposition in the approximate span mode: 50.52 | 73.15 | 59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2 × (Recall × Precision)/(Recall + Precision).
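The footnote's definitions can be checked directly against the reported Task 1 figures (a quick sanity check of ours, not part of the original evaluation scripts):

```python
def f_score(recall, precision):
    # F-score = 2 * (recall * precision) / (recall + precision)
    return 2 * recall * precision / (recall + precision)

# Strict-mode Task 1 figures from Table 5: recall 46.23%, precision 62.84%
print(round(f_score(46.23, 62.84), 2))   # 53.27
```

The other rows agree to within the rounding of the reported recall and precision values.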

[Figure 7: F-score contour performance of the participants in the GENIA task (recall against precision), evaluated under the approximate span and recursive mode. The plot marks TEES in Bjorne 2009, TEES in Bjorne 2011, the other participants in the GENIA task of BioNLP 2009, the TEES 1.01 baseline, and our current system (marked with a full triangular label).]

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, when the ith feature is combined with the 4th feature, it reaches its best performance. As before, even the worst performing combinations involving the 4th feature are much better than the other combinations; see Table 7.

Finally, in yet another analysis, we observe the cases where the ith feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. Compared with the other numerical scores, the contribution value of the 4th feature is greater, which confirms the judgment. Furthermore, we can sort these features according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis shows substantial positive evidence for the 4th trigger feature, which is confirmed by the results of the quantitative analysis with the AEE algorithm. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is clumsy to use the routine analysis to examine all of the features, we expect the AEEi algorithm to be useful in generalized circumstances.


[22] S Fong K Lan and R Wong ldquoClassifying human voicesby using hybrid SFX time-series preprocessing and ensemblefeature selectionrdquo BioMed Research International vol 2013Article ID 720834 27 pages 2013

[23] DWang L Yang Z Fu and J Xia ldquoPrediction of thermophilicproteinwith pseudo amino acid composition an approach fromcombined feature selection and reductionrdquo Protein and PeptideLetters vol 18 no 7 pp 684ndash689 2011

[24] S O Cao and J H Manton ldquoUnsupervised optimal discrim-inant vector based feature selection methodrdquo MathematicalProblems in Engineering vol 2013 Article ID 396780 7 pages2013

[25] L Carlos Molina L Belanche and A Nebot ldquoFeature selectionalgorithms a survey and experimental evaluationrdquo in Proceed-ings of the IEEE International Conference onDataMining (ICDMrsquo02) pp 306ndash313 December 2002

[26] C Yang W Zhang J Zou S Hu and J Qiu ldquoFeature selectionin decision systems a mean-variance approachrdquo MathematicalProblems in Engineering vol 2013 Article ID 268063 8 pages2013

[27] S van Landeghem T Abeel Y Saeys and Y van de PeerldquoDiscriminative and informative features for biomolecular textmining with ensemble feature selectionrdquo Bioinformatics vol 26no 18 Article ID btq381 pp i554ndashi560 2010

[28] DM Powers ldquoEvaluation fromprecision recall andF-measureto ROC informedness markedness amp correlationrdquo Journal ofMachine Learning Technologies vol 2 no 1 pp 37ndash63 2011

[29] M F Porter ldquoAn algorithm for suffix strippingrdquo Program vol14 no 3 pp 130ndash137 1980


BioMed Research International 7

Table 2: Top 10 best classifiers with corresponding feature classes in the combination experiment.

Rank   F-score (%)   Trigger feature combination     F-score (%)   Edge feature combination
                     (edge features fixed)                         (trigger features fixed)
1st    51.34         1&2&3&4&5                       52.16         1&2&4&5&6&7&8
2nd    51.21         1&2&3&4&5&6                     51.81         1&2&4&5&6&8
3rd    50.71         1&2&4&5                         51.29         1&2&5&6&7&8
4th    50.19         1&2&3&4&6                       51.18         1&2&5&6&8
5th    49.90         1&2&4&5&6                       50.33         1&2&4&6&7&8
6th    49.74         1&3&4&5                         50.23         1&2&4&6&8
7th    49.44         1&4&5                           49.26         2&4&5&6&8
8th    49.16         1&2&3&4                         49.02         1&2&4&5&6
9th    48.82         1&2&4&6                         48.32         2&4&5&6&7&8
10th   47.82         1&2&4                           47.42         2&5&6&8

Table 3: Score of features in trigger detection.

Feature ID   Feature name            AEE1 score   AEE2 score
1            Sentence feature        40.83        10.80
2            Main feature            39.94        10.82
3            Linear order feature    32.99        8.99
4            Content feature         52.23        14.16
5            Attached edge feature   36.12        9.80
6            Chain feature           30.09        8.43
—            Theoretical maximum     53.43        20.52
—            Theoretical minimum     10.27        3.28

Table 4: Score of features in edge detection.

Feature ID   Feature name             AEE1 score   AEE2 score
1            Entity feature           77.60        18.63
2            Path length feature      83.84        19.57
4            Single element feature   68.45        16.57
5            Path grams feature       74.88        17.51
6            Path edge feature        77.76        17.88
7            Sentence feature         56.09        13.40
8            GENIA feature            66.35        15.45
—            Theoretical maximum      107.61       35.80
—            Theoretical minimum      20.08        5.54

corresponding F-scores are ranked so as to evaluate the contribution of the individual feature classes.

Using the quantitative algorithm, the importance of feature classes is addressed in the two-phase TEES procedure that involves trigger detection and edge detection. For better understanding, the top ten classifiers with their respective feature combinations are shown in Table 2, and the complete calculation is given in Supplementary Material Appendix B. The final results are collected in Tables 3 and 4.
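The combination search behind Table 2 can be sketched as follows. This is a minimal illustration, not the actual TEES code: `train_and_eval` is a hypothetical placeholder standing in for a full TEES training/evaluation run, and the accumulated score here simply sums the F-scores of every combination containing a class, a simplified stand-in for the AEE contribution values defined earlier in the paper.

```python
from itertools import combinations

def evaluate_combinations(feature_ids, train_and_eval):
    """Exhaustively score every non-empty subset of feature classes.
    `train_and_eval` must return an F-score for a given subset."""
    results = []
    for r in range(1, len(feature_ids) + 1):
        for subset in combinations(feature_ids, r):
            results.append((frozenset(subset), train_and_eval(subset)))
    # Rank the classifiers by F-score, best first (as in Table 2).
    return sorted(results, key=lambda kv: kv[1], reverse=True)

def accumulate_effect(results, feature_ids):
    """Simplified AEE-style score: accumulate, over all experiment
    rounds, the F-score of each combination that includes the class."""
    return {f: sum(score for subset, score in results if f in subset)
            for f in feature_ids}

# Toy stand-in for a full TEES run: the "F-score" of a subset is just
# the sum of per-class weights (purely illustrative numbers).
toy_weights = {1: 0.3, 2: 0.2, 3: 0.05}
toy_eval = lambda subset: sum(toy_weights[f] for f in subset)

ranked = evaluate_combinations([1, 2, 3], toy_eval)
aee = accumulate_effect(ranked, [1, 2, 3])
```

A class that helps many strong combinations accumulates a high score, which is the intuition behind ranking feature classes by their AEE values rather than by any single run.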

During trigger detection, the results show that the 4th feature class performs best and the 6th performs worst. By calculating the AEEi value of each feature, Figures 5 and 6 plot the best and the worst feature classes in trigger detection.

The comparison between the two features shows how much better the 4th feature performs than the 6th; the scores 52.23 and 30.09 also correspond to the areas below the respective curves. The AEE1 and AEE2 plots of the best and worst features in trigger and edge detection are shown in Figures 5 and 6.

3.2. Modification of Features and New Experiment Results. Combinations of features for trigger detection show that the "content feature" in trigger detection is the most important one and the "chain feature" is the worst. This result shows that, in terms of identifying a target word token, the target itself provides more information than the neighboring tokens.

Taking the generation rule of the 4th feature class into consideration, the "content feature" class contains four features: "upper", "has", "dt", and "tt". Specifically, "upper" identifies the upper or lower case of letters, "has" addresses the existence of a digit or hyphen, "dt" records two continuous letters, and "tt" records three continuous letters. Since the content feature is vital for trigger detection, the feature generation rule could be strengthened along the same lines. Accordingly, a new "ft" feature is inserted into the "content feature" class, by which four consecutive letters in the word are considered. Moreover, a modification is performed on the 6th feature class by merging similar features related to dependency trees in both the 5th and the 6th feature classes. For simplicity, the updated features are denoted as 4′ and 6′.
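The generation rule for the "content feature" class, including the new "ft" four-letter feature, can be sketched roughly as follows. This is a simplified illustration: the feature names mirror the description above, but the exact TEES feature encoding is an assumption, and "dt"/"tt" are read here as consecutive 2- and 3-letter substrings.

```python
def content_features(token):
    """Character-level 'content' features for a candidate trigger token:
    'upper'  - capitalization of the first letter,
    'has_*'  - presence of a digit or hyphen,
    'dt'/'tt' - consecutive 2- and 3-letter substrings,
    'ft'      - the new consecutive 4-letter substrings."""
    feats = {}
    feats["upper"] = token[:1].isupper()
    feats["has_digit"] = any(ch.isdigit() for ch in token)
    feats["has_hyphen"] = "-" in token
    low = token.lower()
    for n, name in ((2, "dt"), (3, "tt"), (4, "ft")):
        for i in range(len(low) - n + 1):
            feats["%s_%s" % (name, low[i:i + n])] = True
    return feats

feats = content_features("Binding")
```

For the token "Binding" this yields, among others, `dt_bi`, `tt_bin`, and the new `ft_bind`, illustrating how the "ft" rule captures strictly more spelling context than "dt" and "tt" alone.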

Furthermore, we compare the best performance between classifiers whose trigger features comprise the original features, the 4′ features, the 6′ features, or the 4′&6′ features


[Figure 5: AEE1 and AEE2 plots of the best/worst feature in trigger detection. Four panels plot the C4j and C6j values against the experiment round for AEE1(4), AEE1(6), AEE2(4), and AEE2(6).]

(all including the original edge features). The best trigger feature combinations are 1&2&3&4&5, 1&2&3&4′&5, 1&2&3&4&5&6′, and 1&2&3&4′&5&6′, respectively, with F-scores of 51.34%, 52.21%, 51.93%, and 51.99%. The new combination results are listed in Supplementary Material Appendix D.

The complete combination experiment shows that the best combination of trigger features is 1&2&3&4′&5, with an F-score of 52.21%. Generally, the best feature combination covers the major part of the feature sets. An interesting phenomenon in the best combination is the absence of the 6th or 6′ feature class, which indicates that this feature class is redundant and can be dispensed with.

Similarly, for feature selection in edge detection, various experiments are carried out based on the best combination of trigger features. Here the best and the worst features (the 2nd and 7th features) are chosen to be modified in edge detection, and we denote the new feature classes as 2′ and 7′. With the fixed trigger features 1&2&3&4′&5, we test the classifier with the original edge features, the 2′ feature, the 7′ feature, and the 2′&7′ features. We obtain the best classifier in each combination experiment; these own the feature sets 1&2&4&5&6&7&8, 1&2′&4&5&6&7&8, 1&2&4&5&6&7′&8, and 1&2′&4&5&6&7′&8, respectively, with achieved F-scores of 52.16%, 52.37%, 52.47%, and 52.68%.

In the above experiments, we test the performance of the trigger feature classes by fixing the edge features; likewise, we test the edge features by fixing the trigger features. We observe that feature modifications in this phase are indeed capable of achieving improvement, where all of the best combinations perform better than the result of the best trigger. Finally, we use the best trigger features (1&2&3&4′&5) and the best edge features (1&2′&4&5&6&7′&8), and eventually the best combination of feature sets achieves the highest score of 53.27%, which is better than the best performance of 51.21% previously


[Figure 6: AEE1 and AEE2 plots of the best/worst feature in edge detection. Four panels plot the C2j and C7j values against the experiment round for AEE1(2), AEE1(7), AEE2(2), and AEE2(7).]

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with trigger features 1&2&3&4′&5 and edge features 1&2′&4&5&6&7′&8, where trigger-c = 250,000, edge-c = 28,000, and β = 0.7. The final result is listed in Table 5.

Comparing with the 24 participants in the GENIA task of BioNLP 2009 and the historical progress of Bjorne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussion and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contribution is presented, which yields the same finding and supports the same conclusion: the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we obtain a separate F-score for each class. Dividing the F-score by the feature size gives a rough estimate of the average contribution; this value partly reflects the contribution of a feature class relative to its size. The results in Table 7 show that the average contribution of the 4th feature is 0.003, the greatest score achieved by any individual feature class.

Second, we examine all of the double combinations involving the ith feature and observe that, in most cases, the best performance is reached when the ith feature is combined with the 4th feature. Even the worst performing double combination involving the 4th feature performs much better than the other settings.


Table 5: The best feature combination after choosing the best trigger features (1&2&3&4′&5) and best edge features (1&2′&4&5&6&7′&8).

Task 1                                         Recall   Precision   F-score
Strict evaluation mode                         46.23    62.84       53.27
Approximate span and recursive mode            49.69    67.48       57.24
Event decomposition in approximate span mode   51.19    73.21       60.25

Task 2                                         Recall   Precision   F-score
Strict evaluation mode                         44.90    61.11       51.77
Approximate span and recursive mode            48.41    65.81       55.79
Event decomposition in approximate span mode   50.52    73.15       59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2(Recall × Precision)/(Recall + Precision).
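The footnote formulas of Table 5 can be checked directly; for instance, the Task 1 strict-mode recall (46.23%) and precision (62.84%) reproduce the reported F-score of 53.27%:

```python
def f_score(recall, precision):
    # F1 = 2 * (R * P) / (R + P), as in the footnote of Table 5
    return 2 * recall * precision / (recall + precision)

# Task 1, strict evaluation mode (values from Table 5)
f1 = f_score(0.4623, 0.6284)  # -> 0.5327...
```

Since F1 is the harmonic mean of recall and precision, it always sits between the two and is pulled toward the smaller value, which is why the reported F-scores lie closer to recall than to precision.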

[Figure 7: F-score contour performance of participants in the GENIA task, plotted as recall against precision and evaluated under the approximate span and recursive mode. The legend distinguishes TEES in Bjorne 2009, TEES in Bjorne 2011, the TEES 1.01 baseline, the other participants in the GENIA task in BioNLP 2009, and our system, which is marked with a full triangular label.]

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, the best performance is reached when the ith feature is combined with the 4th feature. As before, even the worst performing combination involving the 4th feature is much better than the other combinations; see Table 7.

Finally, in yet another analysis, we examine the cases where the ith feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. Compared with the other numerical scores, the contribution value of the 4th feature is greater, which confirms the judgment. Furthermore, we can sort these features according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis provides substantially the same positive evaluation of the 4th trigger feature as the quantitative analysis of the AEE algorithm. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is clumsy to use routine analysis to examine all of the features, we expect the AEEi algorithm to be useful in more general circumstances.

4.2. Linguistic Analysis of Chosen Trigger Features and the Importance of Microlexical Features. The effectiveness of the 4th trigger feature motivated the feature class modification of inserting the "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which is similar to stems but carries more abundant information than stems. Besides, we can also analyze and ascertain the importance of other features, like POS information, through feature combinations. Here the edge features are fixed and only a smaller section of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" are the stem part and the part left after removing the stem found by the Porter Stemmer [29]. They are also generated at the microlexical level, similar to "dt" and "tt". The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.
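The "stem"/"nonstem" split can be illustrated with a toy suffix-stripping rule in the spirit of the Porter algorithm [29]. This is a deliberately simplified sketch, not the full stemmer: the suffix list here is hypothetical, whereas the paper's features come from the actual Porter Stemmer.

```python
def split_stem(word, suffixes=("ation", "ing", "ed", "es", "s")):
    """Split a token into (stem, nonstem): the nonstem is whatever a
    Porter-style suffix rule strips off the end of the word.
    The suffix list is a toy stand-in for the real Porter rules."""
    low = word.lower()
    for suf in suffixes:
        # Require a reasonably long remainder so short words survive intact.
        if low.endswith(suf) and len(low) > len(suf) + 2:
            return low[:-len(suf)], suf
    return low, ""

stem, nonstem = split_stem("binding")  # -> ("bind", "ing")
```

Both halves are microlexical, which is why Table 8 shows "nonstem" (a tiny class of 154 features) achieving the highest average contribution despite a low absolute F-score.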

4.3. Strategy Discussion in Feature Selection under Machine Learning. As a matter of fact, trigger features can be analyzed according to their generation rules, namely, the sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature. This is a state-of-the-art strategy in feature selection. TEES is a fine machine learning system which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research mainly stems from two reasons. The first is that the natural core idea of machine learning is simply to put enough features into the classifier as a black box; the second is that a classifier with a huge set of features usually performs better in accuracy and F-score. Previously, the features used in TEES were mostly chosen by trial and error: by adding some code to produce additional features and seeing how they impact the performance of the system, TEES always achieved a higher F-score, but without a clear direction. This strategy introduces useful but uncertain features and produces a large number of features. By introducing a feature selection and evaluation method, the importance of different feature classes can be ranked in a proper way, which helps to identify the important features to be modified effectively. Therefore, in terms of the current research regarding the development of the Turku Event Extraction System, we believe that our research

BioMed Research International 11

Table 6: Comparison of F-score performance in Task 1 with other systems (under the primary criterion: approximate span and recursive).

System        Bjorne et al.   Bjorne et al.   Bjorne et al.   TEES 1.01   Riedel et al.   Ours
              2009 [13]       2011 [15]       2012 [16]                   2011 [14]
F-score (%)   51.95           52.86           53.15           55.65       56.00           57.24

Table 7: The best feature combinations after choosing features dynamically in trigger and edge detection.

Feature in trigger    1: Sentence   2: Main   3: Linear order   4: Content   5: Attached edge   6: Chain
                      feature       feature   feature           feature      feature            feature
Feature size          18998         24944     73744             8573         100561             178345

(One-feature analysis)
F-score under one class   0      42.05   3.50     27.11   7.33     5.48
Average contribution      0      0.001   4e−5     0.003   7e−5     3e−5

(Double feature combination analysis)
Best performance involving ith feature    (+4) 45.39   (+4) 36.65   (+4) 22.29   (+1) 45.39   (+2) 28.27   (+4) 23.48
Worst performance involving ith feature   (+2) 0       (+1) 0       (+1) 0       (+3) 22.29   (+1) 0.91    (+1) 3.10

(Three-feature combination analysis)
Best performance involving ith feature    (+4, 5) 49.44   (+1, 4) 47.82   (+1, 4) 45.13   (+1, 5) 49.44   (+1, 4) 49.44   (+1, 4) 45.51
Worst performance involving ith feature   (+2, 3) 0.11    (+1, 3) 0.11    (+1, 2) 0.11    (+3, 6) 20.76   (+1, 3) 1.98    (+1, 3) 2.40

(Four-feature combination analysis)
Best performance involving ith feature    (+2, 4, 5) 50.71   (+1, 4, 5) 50.71   (+1, 4, 5) 49.74   (+1, 2, 5) 50.71   (+1, 2, 4) 50.71   (+1, 2, 4) 48.82
Worst performance involving ith feature   (+2, 3, 6) 5.77    (+1, 3, 6) 5.77    (+1, 2, 6) 5.77    (+3, 5, 6) 21.13   (+1, 2, 3) 7.02    (+1, 2, 3) 5.77

(Five-feature combination analysis)
Performance without ith feature   34.22   4.71   49.90   16.37   50.19   51.34

Table 8: Analysis of average contribution of lexical features.

     Feature   Feature class     F-score   Feature size   Average contribution
1    Nonstem   Main feature      6.43      154            0.04175
2    POS       Main feature      1.52      47             0.03234
3    dt        Content feature   20.13     1172           0.01718
4    tt        Content feature   27.63     7395           0.00374
5    Stem      Main feature      35.45     11016          0.00322
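The "average contribution" column in Table 8 is simply the one-class F-score divided by the class size, which is what makes the small microlexical classes stand out. Using the values from Table 8:

```python
# (F-score when the classifier uses only this feature class, feature size)
lexical = {
    "nonstem": (6.43, 154),
    "POS": (1.52, 47),
    "dt": (20.13, 1172),
    "tt": (27.63, 7395),
    "stem": (35.45, 11016),
}

# Average contribution = F-score / feature size, as in Table 8.
avg_contrib = {name: fscore / size for name, (fscore, size) in lexical.items()}
ranked = sorted(avg_contrib, key=avg_contrib.get, reverse=True)
```

The ranking reproduces the order of Table 8: "nonstem" contributes the most per feature even though "stem" achieves the highest absolute F-score.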

reported in this paper is helpful for further improving this system. We also believe that our strategy will serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by a grant received from the General Research Fund of the University Grant Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711), City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188), the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120), and the National Natural Science Foundation


of China (Grant no. 61202305). The authors would also like to acknowledge support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, p. S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77-94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the BioNLP 2009 Workshop (NAACL), pp. 1-9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1-6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10-18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP," in Proceedings of the BioNLP Shared Task Workshop, pp. 51-55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541-557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 Shared Task," BMC Bioinformatics, vol. 13, supplement 11, p. S4, 2012.

[17] J. Bjorne, S. V. Landeghem, S. Pyysalo et al., "Scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82-90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28-36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183-191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP Shared Task," in Association for Computational Linguistics (ACL '13), pp. 16-25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309-311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684-689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. Carlos Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306-313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. van Landeghem, T. Abeel, Y. Saeys, and Y. van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554-i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Research Article A Novel Feature Selection Strategy for ...downloads.hindawi.com/journals/bmri/2014/205239.pdf · task, bacteria biotope task, and bacteria interaction task [ ]. As

8 BioMed Research International

20 40 60

1

09

08

07

06

05

04

03

02

01

0

Experiment round20 40 60

1

09

08

07

06

05

04

03

02

01

0

1

09

08

07

06

05

04

03

02

01

0

1

09

08

07

06

05

04

03

02

01

0

Experiment round

20 40 60

Experiment round20 40 60

Experiment round

C4j v

alue

C4j v

alue

C6j v

alue

C6j v

alue

AEE1(4) AEE1(6)

AEE2(4) AEE2(6)

Figure 5 AEE1 and AEE2 plots of the bestworst feature in trigger detection

(all include original edge features) we get that the bestones for trigger features are 1amp2amp3amp4amp5 1amp2amp3amp41015840amp51amp2amp3amp4amp5amp61015840 and 1amp2amp3amp41015840amp5amp61015840 respectively whilethe 119865-score reaches 5134 5221 5193 and 5199 The newcombination result is listed in Supplementary MaterialAppendix D

The complete combination experiment shows that thebest combination of trigger feature is 1amp2amp3amp41015840amp5 withan 119865-score of 5221 Generally the best feature combinationcovers the major part of feature sets An interesting phe-nomenon in the best combination is the absence of the 6thor 61015840th feature class which indicates that this feature class isredundant and can be done without it

Similarly for feature selection in edge detection variousexperiments are carried out based on the best combination oftrigger features Here the best feature and the worst feature(2nd and 7th feature) are chosen to be modified in edgedetection and we denote the new feature classes as 21015840 and

71015840 With the fixed trigger feature 1amp2amp3amp41015840amp5 we test theclassifier with the original edge features 21015840 feature 71015840 featureand 21015840amp71015840 feature We obtain the best classifier in each com-bination experiment The best classifier in each combinationowns a feature set 1amp2amp4amp5amp6amp7amp8 1amp21015840amp4amp5amp6amp7amp81amp2amp4amp5amp6amp71015840amp8 or 1amp21015840amp4amp5amp6amp71015840amp8 separately andthe achieved 119865-score is 5216 5237 5247 or 5268 respec-tively

In the above experiments, we tested the performance of trigger feature classes by fixing the edge features; likewise, we tested edge features by fixing the trigger features. We observe that the feature modifications in this phase indeed achieve improvements, with all of the best combinations performing better than the result of the best trigger alone. Finally, combining the best trigger feature set (1&2&3&4′&5) with the best edge feature set (1&2′&4&5&6&7′&8), the resulting feature set achieves the highest F-score of 53.27%, which is better than the best performance of 51.21% previously

[Figure 6 omitted: four panels plotting the C2j and C7j values against the experiment round, labelled AEE1(2), AEE1(7), AEE2(2), and AEE2(7).]

Figure 6: AEE1 and AEE2 plots of the best/worst feature in edge detection.

reported for TEES 2.0. Therefore, it is concluded that the best classifier has a feature set with trigger feature 1&2&3&4′&5 and edge feature 1&2′&4&5&6&7′&8, where trigger-c = 250,000, edge-c = 28,000, and β = 0.7. The final result is listed in Table 5.

Comparing our system with the 24 participants in the GENIA task of BioNLP 2009 and with the historical progress of Björne's work, a contour performance plot is given in Figure 7.

As Figure 7 indicates, our system ranks first among the 24 systems. Comparisons of F-scores are listed in Table 6.

4. Discussions and Conclusions

4.1. An Analysis of the Contribution of Features. In this research, we designed a feature selection strategy, AEE, to evaluate the performance of individual feature classes and to identify the best performing feature sets. An important finding is that the greatest contribution comes from the content feature class in trigger detection. In this section, a routine analysis of the contributions is presented, which yields the same finding and supports the same conclusion: the content feature class contributes the most towards event recognition and extraction.

First, retaining one feature class at a time in the classifier, we obtain a separate F-score for each feature class. Dividing the F-score by the feature size gives a rough estimate of the average contribution. This value partly reflects the contribution of each feature class relative to its size. The results in Table 7 show that the average contribution of the 4th feature is 0.003, the greatest score achieved by any individual feature class.
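As a sketch, this average-contribution computation can be reproduced directly from the single-class F-scores and feature sizes reported in Table 7:

```python
# Average contribution of each trigger feature class: the F-score
# obtained with that class alone, divided by the class size (Table 7).
f_one_class  = [0.0, 42.05, 3.50, 27.11, 7.33, 5.48]      # classes 1..6
feature_size = [18998, 24944, 73744, 8573, 100561, 178345]

avg = [f / n for f, n in zip(f_one_class, feature_size)]
best_class = max(range(len(avg)), key=avg.__getitem__) + 1  # 1-based index
print(best_class, round(avg[best_class - 1], 3))  # -> 4 0.003
```

The 4th (content feature) class wins despite being the smallest class, which is exactly what makes its per-feature contribution stand out.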

Second, we examine all of the double combinations involving the ith feature and observe that, in most cases, the best performance is reached when the ith feature is combined with the 4th feature. Even the worst performing double combinations involving the 4th feature perform much better than the other settings.


Table 5: The best feature combination after choosing the best trigger feature (1&2&3&4′&5) and the best edge feature (1&2′&4&5&6&7′&8).

Task 1                                             Recall   Precision   F-score
Strict evaluation mode                              46.23     62.84      53.27
Approximate span and recursive mode                 49.69     67.48      57.24
Event decomposition in the approximate span mode    51.19     73.21      60.25

Task 2                                             Recall   Precision   F-score
Strict evaluation mode                              44.90     61.11      51.77
Approximate span and recursive mode                 48.41     65.81      55.79
Event decomposition in the approximate span mode    50.52     73.15      59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2(Recall × Precision)/(Recall + Precision).
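For reference, the footnote formulas can be written as a small helper; the check below recomputes the Task 1 strict-mode F-score from the recall and precision figures in Table 5:

```python
# Recall, precision, and F-score from raw counts, as defined above.
def metrics(tp, fp, fn):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f = 2 * recall * precision / (recall + precision)
    return recall, precision, f

# F-score directly from recall and precision given as percentages.
def f_score(recall_pct, precision_pct):
    return 2 * recall_pct * precision_pct / (recall_pct + precision_pct)

# Task 1, strict evaluation mode in Table 5: R = 46.23, P = 62.84
print(round(f_score(46.23, 62.84), 2))  # -> 53.27
```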

[Figure 7 omitted: precision-recall contour plot. Legend: TEES in Björne 2011; TEES in Björne 2009; other participants in the GENIA task in BioNLP 2009; our system; TEES 1.01 baseline.]

Figure 7: F-score contour performance of participants in the GENIA task. This F-score is evaluated under the approximate span and recursive mode. Our current system is marked with a filled triangle.

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, the best performance is reached when the ith feature is combined with the 4th feature. As before, even the worst performing combinations involving the 4th feature are much better than the other combinations; see Table 7.

Finally, in yet another analysis, we examine the cases where the ith feature is left out. The results show that the combination without the 4th class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. In terms of numerical scores, the contribution value of the 4th feature is greater than that of the others, which confirms the judgment. Furthermore, we can sort the features according to their contribution values. These results further justify our decision to modify the 4th feature and thereby enhance system performance.

It is interesting that the routine analysis yields a substantially positive evaluation of the 4th trigger feature, which is confirmed by the quantitative analysis of the AEE algorithm. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is cumbersome to use routine analysis to examine all of the features, we expect the AEE algorithm to be useful in more general circumstances.

4.2. Linguistic Analysis of Chosen Trigger Features and the Importance of Microlexical Features. The effectiveness of the 4th trigger feature motivated the feature class modification by inserting "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which is similar to stems but carries more abundant information than stems. Besides, we can also analyze and ascertain the importance of other features, such as POS information, through feature combinations. Here the edge features are fixed and only a smaller section of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material, Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" denote the stem part of a token and the part left after removing the stem, obtained with the Porter Stemmer [29]. Both are generated at the microlexical level, similar to "dt" and "tt". The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.
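To illustrate, the stem/nonstem split can be sketched as below; a trivial suffix-stripping rule stands in for the full Porter algorithm [29], so `toy_stem` and its suffix list are assumptions for illustration only:

```python
# Minimal sketch of the "stem"/"nonstem" token split described above.
# toy_stem is a trivial suffix stripper, NOT the real Porter stemmer.
def toy_stem(token):
    for suffix in ("ational", "ation", "ing", "ion", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def stem_features(token):
    lowered = token.lower()
    stem = toy_stem(lowered)
    nonstem = lowered[len(stem):]        # remainder after the stem
    return {"stem": stem, "nonstem": nonstem}

print(stem_features("phosphorylation"))
# -> {'stem': 'phosphoryl', 'nonstem': 'ation'}
```

Both parts are emitted as separate microlexical features, in the same spirit as the "dt" and "tt" features.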

4.3. Strategy Discussion in Feature Selection under Machine Learning. Trigger features can be analyzed according to their generation rules, namely, the sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature. This is a state-of-the-art strategy in feature selection. TEES is a fine machine-learning-based system which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research mainly stems from two reasons: first, the natural core idea of machine learning is simply to feed enough features into the classifier as a black box; second, a classifier with a huge feature set usually performs better in accuracy and F-score. Previously, the features used in TEES were mostly chosen by trial and error: by adding code to produce additional features and observing how they affect the performance of the system, TEES always achieves a higher F-score, but without a clear direction. This strategy introduces useful but uncertain features and produces a very large number of features. By introducing a feature selection and evaluation method, the importance of different feature classes can be ranked properly, which helps to identify the important features to be modified effectively. Therefore, with regard to the current research on the development of the Turku Event Extraction System, we believe that our research


Table 6: Comparison of F-score performance in Task 1 with other systems (under the primary criterion, approximate span and recursive).

System        Björne et al. 2009 [13]   Björne et al. 2011 [15]   Björne et al. 2012 [16]   TEES 1.01   Riedel et al. 2011 [14]   Ours
F-score (%)           51.95                     52.86                     53.15               55.65             56.00             57.24

Table 7: The best feature combination after choosing features dynamically in trigger and edge detection.

Feature classes in trigger detection: 1: Sentence feature, 2: Main feature, 3: Linear order feature, 4: Content feature, 5: Attached edge feature, 6: Chain feature.

                                    1        2        3        4        5        6
Feature size                     18998    24944    73744     8573   100561   178345

(Single-feature analysis)
F-score under one class              0    42.05     3.50    27.11     7.33     5.48
Average contribution                 0    0.001     4e-5    0.003     7e-5     3e-5

(Double-feature combination analysis)
Best performance
  involving ith feature           (+4)     (+4)     (+4)     (+1)     (+2)     (+4)
                                 45.39    36.65    22.29    45.39    28.27    23.48
Worst performance
  involving ith feature           (+2)     (+1)     (+1)     (+3)     (+1)     (+1)
                                     0        0        0    22.29     0.91     3.10

(Three-feature combination analysis)
Best performance
  involving ith feature         (+4,5)   (+1,4)   (+1,4)   (+1,5)   (+1,4)   (+1,4)
                                 49.44    47.82    45.13    49.44    49.44    45.51
Worst performance
  involving ith feature         (+2,3)   (+1,3)   (+1,2)   (+3,6)   (+1,3)   (+1,3)
                                  0.11     0.11     0.11    20.76     1.98     2.40

(Four-feature combination analysis)
Best performance
  involving ith feature       (+2,4,5) (+1,4,5) (+1,4,5) (+1,2,5) (+1,2,4) (+1,2,4)
                                 50.71    50.71    49.74    50.71    50.71    48.82
Worst performance
  involving ith feature       (+2,3,6) (+1,3,6) (+1,2,6) (+3,5,6) (+1,2,3) (+1,2,3)
                                  5.77     5.77     5.77    21.13     7.02     5.77

(Five-feature combination analysis)
Performance without
  ith feature                    34.22     4.71    49.90    16.37    50.19    51.34

Table 8: Analysis of average contribution of lexical features.

    Feature    Feature class     F-score   Feature size   Average contribution
1   Nonstem    Main feature        6.43         154           0.04175
2   POS        Main feature        1.52          47           0.03234
3   dt         Content feature    20.13        1172           0.01718
4   tt         Content feature    27.63        7395           0.00374
5   Stem       Main feature       35.45       11016           0.00322

reported in this paper is helpful for further improving this system. We also believe that our strategy will serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by grants received from the General Research Fund of the University Grants Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711), City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188), the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120), and the National Natural Science Foundation of China (Grant no. 61202305). The authors would also like to acknowledge support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, p. S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77-94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, pp. 1-9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1-6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Björne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10-18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 51-55, 2011.

[15] J. Björne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541-557, 2011.

[16] J. Björne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 shared task," BMC Bioinformatics, vol. 13, supplement 11, p. S4, 2012.

[17] J. Björne, S. Van Landeghem, S. Pyysalo et al., "Scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82-90, 2012.

[18] J. Björne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28-36, 2010.

[19] J. Björne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183-191, 2011.

[20] J. Björne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP Shared Task," in Association for Computational Linguistics (ACL '13), pp. 16-25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309-311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684-689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. Carlos Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306-313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. Van Landeghem, T. Abeel, Y. Saeys, and Y. Van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554-i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.


[25] L Carlos Molina L Belanche and A Nebot ldquoFeature selectionalgorithms a survey and experimental evaluationrdquo in Proceed-ings of the IEEE International Conference onDataMining (ICDMrsquo02) pp 306ndash313 December 2002

[26] C Yang W Zhang J Zou S Hu and J Qiu ldquoFeature selectionin decision systems a mean-variance approachrdquo MathematicalProblems in Engineering vol 2013 Article ID 268063 8 pages2013

[27] S van Landeghem T Abeel Y Saeys and Y van de PeerldquoDiscriminative and informative features for biomolecular textmining with ensemble feature selectionrdquo Bioinformatics vol 26no 18 Article ID btq381 pp i554ndashi560 2010

[28] DM Powers ldquoEvaluation fromprecision recall andF-measureto ROC informedness markedness amp correlationrdquo Journal ofMachine Learning Technologies vol 2 no 1 pp 37ndash63 2011

[29] M F Porter ldquoAn algorithm for suffix strippingrdquo Program vol14 no 3 pp 130ndash137 1980

10 BioMed Research International

Table 5. The best feature combination after choosing the best trigger feature (1&2&3&4′&5) and the best edge feature (1&2′&4&5&6&7′&8).

Task 1                                        Recall   Precision   F-score
Strict evaluation mode                         46.23       62.84     53.27
Approximate span and recursive mode            49.69       67.48     57.24
Event decomposition in the
approximate span mode                          51.19       73.21     60.25

Task 2                                        Recall   Precision   F-score
Strict evaluation mode                         44.90       61.11     51.77
Approximate span and recursive mode            48.41       65.81     55.79
Event decomposition in the
approximate span mode                          50.52       73.15     59.76

Recall = TP/(TP + FN), Precision = TP/(TP + FP), and F-score = 2 × (Recall × Precision)/(Recall + Precision).
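The metric definitions beneath Table 5 can be made concrete with a small helper. This is a minimal sketch; the function names (`precision_recall_f`, `f_from_pr`) are ours for illustration, not part of TEES:

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-score from raw counts,
    using the formulas given under Table 5."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * recall * precision / (recall + precision)
    return precision, recall, f_score

def f_from_pr(precision, recall):
    """F-score directly from precision/recall values (e.g. percentages)."""
    return 2 * recall * precision / (recall + precision)

# Strict evaluation mode, Task 1: recall 46.23%, precision 62.84%
print(round(f_from_pr(62.84, 46.23), 2))  # 53.27
```

The strict-mode Task 1 row of Table 5 reproduces exactly: an F-score of 53.27 from the reported precision and recall.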

Figure 7. F-score contour performance of participants in the GENIA task, with precision and recall on the axes, evaluated under the approximate span and recursive mode. The plot compares TEES in Bjorne 2009, TEES in Bjorne 2011, the TEES 1.01 baseline, the other participants in the GENIA task in BioNLP 2009, and our current system, which is marked with a filled triangle.

Third, a similar phenomenon occurs in the three-feature-combination and four-feature-combination experiments. In all cases, the best performance is reached when the ith feature is combined with the 4th feature. As before, even the worst performance of a combination involving the 4th feature is much better than that of the other combinations; see Table 7.

Finally, in yet another analysis, we observe the cases where the ith feature is left out. The results show that the combination without the 4th feature class performs worst, which in turn confirms the importance of the 4th feature.

Through the routine analysis above, there is ample evidence in support of the importance of the 4th feature in trigger detection. In terms of numerical scores, the contribution value of the 4th feature is greater than that of the others, which confirms this judgment. Furthermore, the features can be sorted according to their contribution values. These results further support our decision to modify the 4th feature and thereby enhance system performance.

It is noteworthy that the routine analysis gives a substantially positive evaluation of the 4th trigger feature, in agreement with the results of the quantitative analysis under the AEE algorithm. This consistent tendency of feature importance in turn supports the reliability of the AEE algorithm. Since it is cumbersome to apply routine analysis to all of the features, we expect the AEE algorithm to be useful in more general circumstances.
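The "average contribution" figures reported in Tables 7 and 8 are consistent with dividing a feature class's stand-alone F-score by the number of features in the class. The sketch below is our own simplified reading of that quantity, not the authors' exact AEE code:

```python
def average_contribution(f_score, feature_size):
    """Per-feature share of the F-score achieved by a feature class alone
    (a simplified reading of the AEE accumulated-effect idea)."""
    return f_score / feature_size

# "Nonstem" row of Table 8: F-score 6.43% spread over 154 features
print(round(average_contribution(6.43, 154), 5))  # 0.04175
```

Under this reading, a small but discriminative class such as "Nonstem" ranks above a large class with a higher raw F-score, which matches the ordering in Table 8.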

4.2. Linguistic Analysis of Chosen Trigger Features and Importance of the Microlexical Feature. The effectiveness of the 4th trigger feature motivated the feature class modification of inserting "ft" features. One should also note that all the features in the 4th feature class are related to the spelling of word tokens, which makes them similar to stems but richer in information. Besides, the importance of other features, such as POS information, can also be analyzed and ascertained through feature combinations. Here the edge features are fixed, and only a smaller subset of the trigger features is tested through combination experiments.

The full combinations are listed in Supplementary Material Appendix E, and Table 8 lists the top 5 important features. Here "stem" and "nonstem" denote the stem and the residue left over after applying the Porter stemmer [29]; both are generated at the microlexical level, similar to "dt" and "tt". The results in this table show that the lexical feature generation rule affects token information extraction in a decisive way.
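As an illustration of the "stem"/"nonstem" split described above, here is a hedged sketch. The function `split_token` is our own stand-in, and the example stem is what a Porter-style stemmer would typically produce; this is not the implementation used in the paper:

```python
def split_token(token, stem):
    """Split a token into (stem, nonstem), where nonstem is whatever
    suffix the stemmer stripped off. Falls back to an empty remainder
    when the stem is not a prefix of the original token."""
    if token.startswith(stem):
        return stem, token[len(stem):]
    return stem, ""

# e.g. a Porter-style stemmer maps "binding" to "bind"
print(split_token("binding", "bind"))  # ('bind', 'ing')
```

Both halves carry signal: the stem generalizes across inflections, while the nonstem residue ("-ing", "-ion", ...) is exactly the kind of microlexical spelling cue that Table 8 ranks highly.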

4.3. Discussion of the Feature Selection Strategy under Machine Learning. As a matter of fact, trigger features can be analyzed according to their generation rules, namely, sentence feature, main feature, linear order feature, content feature, attached edge feature, and chain feature, and such class-wise analysis is a state-of-the-art strategy in feature selection. TEES is a fine machine-learning-based system which, however, does not perform intensive feature selection. The absence of a feature selection strategy in previous research stems mainly from two reasons: first, the core idea of machine learning is simply to feed enough features into the classifier as a black box; second, a classifier with a huge feature set usually scores better in accuracy and F-score. Previously, the features used in TEES were mostly chosen by trial and error. By adding code to produce additional features and observing the impact on system performance, TEES always achieved a higher F-score, but without a clear direction. This strategy introduces useful but uncertain features and produces a very large feature set. By introducing a feature selection and evaluation method, the importance of different feature classes can be ranked in a proper way, which helps to identify the important features to be modified effectively. Therefore, in terms of the current research regarding the development of the Turku Event Extraction System, we believe that our research


Table 6. Comparison of F-score performance in Task 1 with other systems (under the primary criterion: approximate span and recursive).

              Bjorne et al.   Bjorne et al.   Bjorne et al.   TEES 1.01   Riedel et al.   Ours
              2009 [13]       2011 [15]       2012 [16]                   2011 [14]
F-score (%)   51.95           52.86           53.15           55.65       56.00           57.24

Table 7. The best feature combination after choosing features dynamically in trigger and edge detection.

Feature class in trigger     1: Sentence   2: Main    3: Linear   4: Content   5: Attached   6: Chain
                             feature       feature    order       feature      edge feature  feature
Feature size                 18998         24944      73744       8573         100561        178345

(Single-feature analysis)
F-score under one class      0             42.05      3.50        27.11        7.33          5.48
Average contribution         0             0.001      4e-5        0.003        7e-5          3e-5

(Double-feature combination analysis)
Best combination             (+4)          (+4)       (+4)        (+1)         (+2)          (+4)
  F-score                    45.39         36.65      22.29       45.39        28.27         23.48
Worst combination            (+2)          (+1)       (+1)        (+3)         (+1)          (+1)
  F-score                    0             0          0           22.29        0.91          3.10

(Three-feature combination analysis)
Best combination             (+4,5)        (+1,4)     (+1,4)      (+1,5)       (+1,4)        (+1,4)
  F-score                    49.44         47.82      45.13       49.44        49.44         45.51
Worst combination            (+2,3)        (+1,3)     (+1,2)      (+3,6)       (+1,3)        (+1,3)
  F-score                    0.11          0.11       0.11        20.76        1.98          2.40

(Four-feature combination analysis)
Best combination             (+2,4,5)      (+1,4,5)   (+1,4,5)    (+1,2,5)     (+1,2,4)      (+1,2,4)
  F-score                    50.71         50.71      49.74       50.71        50.71         48.82
Worst combination            (+2,3,6)      (+1,3,6)   (+1,2,6)    (+3,5,6)     (+1,2,3)      (+1,2,3)
  F-score                    5.77          5.77       5.77        21.13        7.02          5.77

(Five-feature combination analysis)
F-score without ith feature  34.22         47.10      49.90       16.37        50.19         51.34

Table 8. Analysis of the average contribution of lexical features.

     Feature   Feature class     F-score   Feature size   Average contribution
1    Nonstem   Main feature         6.43            154                0.04175
2    POS       Main feature         1.52             47                0.03234
3    dt        Content feature     20.13           1172                0.01718
4    tt        Content feature     27.63           7395                0.00374
5    Stem      Main feature        35.45          11016                0.00322

reported in this paper is helpful for further improving the system. We also believe that our strategy can serve as an example of feature selection for achieving enhanced performance in machine learning systems in general.
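The combination experiments behind Table 7 amount to a greedy forward search over feature classes: starting from nothing, repeatedly add the class that most improves the score. A self-contained sketch, where the scoring function is a toy stand-in for retraining the classifier on a candidate feature set (the toy score values loosely echo Table 7 and are not the authors' exact procedure):

```python
def greedy_forward_selection(classes, score, rounds=None):
    """Greedily add the feature class that most improves the score,
    mirroring the 1-, 2-, 3-, ... class combination experiments."""
    chosen, best_score = [], float("-inf")
    remaining = list(classes)
    rounds = rounds or len(remaining)
    for _ in range(rounds):
        candidates = [(score(chosen + [c]), c) for c in remaining]
        cand_score, cand = max(candidates)
        if cand_score <= best_score:
            break  # no further improvement; stop early
        chosen.append(cand)
        remaining.remove(cand)
        best_score = cand_score
    return chosen, best_score

# Toy scores loosely echoing Table 7: class 2 (main feature) is the best
# single class, and adding class 4 (content feature) helps most afterwards.
toy = {frozenset([4]): 27.11, frozenset([2]): 42.05,
       frozenset([2, 4]): 45.39, frozenset([1, 2, 4]): 50.71}
score = lambda s: toy.get(frozenset(s), 0.0)
print(greedy_forward_selection([1, 2, 3, 4, 5, 6], score))  # ([2, 4, 1], 50.71)
```

In the real setting each call to `score` would mean retraining and evaluating the event extractor, so the greedy search trades exhaustiveness for a tractable number of training runs.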

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Research described in this paper was supported in part by a grant received from the General Research Fund of the University Grants Council of the Hong Kong Special Administrative Region, China (Project no. CityU 142711); City University of Hong Kong (Project nos. 6354005, 7004091, 9610283, 7002793, 9610226, 9041694, and 9610188); the Fundamental Research Funds for the Central Universities of China (Project no. 2013PY120); and the National Natural Science Foundation


of China (Grant no. 61202305). The authors would also like to acknowledge the support received from the Dialogue Systems Group, Department of Chinese, Translation and Linguistics, and the Halliday Centre for Intelligent Applications of Language Studies, City University of Hong Kong. The authors wish to express their gratitude to the anonymous reviewers for their constructive and insightful suggestions.

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, article S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77–94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, pp. 1–9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1–6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10–18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 51–55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541–557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 Shared Task," BMC Bioinformatics, vol. 13, supplement 11, article S4, 2012.

[17] J. Bjorne, S. V. Landeghem, S. Pyysalo et al., "PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82–90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28–36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183–191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP Shared Task," in Association for Computational Linguistics (ACL '13), pp. 16–25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309–311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684–689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. C. Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306–313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. van Landeghem, T. Abeel, Y. Saeys, and Y. van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554–i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130–137, 1980.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article A Novel Feature Selection Strategy for ...downloads.hindawi.com/journals/bmri/2014/205239.pdf · task, bacteria biotope task, and bacteria interaction task [ ]. As

BioMed Research International 11

Table 6 Comparison of 119865-score performance in Task 1 with other systems (under primary criterion approximate span and recursive)

Bjorne et al 2009 [13] Bjorne et al 2011 [15] Bjorne et al 2012 [16] TEES101 Riedel et al 2011 [14] Ours119865-score () 5195 5286 5315 5565 5600 5724

Table 7 The best feature combination after choosing features dynamically in trigger and edge detection

1 2 3 4 5 6Feature in trigger Sentence feature Main feature Linear order Content feature Attached edge Chain feature

Feature FeatureFeature size 18998 24944 73744 8573 100561 178345(Merely one feature analysis)119865-score under one class 0 4205 350 2711 733 548Average contribution 0 0001 4119890 minus 5 0003 7119890 minus 5 3119890 minus 5

(Double feature combination analysis)Best performance (+4) (+4) (+4) (+1) (+2) (+4)Involving ith feature 4539 3665 2229 4539 2827 2348Worst performance (+2) (+1) (+1) (+3) (+1) (+1)Involving ith feature 0 0 0 2229 091 310

(Three-feature combination analysis)Best performance (+4 5) (+1 4) (+1 4) (+1 5) (+1 4) (+1 4)Involving ith feature 4944 4782 4513 4944 4944 4551Worst performance (+2 3) (+1 3) (+1 2) (+3 6) (+1 3) (+1 3)Involving ith feature 011 011 011 2076 198 240

(Four-feature combination analysis)Best performance (+2 4 5) (+1 4 5) (+1 4 5) (+1 2 5) (+1 2 4) (+1 2 4)Involving ith feature 5071 5071 4974 5071 5071 4882Worst performance (+2 3 6) (+1 3 6) (+1 2 6) (+3 5 6) (+1 2 3) (+1 2 3)Involving ith feature 577 577 577 2113 702 577

(Five-feature combination analysis)PerformanceWithout ith feature 3422 471 4990 1637 5019 5134

Table 8 Analysis of average contribution of lexical features

Feature Feature class 119865-score Feature size Average contribution1 Nonstem Main feature 643 154 0041752 POS Main feature 152 47 0032343 dt Content feature 2013 1172 0017184 tt Content feature 2763 7395 0003745 Stem Main feature 3545 11016 000322

reported in this paper is helpful for further improving thissystem We also believe that our strategy will also serve as anexample for feature selection in order to achieve enhancedperformance for machine learning systems in general

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

Research described in this paper was supported in part bygrant received from theGeneral Research Fundof theUniver-sity Grant Council of the Hong Kong Special AdministrativeRegion China (Project no CityU 142711) City Universityof Hong Kong (Project nos 6354005 7004091 96102837002793 9610226 9041694 and 9610188) the FundamentalResearch Funds for the Central Universities of China (Projectno 2013PY120) and theNationalNatural Science Foundation

12 BioMed Research International

of China (Grant no 61202305) The authors would also liketo acknowledge supports received from theDialogue SystemsGroup Department of Chinese Translation and Linguisticsand the Halliday Center for Intelligent Applications of Lan-guage Studies City University of Hong Kong The authorswish to express their gratitude to the anonymous reviewersfor their constructive and insightful suggestions

References

[1] R Zhang M J Cairelli M Fiszman et al ldquoUsing semanticpredications to uncover drugmdashdrug interactions in clinicaldatardquo Journal of Biomedical Informatics 2014

[2] W Liu K Miao G Li K Chang J Zheng and J C RajapakseldquoExtracting rate changes in transcriptional regulation fromMEDLINE abstractsrdquo BMCBioinformatics vol 15 2 p s4 2014

[3] B Shen ldquoTranslational biomedical informatics in the cloudpresent and futurerdquo BioMed Research International vol 2013Article ID 658925 8 pages 2013

[4] R Patra and S K Saha ldquoA kernel-based approach for biomedi-cal named entity recognitionrdquoThe Scientific World Journal vol2013 Article ID 950796 7 pages 2013

[5] P Jindal Information Extraction for Clinical Narratives Uni-versity of Illinois at Urbana-Champaign Champaign Ill USA2014

[6] T Hao A Rusanov M R Boland and C Weng ldquoClusteringclinical trials with similar eligibility criteria featuresrdquo Journal ofBiomedical Informatics 2014

[7] G K Mazandu and N J Mulder ldquoInformation content-basedgene ontology semantic similarity approaches toward a unifiedframework theoryrdquo BioMed Research International vol 2013Article ID 292063 11 pages 2013

[8] J Xia X Zhang D Yuan L Chen J Webster and A C FangldquoGene prioritization of resistant rice gene against Xanthomasoryzae pv oryzae by using text mining technologiesrdquo BioMedResearch International vol 2013 Article ID 853043 9 pages2013

[9] R Faiz M Amami and A Elkhlifi ldquoSemantic event extractionfrom biological texts using a kernel-basedmethodrdquo inAdvancesin Knowledge Discovery and Management pp 77ndash94 SpringerBerlin Germany 2014

[10] J D Kim T Ohta S Pyysalo Y Kano and J Tsujii ldquoOverviewof BioNLP rsquo09 shared task on event extractionrdquo in Proceedings ofNatural Language Processing in Biomedicine (BioNLP) NAACL2009 Workshop pp 1ndash9 2009

[11] J D Kim S Pyysalo T Dhta R Bossy N Nguyen and J TsujiildquoOverview of BioNLP shared task 2011rdquo in Proceedings of theBioNLP Shared Task 2011 Workshop pp 1ndash6 Portlan Ore USA2011

[12] J D Kim T Ohta and J Tsujii ldquoCorpus annotation for miningbiomedical events from literaturerdquo BMC Bioinformatics vol 9article 10 2008

[13] J Bjorne J Heimonen F Ginter A Airola T Pahikkalaand T Salakoski ldquoExtracting complex biological events withrich graph-based feature setsrdquo in Proceedings of the BioNLPWorkshop Companion Volume for Shared Task pp 10ndash18 2009

[14] S Riedel D Mcclosky M Surdeanu A Mccallum and C DManning ldquoModel combination for event extraction in BioNLPrdquoin Proceedings of the BioNLP Shared Task Workshop pp 51ndash552011

[15] J Bjorne J Heimonen F Ginter A Airola T Pahikkala and TSalakoski ldquoExtracting contextualized complex biological eventswith rich graph-based feature setsrdquo Computational Intelligencevol 27 no 4 pp 541ndash557 2011

[16] J Bjorne G Filip and T Salakoski ldquoUniversity of Turku in theBioNLP rsquo11 Shared Taskrdquo BMC Bioinformatics vol 13 11 p s42012

[17] J Bjorne S V Landeghem S Pyysalo et al ldquoScale eventextraction for post translational modifications epigenetics andprotein structural relationsrdquo Proceedings of BioNLP pp 82ndash902012

[18] J Bjorne F Ginter S Pyysalo J Tsujii and T SalakoskildquoScaling up biomedical event extraction to the entire PubMedrdquoin Proceedings of the 2010 Workshop on Biomedical NaturalLanguage Processing (ACL rsquo10) pp 28ndash36 2010

[19] J Bjorne and T Salakoski ldquoGeneralizing biomedical eventextractionrdquo in Proceedings of the BioNLP shared TaskWorkshoppp 183ndash191 2011

[20] J Bjorne and T Salakoski ldquoTEES 21 automated annotationscheme learning in the BioNLP Shared Taskrdquo inAssociation ForComputational Linguistics (ACL rsquo13) pp 16ndash25 2013

[21] J Zou C Lippert D Heckerman M Aryee and J ListgartenldquoEpigenome-wide association studies without the need for cell-type compositionrdquo inNature Methods vol 11 pp 309ndash311 2014

[22] S Fong K Lan and R Wong ldquoClassifying human voicesby using hybrid SFX time-series preprocessing and ensemblefeature selectionrdquo BioMed Research International vol 2013Article ID 720834 27 pages 2013

[23] DWang L Yang Z Fu and J Xia ldquoPrediction of thermophilicproteinwith pseudo amino acid composition an approach fromcombined feature selection and reductionrdquo Protein and PeptideLetters vol 18 no 7 pp 684ndash689 2011

[24] S O Cao and J H Manton ldquoUnsupervised optimal discrim-inant vector based feature selection methodrdquo MathematicalProblems in Engineering vol 2013 Article ID 396780 7 pages2013

[25] L Carlos Molina L Belanche and A Nebot ldquoFeature selectionalgorithms a survey and experimental evaluationrdquo in Proceed-ings of the IEEE International Conference onDataMining (ICDMrsquo02) pp 306ndash313 December 2002

[26] C Yang W Zhang J Zou S Hu and J Qiu ldquoFeature selectionin decision systems a mean-variance approachrdquo MathematicalProblems in Engineering vol 2013 Article ID 268063 8 pages2013

[27] S van Landeghem T Abeel Y Saeys and Y van de PeerldquoDiscriminative and informative features for biomolecular textmining with ensemble feature selectionrdquo Bioinformatics vol 26no 18 Article ID btq381 pp i554ndashi560 2010

[28] DM Powers ldquoEvaluation fromprecision recall andF-measureto ROC informedness markedness amp correlationrdquo Journal ofMachine Learning Technologies vol 2 no 1 pp 37ndash63 2011

[29] M F Porter ldquoAn algorithm for suffix strippingrdquo Program vol14 no 3 pp 130ndash137 1980

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 12: Research Article A Novel Feature Selection Strategy for ...downloads.hindawi.com/journals/bmri/2014/205239.pdf · task, bacteria biotope task, and bacteria interaction task [ ]. As

12 BioMed Research International

of China (Grant no 61202305) The authors would also liketo acknowledge supports received from theDialogue SystemsGroup Department of Chinese Translation and Linguisticsand the Halliday Center for Intelligent Applications of Lan-guage Studies City University of Hong Kong The authorswish to express their gratitude to the anonymous reviewersfor their constructive and insightful suggestions

References

[1] R. Zhang, M. J. Cairelli, M. Fiszman et al., "Using semantic predications to uncover drug-drug interactions in clinical data," Journal of Biomedical Informatics, 2014.

[2] W. Liu, K. Miao, G. Li, K. Chang, J. Zheng, and J. C. Rajapakse, "Extracting rate changes in transcriptional regulation from MEDLINE abstracts," BMC Bioinformatics, vol. 15, supplement 2, article S4, 2014.

[3] B. Shen, "Translational biomedical informatics in the cloud: present and future," BioMed Research International, vol. 2013, Article ID 658925, 8 pages, 2013.

[4] R. Patra and S. K. Saha, "A kernel-based approach for biomedical named entity recognition," The Scientific World Journal, vol. 2013, Article ID 950796, 7 pages, 2013.

[5] P. Jindal, Information Extraction for Clinical Narratives, University of Illinois at Urbana-Champaign, Champaign, Ill, USA, 2014.

[6] T. Hao, A. Rusanov, M. R. Boland, and C. Weng, "Clustering clinical trials with similar eligibility criteria features," Journal of Biomedical Informatics, 2014.

[7] G. K. Mazandu and N. J. Mulder, "Information content-based gene ontology semantic similarity approaches: toward a unified framework theory," BioMed Research International, vol. 2013, Article ID 292063, 11 pages, 2013.

[8] J. Xia, X. Zhang, D. Yuan, L. Chen, J. Webster, and A. C. Fang, "Gene prioritization of resistant rice gene against Xanthomonas oryzae pv. oryzae by using text mining technologies," BioMed Research International, vol. 2013, Article ID 853043, 9 pages, 2013.

[9] R. Faiz, M. Amami, and A. Elkhlifi, "Semantic event extraction from biological texts using a kernel-based method," in Advances in Knowledge Discovery and Management, pp. 77-94, Springer, Berlin, Germany, 2014.

[10] J. D. Kim, T. Ohta, S. Pyysalo, Y. Kano, and J. Tsujii, "Overview of BioNLP '09 shared task on event extraction," in Proceedings of the Natural Language Processing in Biomedicine (BioNLP) NAACL 2009 Workshop, pp. 1-9, 2009.

[11] J. D. Kim, S. Pyysalo, T. Ohta, R. Bossy, N. Nguyen, and J. Tsujii, "Overview of BioNLP shared task 2011," in Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 1-6, Portland, Ore, USA, 2011.

[12] J. D. Kim, T. Ohta, and J. Tsujii, "Corpus annotation for mining biomedical events from literature," BMC Bioinformatics, vol. 9, article 10, 2008.

[13] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting complex biological events with rich graph-based feature sets," in Proceedings of the BioNLP Workshop Companion Volume for Shared Task, pp. 10-18, 2009.

[14] S. Riedel, D. McClosky, M. Surdeanu, A. McCallum, and C. D. Manning, "Model combination for event extraction in BioNLP," in Proceedings of the BioNLP Shared Task Workshop, pp. 51-55, 2011.

[15] J. Bjorne, J. Heimonen, F. Ginter, A. Airola, T. Pahikkala, and T. Salakoski, "Extracting contextualized complex biological events with rich graph-based feature sets," Computational Intelligence, vol. 27, no. 4, pp. 541-557, 2011.

[16] J. Bjorne, F. Ginter, and T. Salakoski, "University of Turku in the BioNLP '11 shared task," BMC Bioinformatics, vol. 13, supplement 11, article S4, 2012.

[17] J. Bjorne, S. Van Landeghem, S. Pyysalo et al., "PubMed-scale event extraction for post-translational modifications, epigenetics and protein structural relations," in Proceedings of BioNLP, pp. 82-90, 2012.

[18] J. Bjorne, F. Ginter, S. Pyysalo, J. Tsujii, and T. Salakoski, "Scaling up biomedical event extraction to the entire PubMed," in Proceedings of the 2010 Workshop on Biomedical Natural Language Processing (ACL '10), pp. 28-36, 2010.

[19] J. Bjorne and T. Salakoski, "Generalizing biomedical event extraction," in Proceedings of the BioNLP Shared Task Workshop, pp. 183-191, 2011.

[20] J. Bjorne and T. Salakoski, "TEES 2.1: automated annotation scheme learning in the BioNLP Shared Task," in Proceedings of the BioNLP Shared Task Workshop (ACL '13), pp. 16-25, 2013.

[21] J. Zou, C. Lippert, D. Heckerman, M. Aryee, and J. Listgarten, "Epigenome-wide association studies without the need for cell-type composition," Nature Methods, vol. 11, pp. 309-311, 2014.

[22] S. Fong, K. Lan, and R. Wong, "Classifying human voices by using hybrid SFX time-series preprocessing and ensemble feature selection," BioMed Research International, vol. 2013, Article ID 720834, 27 pages, 2013.

[23] D. Wang, L. Yang, Z. Fu, and J. Xia, "Prediction of thermophilic protein with pseudo amino acid composition: an approach from combined feature selection and reduction," Protein and Peptide Letters, vol. 18, no. 7, pp. 684-689, 2011.

[24] S. O. Cao and J. H. Manton, "Unsupervised optimal discriminant vector based feature selection method," Mathematical Problems in Engineering, vol. 2013, Article ID 396780, 7 pages, 2013.

[25] L. C. Molina, L. Belanche, and A. Nebot, "Feature selection algorithms: a survey and experimental evaluation," in Proceedings of the IEEE International Conference on Data Mining (ICDM '02), pp. 306-313, December 2002.

[26] C. Yang, W. Zhang, J. Zou, S. Hu, and J. Qiu, "Feature selection in decision systems: a mean-variance approach," Mathematical Problems in Engineering, vol. 2013, Article ID 268063, 8 pages, 2013.

[27] S. van Landeghem, T. Abeel, Y. Saeys, and Y. van de Peer, "Discriminative and informative features for biomolecular text mining with ensemble feature selection," Bioinformatics, vol. 26, no. 18, Article ID btq381, pp. i554-i560, 2010.

[28] D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37-63, 2011.

[29] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
