

Research Article

Supervised Learning Based Hypothesis Generation from Biomedical Literature

Shengtian Sang, Zhihao Yang, Zongyao Li, and Hongfei Lin

College of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China

Correspondence should be addressed to Zhihao Yang: yangzh@dlut.edu.cn

Received 16 January 2015; Revised 12 April 2015; Accepted 24 May 2015

Academic Editor: Hung-Yu Kao

Copyright © 2015 Shengtian Sang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nowadays, the volume of biomedical literature is growing explosively, and much useful knowledge remains undiscovered in it. Researchers can form biomedical hypotheses by mining this literature. In this paper, we propose a supervised learning based approach to generate hypotheses from biomedical literature. This approach splits the traditional processing of hypothesis generation with the classic ABC model into an AB model and a BC model, both constructed with supervised learning methods. Compared with concept cooccurrence and grammar engineering-based approaches such as SemRep, machine learning based models usually achieve better performance in information extraction (IE) from texts. Then, by combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature. The experimental results on the three classic Swanson hypotheses show that our approach outperforms the SemRep system.

1. Introduction

Literature-based discovery (LBD) was pioneered by Swanson in the 1980s; it focuses on finding new relationships in existing knowledge from unrelated literatures and provides logical explanations [1-3]. Swanson's method is to find a bridge that links two conceptually related topics that ought to have been studied together but never have been. For instance, in his initial work [1], Swanson found that there are some types of blood disorders in patients with Raynaud's phenomenon (A), such as high blood viscosity (B), and at the same time he found in other literature that fish oil (C) can reduce blood viscosity (B). Since no literature had proved the relationship between fish oil (C) and Raynaud's phenomenon (A) at that time, Swanson proposed that fish oil (C) might treat Raynaud's phenomenon (A), and this hypothesis was verified in medical experiments two years later. In this hypothesis, the topic of blood viscosity (B) served as a bridge between the topics of Raynaud's phenomenon (A) and dietary fish oil (C). Swanson summarized this method as the ABC model, and he has published several other medical discoveries using this methodology [2, 3].

Since Swanson reported that literature-based discoveries are actually possible, many works have contributed more advanced and automated methods for LBD. Most of the early LBD research adopted information retrieval (IR) techniques to illustrate the effectiveness of the ABC model for LBD. The idea behind these methods is that the higher the cooccurrence frequency of two concepts A and B, the more related they are. By using these statistical characteristics, the ABC model is automatically achieved. Weeber et al. used concept cooccurrence as a relation measure and applied UMLS semantic types for filtering [4, 5]. For example, the semantic type of one of the cooccurring concepts might be set to disease or syndrome and the other to pharmacologic substance; thus only cooccurrences between a disease and a drug are found. Srinivasan [6] developed a system, called Manjal, which uses Medical Subject Headings (MeSH) terms as concepts and term weights instead of simple term frequencies. The system uses an information retrieval measure based on term cooccurrence for ranking. Yetisgen-Yildiz and Pratt [7] developed a system, called LitLinker, which incorporates knowledge-based methodologies with a statistical method. LitLinker marks the terms with z-scores, takes the terms whose scores are larger than a predefined threshold as the terms correlated with the starting or linking term, and the authors evaluate LitLinker's performance by adopting the information retrieval metrics of precision

Hindawi Publishing Corporation, BioMed Research International, Volume 2015, Article ID 698527, 12 pages. http://dx.doi.org/10.1155/2015/698527


and recall. However, all the methods mentioned above are mainly based on statistical characteristics. The main issue of the hypothesis generation approaches based on cooccurrence is that the extracted relationships lack logical explanations. On the one hand, some extracted pairs of entities are actually completely uncorrelated, yet their high cooccurrence frequency suggests a strong association between them. On the other hand, although two entities may have strong semantic correlations, they might not be extracted from the literature because of their low cooccurrence frequency. In addition, although these cooccurrence-based approaches succeed in finding intermediate (B) concepts, they provide no insight into the nature of the relationships among such concepts [8].

Recently, Hu et al. [9] presented a system called Bio-SARS (Biomedical Semantic-Based Association Rule System), which utilizes both semantic information and association rules to expand the number of semantic types proposed by Weeber, and this system achieves better performance. Miyanishi et al. advanced the concept of semantic similarity between events based on semantic information [10]. Hristovski et al. combined two natural language processing systems, SemRep and BioMedLee, to provide predications; analysis using predications can support an explanation of potential discoveries [11]. Cohen et al. proposed the Predication-Based Semantic Indexing (PSI) approach to search predications extracted from the biomedical literature by the SemRep system [12]. Cameron et al. presented a methodology that leverages the semantics of assertions extracted from biomedical literature (called semantic predications) along with structured background knowledge and graph-based algorithms to semiautomatically capture the informative associations originally discovered manually by Swanson [8]. However, all the above hypothesis generation approaches based on semantic information and association rules rely on the semantic extraction tool SemRep, and the performance of SemRep is not perfect: its precision, recall, and F-score are 0.73, 0.55, and 0.63, respectively [13]. On the one hand, the low recall (0.55) means a substantial number of semantic associations between entities will be missed; on the other hand, the low precision (0.73) means many false semantic associations will be returned.

The aim of our work is to find an effective method to extract semantic relationships from biomedical literature. In this paper, we propose a supervised learning based approach to generate hypotheses from biomedical literature. This approach divides the traditional hypothesis generation model, the ABC model, into two machine learning based models: the AB and BC models. The AB model is used to determine whether a physiological phenomenon (linking term B) is caused by a disease (initial term A) in a sentence, and the BC model is used to determine whether there exists an entity (e.g., a pharmacologic substance, target term C) having physiological effects (linking term B) in a sentence. Compared with concept cooccurrence and grammar engineering-based approaches such as SemRep, machine learning based models usually achieve better performance in information extraction (IE) from texts [14]. In our experiments, the performances of the AB and BC models (both more than 0.76 measured in F-score) are much better than that of

Figure 1: Open discovery process, a one-direction search process which starts at A and results in C (via linking terms B1, ..., Bn to candidate targets C1, ..., Cn).

SemRep (0.63 in F-score [13]). Then, through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature. The experimental results on the three classic Swanson hypotheses also show that our approach achieves better performance than the SemRep system.

2. Related Resources and Tools

2.1. Open and Closed Discovery. Weeber et al. summarized that the hypothesis-generating approaches proposed by Swanson are considered open discovery and the testing approaches are considered closed discovery [4].

Open Discovery. The process of open discovery is characterized by the generation of a hypothesis. Figure 1 depicts the open discovery approach beginning with disease A. The researcher will try to find interesting clues (B), typically physiological processes which play a role in the disease under scrutiny. Next, he (or she) tries to identify C-terms, typically substances which act on the selected Bs. As a result of the process, the researcher may form the hypothesis that substance Cn can be used for the treatment of disease A via pathway Bn.

Closed Discovery. A closed discovery process is the testing of a hypothesis. If the researcher has already formed a hypothesis, possibly by the open discovery route described above, he (or she) can elaborate and test it from the literature. Figure 2 depicts the approach: starting from both disease A and substance C, the researcher tries to find common intermediate B terms.
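The two discovery directions can be sketched in a few lines. This is a toy illustration over hypothetical A-B and B-C link sets (here, pairs encoding Swanson's fish oil example), not the paper's actual pipeline, which extracts these links from sentences with the learned models:

```python
# Toy sketch of the open and closed discovery processes over hypothetical
# A-B and B-C link sets (pairs encoding Swanson's fish oil example).

def open_discovery(a_term, ab_links, bc_links, known_ac):
    """From disease A, collect linking B terms, then candidate C terms not
    already known to relate to A; returns each C with its supporting Bs."""
    b_terms = {b for (a, b) in ab_links if a == a_term}
    candidates = {}
    for (b, c) in bc_links:
        if b in b_terms and (a_term, c) not in known_ac:
            candidates.setdefault(c, set()).add(b)
    return candidates

def closed_discovery(a_term, c_term, ab_links, bc_links):
    """Start simultaneously from A and C and return the overlapping B terms."""
    bs_from_a = {b for (a, b) in ab_links if a == a_term}
    bs_from_c = {b for (b, c) in bc_links if c == c_term}
    return bs_from_a & bs_from_c

ab = {("raynaud", "blood viscosity"), ("raynaud", "platelet aggregation")}
bc = {("blood viscosity", "fish oil"), ("platelet aggregation", "fish oil")}
cands = open_discovery("raynaud", ab, bc, known_ac=set())
bridges = closed_discovery("raynaud", "fish oil", ab, bc)
```

With no known A-C link, open discovery surfaces fish oil as a candidate C with both Bs as support, and closed discovery recovers the same two bridging terms.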

2.2. MetaMap and Semantic Types. MetaMap is a highly configurable application developed by the Lister Hill National Center for Biomedical Communications at the National


Figure 2: Closed discovery process; the process starts simultaneously from A and C, resulting in overlapping Bs (B1, ..., Bn).

Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to identify Metathesaurus concepts referred to in English text. Because every concept in the Metathesaurus has been assigned one or more semantic types, we filter the results of the text-to-concept mapping process by means of the semantic types. In the different stages of the process, we employ different semantic filters. For example, in the stage of selecting intermediate B terms, we choose the terms with functional semantic types such as biologic function, cell function, phenomenon or process, and physiologic function. When selecting dietary factors as A terms, we choose the concepts with functional semantic types such as vitamin, lipid, and element, ion, or isotope. Table 1 provides the functional semantic types that we use to filter the linking terms and target terms in our experiments.

2.3. SemRep and SemMedDB. SemRep was developed at the National Library of Medicine; it is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text. For example, from the sentence in Table 2, SemRep extracts four predications (as shown in Table 2). The Semantic Medline Database (SemMedDB) is a repository of the semantic predications extracted by SemRep. SemMedDB currently contains approximately 70 million predications from all PubMed citations (more than 23.5 million citations as of April 1, 2014) and forms the backbone of the Semantic Medline application [15].

2.4. General Concept Filtering. Many general concepts will be generated after the processing of Named Entity Recognition (NER) by MetaMap. General concepts refer to terms which have relationships with many entities but carry little specific meaning, such as "disease". Because the existence of general concepts may reduce the effectiveness of knowledge discovery, they need to be filtered out. In our experiment, first we extract all the sentences with relations such as "subject ISA object" from SemMedDB, and then a general concept list is

Table 1: The semantic type filter used in our experiments.

Linking term                Target term
Biologic function           Organic chemical
Cell function               Lipid
Finding                     Pharmacologic substance
Molecular function          Vitamin
Organism function           Element, ion, or isotope
Organ or tissue function
Pathologic function
Phenomenon or process
Physiologic function

constructed by collecting all the objects from those sentences. All the relations used in our experiment are PART_OF, LOCATION_OF, and ISA.
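The filtering step above amounts to collecting predication objects and using them as a stop list. A minimal sketch, with hypothetical triples standing in for rows queried from SemMedDB:

```python
# Sketch of the general-concept filter: every object of a PART_OF /
# LOCATION_OF / ISA predication is treated as a general concept and removed.

GENERALIZING_RELATIONS = {"PART_OF", "LOCATION_OF", "ISA"}

def build_general_concept_list(predications):
    """predications: iterable of (subject, relation, object) triples."""
    return {obj for (_, rel, obj) in predications
            if rel in GENERALIZING_RELATIONS}

preds = [  # hypothetical stand-ins for SemMedDB predications
    ("Raynaud disease", "ISA", "disease"),
    ("myocardium", "PART_OF", "heart"),
    ("fish oil", "TREATS", "Raynaud disease"),
]
general = build_general_concept_list(preds)
kept = [(s, r, o) for (s, r, o) in preds
        if s not in general and o not in general]
```

Here "disease" and "heart" land on the stop list, so only the TREATS predication survives.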

2.5. Stanford Parser. The Stanford Parser was built by the Stanford NLP (natural language processing) Group in the 1990s. It is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. The primary function of the Stanford Parser is to analyze and extract the syntactic structure of sentences and to perform part-of-speech tagging (POS tagging). The parser provides Stanford Dependencies output as well as phrase structure trees. In our method, the features of the graph kernel-based method are extracted by the Stanford Parser [17].

3. Method

For knowledge discovery, we split the traditional ABC model into two models, the AB model and the BC model. Both models are constructed by using cotraining methods. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence.

The supervised learning methods used in our experiment are all kernel-based methods, and kernel methods are effective alternatives to explicit feature extraction [18]. They retain the original representation of objects and use each object only via computing a kernel function between a pair of objects. Such a kernel function makes it possible to compute the similarity between objects without enumerating all the features. The features we employ in our experiment are as follows.

3.1. Feature Sets. A kernel can be thought of as a similarity function for pairs of objects. The following features are used in our feature-based kernel.

(i) Neighboring Words. In a sentence, the words surrounding two concepts have a significant impact on the existence of a relationship between the two concepts. In our method, the words surrounding two concepts are considered the


Table 2: An example sentence and the semantic predications extracted from it by SemRep.

Sentence: We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.

Extracted predications (subject-relation-object triples):
  Hemofiltration-TREATS-Patients
  Digoxin overdose-PROCESS_OF-Patients
  Hyperkalemia-COMPLICATES-Digoxin overdose
  Hemofiltration-TREATS(INFER)-Digoxin overdose

neighboring words feature. This feature consists of all the words that are located between the two concepts and the words surrounding the two entity names, namely three words to the left of the first entity name and three words to the right of the second entity name. We add a prefix to each word to distinguish the different positions of the words; for example, the word "word" is expressed in the above three cases as "m word", "l word", and "r word", respectively. When using the neighboring words feature, a dictionary of the whole corpus is established, and a sentence is represented as a Boolean vector, where "1" means the feature exists in the sentence and "0" means it does not.

(ii) Entity Name Distance. Under the assumption that the shorter the distance (the number of words) between two entity names is, the more likely the two entities are to have an interaction relation, the distance is chosen as a feature. The feature value is set to the number of words between the two entity names.

(iii) Relationship Words. Through analyzing the corpus, we hypothesize that if there exists a relationship between two entity concepts, there is a greater probability that certain verbs or their variants appear around the two concepts, such as "activation", "induce", and "modulate". These words are used as relationship words in our method. We build a relationship word list of about 500 entries, and the list is used to determine whether there is a relationship between the two entity concepts in a sentence. Boolean "1" means a relationship word exists in the sentence and "0" means it does not.

(iv) Negative Words. Some negative words, such as "not", "neither", and "no", exist in some sentences, and these negative words express that there is no relationship between two entity concepts. If a negative word and a relationship word cooccur in a sentence, it is difficult to judge whether there exists a relationship in the sentence based on relationship words alone, so the negative words feature was introduced to improve this situation. Boolean "1" means a negative word exists in the sentence and "0" means it does not.

For example, in sentence A, "Quercetin, one of the most representative C0596577 compounds, is involved in antiradical C0003402 and prooxidant biological processes", all the features extracted from it are shown in Table 3 (we preprocessed sentence A with MetaMap, and the two target entities are represented by their CUIs). Finally, the sentence is represented by a feature vector.

Table 3: Features extracted from example sentence A.

Feature names                    Feature values
Left words                       l the, l most, l representative
Words between two entity names   m compounds, m is, m involved, m in, m antiradical
Right words                      r and, r prooxidant, r biological
Relationship words               involved
Protein name distance            5
Negative word                    (none)
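The feature extraction described above can be sketched directly. This is a minimal illustration, not the paper's implementation: the relationship and negative word lists are tiny hypothetical stand-ins for the ~500-entry list, and the "l_/m_/r_" prefixes are just a cosmetic rendering of the paper's "l word" notation:

```python
# Sketch of the Section 3.1 word-level features, assuming entities have
# already been replaced by their CUIs (as in example sentence A).

RELATIONSHIP_WORDS = {"involved", "induce", "activation", "modulate"}  # stand-in
NEGATIVE_WORDS = {"not", "neither", "no"}

def extract_features(tokens, e1, e2, window=3):
    i, j = tokens.index(e1), tokens.index(e2)
    between = tokens[i + 1:j]
    words = (["l_" + w for w in tokens[max(0, i - window):i]]     # left context
             + ["m_" + w for w in between]                        # middle
             + ["r_" + w for w in tokens[j + 1:j + 1 + window]])  # right context
    return {
        "words": words,
        "relationship_words": sorted(RELATIONSHIP_WORDS & set(between)),
        "entity_distance": len(between),
        "negative_word": bool(NEGATIVE_WORDS & set(tokens)),
    }

sent = ("Quercetin one of the most representative C0596577 compounds is "
        "involved in antiradical C0003402 and prooxidant biological processes").split()
feats = extract_features(sent, "C0596577", "C0003402")
```

On sentence A this reproduces Table 3: three prefixed words on each side, five middle words (distance 5), the relationship word "involved", and no negative word.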

3.2. Graph Kernel. In our experiment, all the sentences are parsed by the Stanford Parser to generate the output of the dependency path and POS path. A graph kernel calculates the similarity between two input graphs by comparing the relations between common vertices (nodes). The graph kernel used in our method is the all-paths graph kernel proposed by Airola et al. [19]. The kernel represents the target pair using graph matrices based on two subgraphs, where the graph features include all nonzero elements in the graph matrices. The two subgraphs are a parse structure subgraph (PSS) and a linear order subgraph (LOS), as shown in Figure 3. PSS represents the parse structure of a sentence and includes word or link vertices. A word vertex contains its lemma and its POS, while a link vertex contains its link. Additionally, both types of vertices contain their positions relative to the shortest path. LOS represents the word sequence in the sentence and thus has word vertices, each of which contains its lemma, its relative position to the target pair, and its POS.

For the calculation, two types of matrices are used: a label matrix L and an edge matrix A. The label matrix is a (sparse) N x L matrix, where N is the number of vertices (nodes) and L is the number of labels. It represents the correspondence between labels and vertices, where L_{ij} is equal to 1 if the i-th vertex corresponds to the j-th label and 0 otherwise. The edge matrix is a (sparse) N x N matrix and represents the relations between pairs of vertices, where A_{ij} is a weight w_{ij} if the i-th vertex is connected to the j-th vertex and 0 otherwise. The weight is a predefined constant, whereby the edges on the shortest paths are assigned a weight of 0.9 and other edges receive a weight of 0.3. Using the Neumann series, a graph matrix G is calculated as

    G = L^T \sum_{n=1}^{\infty} A^n L = L^T ((I - A)^{-1} - I) L.    (1)

This matrix represents the sums of all the path weights between any pair of vertices, resulting in entries representing


Figure 3: Graph representation generated from an example sentence. The candidate interaction pair is marked as PROT1 and PROT2; the third protein is marked as PROT. The shortest path between the proteins is shown in bold. In the dependency-based subgraph, all nodes in a shortest path are specialized using a post-tag (IP). In the linear order subgraph, the possible tags are before (B), middle (M), and after (A). For the other two candidate pairs in the sentence, graphs with the same structure but different weights and labels would be generated.

the strength of the relation between each pair of vertices. Using two input graph matrices G and G', the graph kernel k(G, G') is the sum of the products of the common relations' weights, given by

    k(G, G') = \sum_{i=1}^{L} \sum_{j=1}^{L} G_{ij} G'_{ij}.    (2)
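Equations (1) and (2) can be checked numerically. The sketch below uses two hypothetical 3-vertex chain graphs; the Neumann series converges here because A is nilpotent (more generally, it requires spectral radius less than 1):

```python
# Numerical sketch of equations (1) and (2) with NumPy. A holds edge weights
# (0.9 on shortest-path edges, 0.3 elsewhere); L marks each vertex's label.

import numpy as np

def graph_matrix(A, L):
    """G = L^T((I - A)^{-1} - I)L, the closed form of the Neumann series
    sum_{n>=1} A^n sandwiched between the label matrices (equation (1))."""
    n = A.shape[0]
    paths = np.linalg.inv(np.eye(n) - A) - np.eye(n)
    return L.T @ paths @ L

def graph_kernel(G1, G2):
    """k(G, G') = sum_ij G_ij * G'_ij (equation (2))."""
    return float(np.sum(G1 * G2))

# Two hypothetical 3-vertex chain graphs over the same 3 labels:
A1 = np.array([[0.0, 0.9, 0.0], [0.0, 0.0, 0.9], [0.0, 0.0, 0.0]])
A2 = np.array([[0.0, 0.3, 0.0], [0.0, 0.0, 0.9], [0.0, 0.0, 0.0]])
L = np.eye(3)  # each vertex carries a distinct label
k = graph_kernel(graph_matrix(A1, L), graph_matrix(A2, L))
```

With L as the identity, G is simply A + A^2 for these chains, and k sums the products of matching one-step and two-step path weights.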

3.3. Cotraining Algorithm. Cotraining is a semisupervised learning algorithm used when there are only small amounts of labeled data and large amounts of unlabeled data, and it requires two views of the data [20]. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Cotraining first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data. In our experiment, the two different feature sets used to describe a sentence are word features and grammatical features (the graph kernel extracted by the Stanford Parser), respectively.

4. Classification Models

4.1. Evaluation Metrics. In our study, the evaluation metrics precision (P), recall (R), and F-score (F) are employed, which are defined as follows:

    P = TP / (TP + FP) \times 100\%,
    R = TP / (TP + FN) \times 100\%,
    F = (2 \times P \times R) / (P + R) \times 100\%.    (3)

TP is the number of correctly predicted pieces of data (true positives), FP is the number of false positives, and FN is the number of false negatives in the test set. P is used to evaluate the accuracy of a model, R is the recall rate, and F is the harmonic mean of the precision and recall rates.

In addition, we define the effective linking terms as the terms in a sentence which can connect the initial term and the target term. For example, in the case of Raynaud's disease and fish oil, the concepts of platelet aggregation and blood vessel contraction are all effective linking terms. The proportion of effective linking terms among all linking terms is defined as

    S = n / N \times 100\%,    (4)

where n is the number of effective linking terms and N is the number of all the terms which have relationships with initial terms.
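Equations (3) and (4) translate directly into small helpers. The sketch works with plain ratios; the "x 100%" in the formulas only expresses the same quantities as percentages:

```python
# Equations (3) and (4) as helper functions (ratios rather than percentages).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_score(p, r):
    return 2 * p * r / (p + r)  # harmonic mean of precision and recall

def effective_share(n_effective, n_total):
    """S = n / N: proportion of linking terms that actually connect the
    initial term to the target term (equation (4))."""
    return n_effective / n_total

# Sanity check against the SemRep scores quoted earlier (P = 0.73, R = 0.55):
f = f_score(0.73, 0.55)
```

Plugging in SemRep's reported precision and recall reproduces its F-score of about 0.63.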

4.2. Training the AB Model. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence.

We obtain all the sentences (the corpus) used in our experiment by searching the Semantic Medline Database with 200 different semantic types related to "disease or syndrome" defined by MeSH (Medical Subject Headings). Then all the sentences are processed with MetaMap for Named Entity Recognition (NER), and, to limit the semantic types of initial terms and linking terms in a sentence, we filter out the sentences which contain concepts of the unneeded semantic types. Finally, we obtain a total of 20,895 sentences, and we randomly choose only 1000 sentences (roughly 5% of all the sentences) for manual annotation; we then build two initial labeled data sets, an initial training set Tinitial (500 labeled sentences) and a test set (the other 500 labeled sentences). There are two reasons why we randomly choose only 1000 sentences for manual annotation. First, manual annotation is very time-consuming and expensive, while unlabeled training data are usually much easier to obtain; this is also why we introduce the cotraining method to improve the performance of our experiment. The other reason is that, when cotraining style algorithms are executed, the


Table 4: Annotation values of the two annotators.

                     Reviewer 1
Reviewer 2           Positive    Negative    Kappa score
Positive             385         22
Negative             43          550
Ours                                         0.8664
Light et al. [16]                            0.68

number of labeled examples is usually small [21], so we select about 5% of the corpus as the training set and test set, and the other 95% of the corpus is used for extending the training set later.

During the manual annotation, the following criteria are applied: if a sentence contains a relationship belonging to the semantic type list (see Table 1) between two entity concepts, we consider that there is a positive correlation between the two entities and label the sentence as a positive example. In addition, some special relationships such as "B in A" and "A can change B" are also classified as positive examples, since they mean a physiological phenomenon (B) occurs when someone has the disease (A). If there is no obvious relationship in a sentence, only a cooccurrence relationship, we label it as a negative example. For patterns such as "A is a B" and "A and B", we label them as negative examples, since "A is a B" is an "IS A" relationship and "A and B" is a coordination relationship, and they are not the relationships we need. When the process was completed, we estimated the level of agreement: the Cohen's kappa [22] score between the two annotators, as shown in Table 4, is 0.8664, and content analysis researchers generally regard a Cohen's kappa score above 0.8 as good reliability [22]. Comparably, Light et al. achieved their highest kappa value (0.68) in their manual annotation experiments on Medline abstracts [16].
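The reported kappa can be recomputed from the agreement counts in Table 4. A short check, using the standard formula for Cohen's kappa on a 2x2 table:

```python
# Cohen's kappa recomputed from the Table 4 counts (385 / 22 / 43 / 550);
# the result reproduces the reported score of 0.8664.

def cohens_kappa(both_pos, pos_neg, neg_pos, both_neg):
    """pos_neg = reviewer 2 positive / reviewer 1 negative; neg_pos is the
    reverse case."""
    total = both_pos + pos_neg + neg_pos + both_neg
    p_observed = (both_pos + both_neg) / total
    # Chance agreement from the two reviewers' marginal frequencies:
    p_chance = ((both_pos + pos_neg) * (both_pos + neg_pos)
                + (neg_pos + both_neg) * (pos_neg + both_neg)) / total ** 2
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohens_kappa(385, 22, 43, 550)  # ~0.8664
```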

As shown in Figure 4, the process of training the AB model is as follows. At first, two initial SVM classifiers, M^0_1 (graph-based kernel) and M^0_2 (feature-based kernel), are trained using the initial training set Tinitial, which contains 500 labeled examples. The test results of the two classifiers are as follows: with classifier M^0_1, the values of precision, recall, and F-score are 72.88%, 83.33%, and 77.76%, respectively; at the same time, the results of classifier M^0_2 are 74.35%, 76.33%, and 75.33%, respectively.

In the next step, 2000 unlabeled sentences (a section of the unlabeled corpus) are predicted by classifiers M^0_1 and M^0_2, respectively. The classifier M^0_1 selects the top 200 scores (10% of the 2000 results classified by M^0_1) as positive examples and the bottom 200 scores as negative examples; we then put these 400 newly labeled examples into the data set Tinitial (500) to form a new training set T^1_2 (900 examples). At the same time, the classifier M^0_2 selects the top 200 scores (10% of the 2000 results classified by M^0_2) as positive examples and the bottom 200 scores as negative examples; we then put these 400 newly labeled examples into the data set Tinitial (500) to form a new training set T^1_1 (900 examples). Then two new classifiers, M^1_1 and M^1_2, are trained from T^1_1 and T^1_2, respectively. After that, 3000 unlabeled sentences (a section of the unlabeled


Figure 4: The process of training the AB model.

corpus) are classified by M^1_1 and M^1_2, respectively. M^1_1 selects the top 300 scores (10% of the 3000 results classified by M^1_1) as positive examples and the bottom 300 scores as negative examples. We use the 600 newly labeled examples to update T^1_2 and obtain a new training set T^2_2, while M^1_2 also selects 600 newly labeled examples to update T^1_1, yielding the new training set T^2_1. Such a process is repeated for a preset number of learning rounds.

In our experiment, the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively. The size of the training set thus increases from the initial 500 (Tinitial) to 900 (T^1_1 and T^1_2), 1500 (T^2_1 and T^2_2), 2300 (T^3_1 and T^3_2), 3300 (T^4_1 and T^4_2), and 4500 (T^5_1 and T^5_2), respectively. There are two reasons why we set five learning rounds. First, after five learning rounds there are no extra unlabeled sentences: 95% of the corpus (20,895 sentences) is used for extending the training set, and after five expansions of the training set no unlabeled data remain. Second, the cotraining process cannot improve the performance further after a number of rounds [21], so we should terminate cotraining at an appropriate round to avoid wasteful learning rounds.
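The round-by-round expansion described above can be sketched as a generic co-training loop. In the sketch below, `train` and `score` are hypothetical placeholders for the actual graph-kernel and feature-kernel SVM training and decision functions:

```python
# Schematic co-training loop: each classifier labels its most confident
# examples for the OTHER classifier's training set. train(view, data) and
# score(model, x) are placeholders for the two kernel-specific SVMs.

def cotrain(initial_set, unlabeled_batches, train, score):
    t1 = list(initial_set)   # training set of classifier 1 (graph kernel)
    t2 = list(initial_set)   # training set of classifier 2 (feature kernel)
    m1, m2 = train(1, t1), train(2, t2)
    for batch in unlabeled_batches:      # e.g. 2000, 3000, ..., 6000 sentences
        k = len(batch) // 10             # top/bottom 10% of the batch
        ranked1 = sorted(batch, key=lambda x: score(m1, x), reverse=True)
        ranked2 = sorted(batch, key=lambda x: score(m2, x), reverse=True)
        # M1's confident labels extend T2; M2's confident labels extend T1.
        t2 += [(x, 1) for x in ranked1[:k]] + [(x, 0) for x in ranked1[-k:]]
        t1 += [(x, 1) for x in ranked2[:k]] + [(x, 0) for x in ranked2[-k:]]
        m1, m2 = train(1, t1), train(2, t2)
    return m1, m2, t1, t2
```

With an initial set of 500 examples and batches of 2000, 3000, 4000, 5000, and 6000 unlabeled sentences, both training sets grow to 900, 1500, 2300, 3300, and 4500 examples, matching the sizes reported above.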

The best testing result is from the second learning round (the classifier is M^2_1 (graph kernel) and the training set is

BioMed Research International 7


Figure 5: The results of training the AB and BC models in every learning round by the cotraining algorithm.

T^2_1); an F-score of 77.8% is achieved, as shown in Figure 5. After five rounds of cotraining, although there are fluctuations, the F-score of the feature-based kernel improves overall and finally reaches 76.82%.

We can see from both figures that the best results of the two models (feature-based model and graph kernel-based model) are both better than the results of the models (M^0_1 and M^0_2) trained only on the initial training set (Tinitial). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose M^2_1 as the AB model.

4.3. Training BC Model. The BC model is used to determine whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to the process of training the AB model.

At first, as with the steps in the process of training the AB model, we obtain 20,490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (approximately 5% of the corpus) to construct the initial training set Tinitial (500) and the test set (500). The process of training the BC model is the same as that of training the AB model: the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.

The best result is achieved in the fifth cotraining learning round (the classifier is M^5_1 (graph kernel) and the training set is T^5_1), and the value of the F-score is 89.71%, as shown in Figure 5. The highest F-score (83.3%) of the feature-based kernel is achieved in the third cotraining learning round.

Like the results of the AB model, the best results of the two models (feature-based model and graph kernel-based model) are both better than the results of the models (M^0_1 and M^0_2) trained only on the initial training set (Tinitial), which again proves the validity of the cotraining method. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose M^5_1 as the BC model.

In both experiments the best results of the graph kernel outperform those of the feature-based model. The reason is twofold: the features selected by the graph kernel include most of the features adopted by the feature kernel, and in addition the graph kernel approach captures the information in unrestricted dependency graphs, which combines syntactic analysis with a representation of the linear order of the sentence and considers all possible paths connecting any two vertices in the resulting graph.

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson hypotheses, that is, "Raynaud's disease and fish oil," "migraine and magnesium," and "Alzheimer's disease and indomethacin." We compare the results generated by our method with the results generated by using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is presented as a benchmark.

5.1. Closed Discovery Experiment. In our study, the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon," and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database (the time is limited to before 1986, since Swanson found the classic hypothesis in 1986 and from then on the hypothesis is considered public knowledge). We thus obtain all the sentences that contain either the initial terms or the target terms, and all the sentences are processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms; at the same time, we use the BC model to classify all the sentences containing the target terms; and finally we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model and the other is produced by the BC model).
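The closed-discovery step just described reduces to intersecting the two sets of positive predictions. A minimal sketch, in which the classifier and extraction callables are hypothetical placeholders:

```python
def closed_discovery(ab_sentences, bc_sentences, ab_model, bc_model, linking_term):
    """Effective linking terms: terms judged positive by BOTH models.

    ab_sentences: sentences containing an initial term (A)
    bc_sentences: sentences containing a target term (C)
    ab_model / bc_model: callables returning True for positive sentences
    linking_term: extracts the candidate linking concept (B) from a sentence
    """
    b_from_a = {linking_term(s) for s in ab_sentences if ab_model(s)}
    b_from_c = {linking_term(s) for s in bc_sentences if bc_model(s)}
    return b_from_a & b_from_c

# Toy run with sentences as (entity, candidate-B) pairs: only "blood viscosity"
# appears as a positive on both the A side and the C side, so only it survives.
ab = [("raynaud disease", "blood viscosity"), ("raynaud disease", "equilibrium")]
bc = [("fish oils", "blood viscosity"), ("fish oils", "platelet aggregation")]
print(closed_discovery(ab, bc, lambda s: True, lambda s: True, lambda s: s[1]))
```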

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms that are both retrieved from SemRep instead of using the AB model and BC model (one set is retrieved with the initial terms and the other is retrieved with the target terms). Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson


Table 5: The number and proportion of useful linking concepts.

Hypothesis | Method | Effective linking terms / linking terms | Ratio
Raynaud's disease and fish oils | Our method | 10/106 | 9.43%
Raynaud's disease and fish oils | SemRep | 3/132 | 2.2%
Migraine and magnesium | Our method | 67/297 | 22.56%
Migraine and magnesium | SemRep | 10/187 | 5.35%
Alzheimer's disease and indomethacin | Our method | 47/251 | 18.72%
Alzheimer's disease and indomethacin | SemRep | 9/87 | 10.34%


Figure 6: The result of closed discovery of "Raynaud's disease and fish oils." The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the term found by both methods.

hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method; the results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean that more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of "Raynaud's disease with fish oils" discovered by the two methods. Our method not only found the linking term "blood viscosity," also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity," and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; we may then make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting vascular reactivity (i.e., vasoconstriction, vascular constriction (function)), which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performance of our machine learning based AB and BC models is much better than that of SemRep, which means our method can return more effective linking terms and therefore may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery by using the SemRep Database (Figure 7), first the initial terms (A terms) are specified by providing the concept "Migraine Disorder," which is used as a keyword to retrieve data from the SemRep Database. Then we obtain all the sentences which contain both the initial terms (Migraine


Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils: Blood viscosity; Adverse event associated with vascular; Mental concentration; Component of protein; Autoimmune; Mediator brand of benfluorex hydrochloride; Physiological reperfusion.

Migraine and magnesium: Arteriospasm; Vasodilation; Desiccated thyroid; Homovanillate; Beta-adrenoceptor activity; Regional blood flow; Recognition (psychology); Container status identified; Hydroxide ion; Muscular dystrophy.

Alzheimer's disease and indomethacin: Metabolic rates; Pituitary hormone preparation; Desiccated thyroid; Cerebrovascular circulation; Normal blood pressure; P blood group antibodies; Dopamine hydrochloride; HLA-DR antigens; Pentose phosphate pathway; Toxic epidermal necrolysis.


Figure 7: The procedure of open discovery with the SemRep Database.

Disorder) and linking terms (other entities). Second, we filter all the sentences obtained from step one; the specific rules are the same as those mentioned in the closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first step: all the linking terms (extracted from every sentence in the filtering step) are used as keywords to retrieve data from the SemRep Database; then we get all the sentences which contain both linking terms (B terms) and target terms (C


Figure 8: The procedure of open discovery with our method.

terms). At last, we rank all the target terms and output the results; the scoring rules of [24] are applied to rank the target terms.



Table 7: Result of open discovery (rank of the real target term / total number of potential target terms).

Initial terms | SemRep | Our method | Combined result set
Raynaud disease and eicosapentaenoic acid | 68/380 | 230/5762 | 61/246
Migraine disorders and magnesium | 186/535 | 97/5349 | 25/297
Alzheimer's disease and indomethacin | —/650 | —/2639 | —/275
Alzheimer's disease and indoprofen | 373/650 | 250/2639 | 38/275


The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database, and in the second step all the sentences are classified by either the AB model or the BC model; we then collect the set of positive examples as experimental data. The other procedures (filtering and ranking) are the same as in the SemRep method.
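The A-to-B-to-C pipeline of Figure 8 can be sketched as follows; `retrieve`, the classifier callables, and the scoring function are hypothetical placeholders for Medline retrieval, the AB/BC models, and the ranking rules of [24]:

```python
def open_discovery(a_term, retrieve, ab_model, bc_model, entity_of, score_of):
    """A -> B -> C open discovery with the AB and BC models.

    retrieve(term): sentences containing `term` (Medline lookup placeholder)
    ab_model / bc_model: callables returning True for positive sentences
    entity_of: extracts the co-occurring entity from a sentence
    score_of: ranking score for a candidate target term (cf. [24])
    """
    # Step 1: sentences with the initial term, classified by the AB model.
    b_terms = {entity_of(s) for s in retrieve(a_term) if ab_model(s)}
    # Step 2: sentences with each linking term, classified by the BC model.
    c_terms = {entity_of(s) for b in b_terms for s in retrieve(b) if bc_model(s)}
    # Step 3: rank the candidate target terms and output the result.
    return sorted(c_terms, key=score_of, reverse=True)
```

A toy corpus with the chain migraine - vasodilation - magnesium would return `["magnesium"]`; in practice the retrieval, classification, and scoring components carry all the weight.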

The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms obtained, and the number to the left of the slash is the ranking of the target term we really want to obtain. For example, from the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, and the ranking of magnesium is 186, while 5349 target terms are discovered with our method, and the ranking of magnesium is 97. As can be seen from these results, our method not only obtains more potential target terms but also achieves a higher ranking of the real target terms.

The higher the ranking of the real target terms is, the more valuable the results become, and more potential target terms mean more clues about the pathophysiology of disease. For example, from the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither method finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25-27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method finds

Table 8 More target terms about Alzheimerrsquos disease

Target terms Alzheimerrsquos diseaseAsprin 13702639Thromboxane 14552639Carprofen 22912639Prostaglandins 24882639

more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28-30] and thromboxane [31]. The detailed results are shown in Table 8.

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by SemRep and by our method, respectively; then we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the terms are ranked by this score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have a higher ranking than both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms will not be eliminated, since such terms usually have a real association with the initial term and are therefore retained by both methods.
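The combination rule above (intersect the two result sets, sum the scores, re-rank) can be sketched as follows; the toy score dictionaries are illustrative only, not the paper's values:

```python
def combine_results(scores_semrep, scores_ours):
    """Keep terms returned by both methods; rank by the sum of their scores."""
    common = scores_semrep.keys() & scores_ours.keys()
    combined = {term: scores_semrep[term] + scores_ours[term] for term in common}
    return sorted(combined, key=combined.get, reverse=True)

# Terms found by only one method are dropped; shared terms are re-ranked
# by their summed scores, so a genuinely associated term tends to rise.
semrep = {"magnesium": 3.0, "calcium": 5.0, "serotonin": 1.0}
ours = {"magnesium": 4.0, "dopamine": 6.0, "serotonin": 0.5}
print(combine_results(semrep, ours))  # -> ['magnesium', 'serotonin']
```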

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses through mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method, respectively. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there


exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature.

The experimental results on the three classic Swanson hypotheses show that our approach can achieve better performance than SemRep. This means our approach can discover more potential correlations between the initial terms and target terms, and therefore it may help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7-18, 1986.
[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526-557, 1988.
[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157-186, 1990.
[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548-557, 2001.
[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252-259, 2003.
[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396-413, 2004.
[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600-611, 2006.
[8] D. Cameron, O. Bodenreider, H. Yalamanchili, et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238-251, 2013.
[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207-223, 2010.
[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552-1558, Sierre, Switzerland, March 2010.
[11] D. Hristovski, C. Friedman, T. C. Rindflesch, et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349-353, 2006.
[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218-225, IEEE, October 2012.
[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209-220, Maui, Hawaii, USA, January 2007.
[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.
[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158-3160, 2012.
[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the HLT-NAACL Workshop on Linking Biological Literature, Ontologies and Databases, Detroit, Mich, USA, 2004.
[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423-430, Association for Computational Linguistics, 2003.
[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.
[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, supplement 11, article S2, 2008.
[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92-100, 1998.
[21] W. Wang and Z. H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 454-465, Springer, Berlin, Germany, 2007.
[22] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249-254, 1996.


[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158-164, 1989.
[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389-398, 2004.
[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639-647, 2007.
[26] J. L. Eriksen, S. A. Sagi, T. E. Smith, et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440-449, 2003.
[27] S. Weggen, J. L. Eriksen, P. Das, et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212-216, 2001.
[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466-31473, 2002.
[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387-437, 2004.
[30] J. Rogers, L. C. Kirby, S. R. Hempelman, et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609-1611, 1993.
[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?" Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536-543, 2007.




and recall. However, all the methods mentioned above are mainly based on statistical characteristics. The main issue of hypothesis generation approaches based on cooccurrence is that the extracted relationships lack logical explanations. On one side, some extracted pairs of entities are actually completely uncorrelated, yet a high cooccurrence frequency suggests a strong association between them; on the other side, although two entities may have strong semantic correlations, they might not be extracted because of their low cooccurrence frequency in the literature. In addition, although these cooccurrence-based approaches succeed in finding intermediate (B) concepts, they provide no insight into the nature of the relationships among such concepts [8].

Recently, Hu et al. [9] presented a system called Bio-SARS (Biomedical Semantic-Based Association Rule System), which utilizes both semantic information and association rules to expand the number of semantic types proposed by Weeber, and this system achieves better performance. Miyanishi et al. advanced the concept of semantic similarity between events based on semantic information [10]. Hristovski et al. combined two natural language processing systems, SemRep and BioMedLEE, to provide predications, and analysis using predications can support an explanation of potential discoveries [11]. Cohen et al. proposed the Predication-Based Semantic Indexing (PSI) approach to search predications extracted from the biomedical literature by the SemRep system [12]. Cameron et al. presented a methodology that leverages the semantics of assertions extracted from biomedical literature (called semantic predications), along with structured background knowledge and graph-based algorithms, to semiautomatically capture the informative associations originally discovered manually by Swanson [8]. However, all the above hypothesis generation approaches based on semantic information and association rules rely on the semantic extraction tool SemRep, and the performance of SemRep is not perfect: its precision, recall, and F-score are 0.73, 0.55, and 0.63, respectively [13]. On one side, low recall (55%) means a substantial number of semantic associations between entities will be missed; on the other side, low precision (73%) means many false semantic associations will be returned.

The aim of our work is to find an effective method to extract semantic relationships from biomedical literature. In this paper, we propose a supervised learning based approach to generate hypotheses from biomedical literature. This approach divides the traditional hypothesis generation model, the ABC model, into two machine learning based models: the AB and BC models. The AB model is used to determine whether a physiological phenomenon (linking term B) is caused by a disease (initial term A) in a sentence, and the BC model is used to determine whether there exists an entity (e.g., a pharmacologic substance; target term C) having physiological effects (linking term B) in a sentence. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, machine learning based models usually can achieve better performance in information extraction (IE) from texts [14]. In our experiments, the performances of the AB and BC models (both are more than 0.76 measured in F-score) are much better than that of


Figure 1: Open discovery process, a one-direction search process which starts at A and results in C.

SemRep (0.63 in F-score [13]). Then, through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature. The experimental results on the three classic Swanson hypotheses also show that our approach achieves better performance than the SemRep system.

2. Related Resources and Tools

2.1. Open and Closed Discovery. Weeber et al. summarized that the hypothesis-generating approaches proposed by Swanson are considered open discovery and the testing approaches are considered closed discovery [4].

Open Discovery. The process of open discovery is characterized by the generation of a hypothesis. Figure 1 depicts the open discovery approach, beginning with disease A. The researcher will try to find interesting clues (B), typically physiological processes which play a role in the disease under scrutiny. Next, he (or she) tries to identify C terms, typically substances which act on the selected Bs. As a result of the process, the researcher may form the hypothesis that substance $C_n$ can be used for the treatment of disease A via pathway $B_n$.

Closed Discovery. A closed discovery process is the testing of a hypothesis. If the researcher has already formed a hypothesis, possibly by the open discovery route described above, he (or she) can elaborate and test it from the literature. Figure 2 depicts the approach: starting from both disease A and substance C, the researcher tries to find common intermediate B terms.

BioMed Research International 3

2.2. MetaMap and Semantic Type. MetaMap is a highly configurable application developed by the Lister Hill National Center for Biomedical Communications at the National Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to identify Metathesaurus concepts referred to in English text. Because every concept in the Metathesaurus has been assigned one or more semantic types, we filter the results of the text-to-concept mapping process by means of the semantic types. In the different stages of the process, we employ different semantic filters. For example, in the stage of selecting intermediate B terms, we choose the terms with functional semantic types such as biologic function, cell function, phenomenon or process, and physiologic function. When selecting dietary factors as A terms, we choose the concepts with functional semantic types such as vitamin, lipid, and element, ion, or isotope. Table 1 provides the functional semantic types that we use to filter the linking terms and target terms in our experiments.

Figure 2: Closed discovery process: the process starts simultaneously from A and C, resulting in overlapping Bs.

2.3. SemRep and SemMedDB. SemRep was developed at the National Library of Medicine and is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text. For example, from the sentence in Table 2, SemRep extracts four predications (as shown in Table 2). The Semantic Medline Database (SemMedDB) is a repository of semantic predications extracted by SemRep. SemMedDB currently contains approximately 70 million predications from all PubMed citations (more than 23.5 million citations as of April 1, 2014) and forms the backbone of the Semantic Medline application [15].

2.4. General Concepts Filtering. Many general concepts will be generated after the processing of Named Entity Recognition (NER) by MetaMap. General concepts refer to terms which have relationships with many entities but carry little specific meaning, such as "disease." Because the existence of general concepts may reduce the effect of knowledge discovery, they need to be filtered. In our experiment, first we extract all the sentences with relations such as "subject ISA object" from SemMedDB, and then a general concept list is constructed by collecting all the objects from the sentences. All the relations used in our experiment are PART_OF, LOCATION_OF, and ISA.

Table 1: The semantic type filter used in our experiments.

Linking term             | Target term
Biologic function        | Organic chemical
Cell function            | Lipid
Finding                  | Pharmacologic substance
Molecular function       | Vitamin
Organism function        | Element, ion, or isotope
Organ or tissue function |
Pathologic function      |
Phenomenon or process    |
Physiologic function     |

2.5. Stanford Parser. The Stanford Parser was built by the Stanford NLP (natural language processing) Group in the 1990s. It is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. The primary function of the Stanford Parser is to analyze and extract the syntactic structure of sentences and perform part-of-speech tagging (POS tagging). The parser provides Stanford Dependencies output as well as phrase structure trees. In our method, the features of the graph kernel-based method are extracted by the Stanford Parser [17].

3. Method

For knowledge discovery, we split the traditional ABC model into two models: the AB model and the BC model. Both models are constructed by using cotraining methods. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence.

The supervised learning methods used in our experiment are all kernel-based methods, and kernel methods are effective alternatives to explicit feature extraction [18]. They retain the original representation of objects and use the objects only via computing a kernel function between a pair of objects. Such a kernel function makes it possible to compute the similarity between objects without enumerating all the features. The features we employ in our experiment are as follows.

3.1. Feature Sets. A kernel can be thought of as a similarity function for pairs of objects. The following features are used in our feature-based kernel.

Table 2: Semantic predications extracted by SemRep from an example sentence.

Sentence: We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.

Extracted predications (subject-relation-object triples):
Hemofiltration-TREATS-Patients
Digoxin overdose-PROCESS_OF-Patients
Hyperkalemia-COMPLICATES-Digoxin overdose
Hemofiltration-TREATS(INFER)-Digoxin overdose

(i) Neighboring Words. In a sentence, the words surrounding two concepts have a significant impact on the existence of the relationship between the two concepts. In our method, the words surrounding the two concepts are considered the neighboring words feature. This feature consists of all the words located between the two concepts and the words surrounding the two protein names, which include the three words to the left of the first protein name and the three words to the right of the second protein name. We add a prefix to each word to distinguish the different positions of the words; for example, the word "word" is expressed in the above three cases as "m_word," "l_word," and "r_word," respectively. When using the neighboring words feature, a dictionary of the whole corpus is established, and a sentence is represented as a Boolean vector, where "1" means the feature exists in the sentence and "0" means it does not.

(ii) Entity Name Distance. Under the assumption that the shorter the distance (the number of words) between two entity names is, the more likely the two proteins have an interaction relation, the distance is chosen as a feature. The feature value is set to the number of words between the two entity names.

(iii) Relationship Words. Through analyzing the corpus, we make the hypothesis that, if there exists a relationship between two entity concepts, there will be a greater probability that some verbs or their variants appear around the two concepts, such as "activation," "induce," and "modulate." These words are used as relationship words in our method. We build a list of about 500 relationship words, and the list is used to determine whether there is a relationship between the two entity concepts in a sentence. Boolean "1" means a relationship word exists in the sentence, and "0" means it does not.

(iv) Negative Words. Some negative words, such as "not," "neither," and "no," exist in some sentences, and these negative words express that there is no relationship between two entity concepts. If a negative word and a relationship word cooccur in a sentence, it is difficult to judge whether there exists a relationship in the sentence based only on relationship words, so the negative words feature was introduced to improve the situation. Boolean "1" means a negative word exists in the sentence, and "0" means it does not.

For example, for sentence A, "Quercetin, one of the most representative C0596577 compounds, is involved in antiradical C0003402 and prooxidant biological processes," all the features extracted from it are shown in Table 3 (we preprocessed sentence A with MetaMap, and the two target entities are represented by their CUIs). Finally, the sentence is represented by a feature vector.

Table 3: Features extracted from example sentence A.

Feature names                  | Feature values
Left words                     | l_the, l_most, l_representative
Words between two entity names | m_compounds, m_is, m_involved, m_in, m_antiradical
Right words                    | r_and, r_prooxidant, r_biological
Relationship words             | involved
Protein name distance          | 5
Negative word                  | —
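As a minimal sketch (not the authors' exact implementation; the tokenizer is simplified, the word lists are tiny illustrative stand-ins for the ~500-entry lists described above, and the token positions are chosen to reproduce Table 3), the feature extraction might look like:

```python
# Illustrative stand-ins for the relationship/negative word lists in Section 3.1.
RELATION_WORDS = {"activation", "induce", "involved", "modulate"}
NEGATIVE_WORDS = {"not", "neither", "no"}

def extract_features(tokens, i, j, window=3):
    """Features for the entity pair at token positions i < j."""
    left = ["l_" + w for w in tokens[max(0, i - window):i]]
    middle = ["m_" + w for w in tokens[i + 1:j]]
    right = ["r_" + w for w in tokens[j + 1:j + 1 + window]]
    return {
        "words": left + middle + right,   # position-prefixed neighboring words
        "distance": j - i - 1,            # number of words between the entities
        "relation_word": int(any(w in RELATION_WORDS for w in tokens)),
        "negative_word": int(any(w in NEGATIVE_WORDS for w in tokens)),
    }

# Fragment of example sentence A after MetaMap preprocessing;
# the two entity CUIs sit at token positions 3 and 9.
tokens = ("the most representative C0596577 compounds is involved in "
          "antiradical C0003402 and prooxidant biological").split()
features = extract_features(tokens, 3, 9)
```

Running this on the fragment reproduces the rows of Table 3: the distance is 5, and the prefixed words are `l_the ... m_compounds ... r_biological`.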

3.2. Graph Kernel. In our experiment, all the sentences are parsed by the Stanford Parser to generate the dependency path and POS path output. A graph kernel calculates the similarity between two input graphs by comparing the relations between common vertices (nodes). The graph kernel used in our method is the all-paths graph kernel proposed by Airola et al. [19]. The kernel represents the target pair using graph matrices based on two subgraphs, where the graph features include all nonzero elements in the graph matrices. The two subgraphs are a parse structure subgraph (PSS) and a linear order subgraph (LOS), as shown in Figure 3. PSS represents the parse structure of a sentence and includes word or link vertices: a word vertex contains its lemma and its POS, while a link vertex contains its link. Additionally, both types of vertices contain their positions relative to the shortest path. LOS represents the word sequence in the sentence and thus has word vertices, each of which contains its lemma, its relative position to the target pair, and its POS.

For the calculation, two types of matrices are used: a label matrix $L$ and an edge matrix $A$. The label matrix is a (sparse) $N \times L$ matrix, where $N$ is the number of vertices (nodes) and $L$ is the number of labels. It represents the correspondence between labels and vertices, where $L_{ij}$ is equal to 1 if the $i$th vertex corresponds to the $j$th label and 0 otherwise. The edge matrix is a (sparse) $N \times N$ matrix and represents the relations between pairs of vertices, where $A_{ij}$ is a weight $w_{ij}$ if the $i$th vertex is connected to the $j$th vertex and 0 otherwise. The weight is a predefined constant, whereby the edges on the shortest paths are assigned a weight of 0.9 and other edges receive a weight of 0.3. Using the Neumann series, a graph matrix $G$ is calculated as

$$G = L^{T} \sum_{n=1}^{\infty} A^{n} L = L^{T} \left( (I - A)^{-1} - I \right) L. \quad (1)$$

This matrix represents the sums of all the path weights between any pair of vertices, resulting in entries representing the strength of the relation between each pair of vertices. Using two input graph matrices $G$ and $G'$, the graph kernel $k(G, G')$ is the sum of the products of the common relations' weights, given by

$$k(G, G') = \sum_{i=1}^{L} \sum_{j=1}^{L} G_{ij} G'_{ij}. \quad (2)$$

Figure 3: Graph representation generated from an example sentence. The candidate interaction pair is marked as PROT1 and PROT2; the third protein is marked as PROT. The shortest path between the proteins is shown in bold. In the dependency-based subgraph, all nodes in a shortest path are specialized using a post-tag (IP). In the linear order subgraph, possible tags are before (B), middle (M), and after (A). For the other two candidate pairs in the sentence, graphs with the same structure but different weights and labels would be generated.
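Equations (1) and (2) can be sketched with NumPy as follows. This is a toy example, not the paper's sparse implementation: it assumes dense matrices and edge weights small enough that $(I - A)^{-1}$ exists (the series converges), and the 3-vertex graph and its labels are invented for illustration.

```python
import numpy as np

def graph_matrix(A, L):
    """Equation (1): G = L^T ((I - A)^{-1} - I) L, the closed form of the
    Neumann series summing the weights of all paths between labeled vertices."""
    I = np.eye(A.shape[0])
    return L.T @ (np.linalg.inv(I - A) - I) @ L

def graph_kernel(G1, G2):
    """Equation (2): sum of the products of the common relations' weights."""
    return float(np.sum(G1 * G2))

# Toy graph: 3 vertices, 2 labels; shortest-path edges weighted 0.9, others 0.3.
A = np.array([[0.0, 0.9, 0.0],
              [0.9, 0.0, 0.3],
              [0.0, 0.3, 0.0]])
L = np.array([[1.0, 0.0],   # vertex 0 carries label 0
              [0.0, 1.0],   # vertex 1 carries label 1
              [1.0, 0.0]])  # vertex 2 carries label 0
G = graph_matrix(A, L)
k = graph_kernel(G, G)      # self-similarity of the toy graph
```

Because $A$ here is symmetric, the resulting $G$ is a symmetric $2 \times 2$ matrix of label-to-label path-weight sums.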

3.3. Cotraining Algorithm. Cotraining is a semisupervised learning algorithm used when there are only small amounts of labeled data and large amounts of unlabeled data, and it requires two views of the data [20]. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Cotraining first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data. In our experiment, the two different feature sets used to describe a sentence are word features and grammatical features (the graph kernel features extracted by the Stanford Parser), respectively.
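The cotraining loop can be sketched as follows. This is a simplified, self-contained version, not the authors' system: a nearest-centroid scorer stands in for the two SVM classifiers, the data are synthetic, and the confident-example selection mirrors the top/bottom-scoring scheme described later in Section 4.2 (each view labels examples for the other view's training set).

```python
import numpy as np

def centroid_scores(X_train, y_train, X):
    """Toy confidence scorer standing in for an SVM decision function:
    higher score = closer to the positive-class centroid."""
    c_pos = X_train[y_train == 1].mean(axis=0)
    c_neg = X_train[y_train == 0].mean(axis=0)
    return np.linalg.norm(X - c_neg, axis=1) - np.linalg.norm(X - c_pos, axis=1)

def cotrain(X1, X2, y, labeled, pool, rounds=3, k=2):
    """Cotraining over two feature views X1, X2 of the same examples.
    Each round, each view labels its most confident pool examples
    (top k as positive, bottom k as negative) for the OTHER view."""
    y = y.copy()
    t1, t2 = list(labeled), list(labeled)   # per-view training indices
    pool = list(pool)
    for _ in range(rounds):
        if len(pool) < 2 * k:
            break
        newly = set()
        for X, train, other in ((X1, t1, t2), (X2, t2, t1)):
            s = centroid_scores(X[train], y[train], X[pool])
            order = np.argsort(s)
            for i in order[-k:]:            # most confident positives
                y[pool[i]] = 1
                other.append(pool[i]); newly.add(pool[i])
            for i in order[:k]:             # most confident negatives
                y[pool[i]] = 0
                other.append(pool[i]); newly.add(pool[i])
        pool = [p for p in pool if p not in newly]
    return t1, t2, y

# Synthetic two-view data: 40 examples, 2 features per view, 4 labeled seeds.
rng = np.random.default_rng(0)
truth = np.array([0] * 20 + [1] * 20)
X1 = rng.normal(truth[:, None] * 4.0, 1.0, size=(40, 2))
X2 = rng.normal(truth[:, None] * 4.0, 1.0, size=(40, 2))
labeled = [0, 1, 20, 21]
pool = [i for i in range(40) if i not in labeled]
t1, t2, y = cotrain(X1, X2, truth.copy(), labeled, pool)
```

After each round both training sets have grown by the examples the opposite view labeled most confidently, which is the mechanism the AB and BC models rely on below.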

4. Classification Models

4.1. Evaluation Metrics. In our study, the evaluation metrics precision ($P$), recall ($R$), and F-score ($F$) are employed, which are defined as follows:

$$P = \frac{TP}{TP + FP} \times 100\%,$$
$$R = \frac{TP}{TP + FN} \times 100\%,$$
$$F = \frac{2 \times P \times R}{P + R}. \quad (3)$$

TP is the number of correctly predicted pieces of data, FP is the number of false positives, and FN is the number of false negatives in the test set. $P$ is used to evaluate the accuracy of a model, $R$ is the recall rate, and $F$ is the harmonic mean of the precision and recall rates.

In addition, we define the effective linking terms as the terms in a sentence which can connect the initial term and the target term. For example, in the case of Raynaud's disease and fish oil, the concepts of platelet aggregation and blood vessel contraction are all effective linking terms. The proportion of the effective linking terms among all linking terms is defined as

$$S = \frac{n}{N} \times 100\%, \quad (4)$$

where $n$ is the number of effective linking terms and $N$ is the number of all the terms which have relationships with the initial terms.
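The metrics in (3) and (4) are straightforward to compute; a small sketch (the counts passed in are illustrative, except the 10/106 pair, which is the first row of Table 5 below):

```python
def precision_recall_f(tp, fp, fn):
    """Equation (3): precision and recall as percentages, F as their harmonic mean."""
    p = tp / (tp + fp) * 100
    r = tp / (tp + fn) * 100
    f = 2 * p * r / (p + r)
    return p, r, f

def effective_ratio(n_effective, n_total):
    """Equation (4): proportion of effective linking terms, as a percentage."""
    return n_effective / n_total * 100

p, r, f = precision_recall_f(tp=50, fp=50, fn=0)   # illustrative counts
ratio = effective_ratio(10, 106)                   # ~9.43, cf. Table 5
```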

4.2. Training AB Model. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence.

We obtain all the sentences (the corpus) used in our experiment by searching the Semantic Medline Database with 200 different semantic types about "disease or syndrome" defined by MeSH (Medical Subject Headings). Then all the sentences are processed with MetaMap for Named Entity Recognition (NER), and, to limit the semantic types of the initial terms and linking terms in a sentence, we filter out the sentences which contain concepts of the unneeded semantic types. Finally, we obtain a total of 20895 sentences, and we randomly choose only 1000 sentences (approximately 5% of all the sentences) for manual annotation; from these we build two initial labeled data sets, an initial training set $T_{initial}$ (500 labeled sentences) and a test set (the other 500 labeled sentences). There are two reasons why we randomly choose only 1000 sentences for manual annotation. First, manual annotation is very time consuming and expensive, while unlabeled training data are usually much easier to obtain; this is also why we introduce the cotraining method to improve the performance of our experiment. The other reason is that, when cotraining style algorithms are executed, the number of labeled examples is usually small [21], so we select about 5% of the corpus as the training set and test set, and the other 95% of the corpus is used for extending the training set later.

Table 4: Annotation results of the two annotators.

                  | Reviewer 1
Reviewer 2        | Positive | Negative | Kappa score
Positive          | 385      | 22       |
Negative          | 43       | 550      |
Ours              |          |          | 0.8664
Light et al. [16] |          |          | 0.68

During the manual annotation, the following criteria are applied: if a sentence contains a relationship belonging to the semantic type list (see Table 1) between two entity concepts, we consider there to be a positive correlation between the two entities and label the sentence as a positive example. In addition, some special relationships such as "B in A" and "A can change B" are also classified as positive examples, since they mean that a physiological phenomenon (B) occurs when someone has the disease (A). If there is no obvious relationship in a sentence but only a cooccurrence relationship, we label it as a negative example. For patterns such as "A is a B" and "A and B," we label them as negative examples, since "A is a B" is an "IS A" relationship and "A and B" is a coordination relationship, and these are not the relationships we need. When the process was completed, we estimated the level of agreement: the Cohen's kappa [22] score between the annotators, as shown in Table 4, is 0.8664, and content analysis researchers generally regard a Cohen's kappa score of more than 0.8 as good reliability [22]. Comparably, Light et al. achieved their highest kappa value (0.68) in their manual annotation experiments on Medline abstracts [16].
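The kappa score in Table 4 can be reproduced directly from the contingency counts (observed agreement versus the chance agreement implied by each annotator's marginals):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 agreement table:
    a = both positive, b = reviewer 2 positive / reviewer 1 negative,
    c = reviewer 2 negative / reviewer 1 positive, d = both negative."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from the two annotators' marginal distributions.
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_chance) / (1 - p_chance)

kappa = cohens_kappa(385, 22, 43, 550)   # counts from Table 4
# round(kappa, 4) -> 0.8664
```

With the Table 4 counts this yields 0.8664, matching the value reported above.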

As shown in Figure 4, the process of training the AB model is as follows. At first, two initial SVM classifiers $M_1^0$ (graph-based kernel) and $M_2^0$ (feature-based kernel) are trained using the initial training set $T_{initial}$, which contains 500 labeled examples. The test results of the two classifiers are as follows: with classifier $M_1^0$, the values of precision, recall, and F-score are 72.88%, 83.33%, and 77.76%, respectively; at the same time, the results of classifier $M_2^0$ are 74.35%, 76.33%, and 75.33%, respectively.

In the next step, 2000 unlabeled sentences (a section of the unlabeled corpus) are predicted by classifiers $M_1^0$ and $M_2^0$, respectively. The classifier $M_1^0$ selects the top 200 scores (10% of the 2000 results classified by $M_1^0$) as positive examples and the bottom 200 scores as negative examples; we then add these 400 newly labeled examples to the data set $T_{initial}$ (500) to form a new training set $T_2^1$ (900 examples). At the same time, the classifier $M_2^0$ selects the top 200 scores (10% of the 2000 results classified by $M_2^0$) as positive examples and the bottom 200 scores as negative examples; we then add these 400 newly labeled examples to the data set $T_{initial}$ (500) to form a new training set $T_1^1$ (900 examples). Then two new classifiers $M_1^1$ and $M_2^1$ are trained from $T_1^1$ and $T_2^1$, respectively. After that, 3000 unlabeled sentences (a section of the unlabeled corpus) are classified by $M_1^1$ and $M_2^1$, respectively. $M_1^1$ selects the top 300 scores (10% of the 3000 results classified by $M_1^1$) as positive examples and the bottom 300 scores as negative examples; we use these 600 newly labeled examples to update $T_2^1$ and obtain a new training set $T_2^2$, while $M_2^1$ also selects 600 newly labeled examples to update $T_1^1$, yielding the new training set $T_1^2$. Such a process is repeated for a preset number of learning rounds.

Figure 4: The process of training the AB model.

In our experiment, the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively. The size of the training set thus increases from the initial 500 ($T_{initial}$) to 900 ($T_1^1$ and $T_2^1$), 1500 ($T_1^2$ and $T_2^2$), 2300 ($T_1^3$ and $T_2^3$), 3300 ($T_1^4$ and $T_2^4$), and 4500 ($T_1^5$ and $T_2^5$), respectively. There are two reasons why we set five learning rounds. First, after five learning rounds there are no extra unlabeled sentences: 95% of the corpus (of 20895 sentences) is used for extending the training set, and after five expansions of the training set there are no unlabeled data left. Second, the cotraining process cannot improve the performance further after a number of rounds [21], so we should terminate cotraining at an appropriate round to avoid wasteful learning rounds.
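The training-set growth above is easy to check: each round adds the top 10% (as positives) and the bottom 10% (as negatives) of that round's extended set.

```python
def training_set_sizes(initial, extended_sets, fraction=0.10):
    """Training set size after each cotraining round: every round adds
    2 * fraction of that round's extended set of unlabeled sentences."""
    sizes, total = [], initial
    for n in extended_sets:
        total += 2 * round(fraction * n)
        sizes.append(total)
    return sizes

sizes = training_set_sizes(500, [2000, 3000, 4000, 5000, 6000])
# -> [900, 1500, 2300, 3300, 4500], matching the sizes reported above
```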

The best testing result is from the second learning round (the classifier is $M_1^2$ (graph kernel) and the training set is $T_1^2$); an F-score of 77.8% is achieved, as shown in Figure 5. After five rounds of cotraining learning, although there are fluctuations, the F-score of the feature-based kernel achieves a continuous improvement and finally reaches 76.82%.

Figure 5: The results of training the AB and BC models in every learning round by the cotraining algorithm (feature-based and graph-based kernels of both models).

We can see from Figure 5 that the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models ($M_1^0$ and $M_2^0$) trained only on the initial training set ($T_{initial}$). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose $M_1^2$ as the AB model.

4.3. Training BC Model. The BC model is used to determine whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to that of training the AB model.

At first, as with the steps in the process of training the AB model, we obtain 20490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (approximately 5% of the corpus) to construct the initial training set $T_{initial}$ (500) and the test set (500). The training procedure is the same as that of the AB model: the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.

The best result is achieved in the fifth cotraining learning round (the classifier is $M_2^5$ (graph kernel) and the training set is $T_2^5$), and the F-score is 89.71%, as shown in Figure 5. The highest F-score (83.3%) of the feature-based kernel is achieved in the third cotraining learning round.

Like the results of the AB model, the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models ($M_1^0$ and $M_2^0$) trained only on the initial training set ($T_{initial}$), which proves the validity of the cotraining method again. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose $M_2^5$ as the BC model.

In both experiments, the best results of the graph kernel outperform those of the feature-based model. The reason is that not only do the features selected by the graph kernel include most of the features adopted by the feature-based kernel, but the graph kernel approach also captures the information in unrestricted dependency graphs: it combines syntactic analysis with a representation of the linear order of the sentence and considers all possible paths connecting any two vertices in the resulting graph.

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson hypotheses, that is, "Raynaud's disease and fish oil," "migraine and magnesium," and "Alzheimer's disease and indomethacin." We compare the results generated by our method with the results generated by using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is presented as a benchmark.

5.1. Closed Discovery Experiment. In our study, the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon," and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database (the time is limited to before 1986, since Swanson found this classic hypothesis in 1986, and from then on the hypothesis is considered public knowledge). We thus obtain all the sentences that contain either the initial terms or the target terms, and then all the sentences are processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms, and at the same time we use the BC model to classify all the sentences containing the target terms; finally, we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model, and the other is produced by the BC model).
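Schematically, this final step is a set intersection over the two models' positive predictions (the concept identifiers below are hypothetical placeholders, not real UMLS CUIs):

```python
# Linking terms (hypothetical identifiers) judged positive by each model.
ab_positive = {"C_blood_viscosity", "C_platelet_aggregation",
               "C_vessel_contraction", "C_fatigue"}        # disease (AB) side
bc_positive = {"C_blood_viscosity", "C_platelet_aggregation",
               "C_appetite"}                                # substance (BC) side

# Effective linking terms: supported from BOTH the disease side
# and the substance side.
effective_linking_terms = ab_positive & bc_positive
```

Only terms that both models assert survive, which is what makes them candidate bridges between the initial term and the target term.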

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms retrieved from SemRep instead of produced by the AB and BC models (one set is retrieved with the initial terms and the other with the target terms). Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method; the results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

Table 5: The number and proportion of effective linking terms.

Hypothesis                           | Method     | Effective linking terms/linking terms | Ratio (%)
Raynaud's disease and fish oils      | Our method | 10/106                                | 9.43
                                     | SemRep     | 3/132                                 | 2.27
Migraine and magnesium               | Our method | 67/297                                | 22.56
                                     | SemRep     | 10/187                                | 5.35
Alzheimer's disease and indomethacin | Our method | 47/251                                | 18.72
                                     | SemRep     | 9/87                                  | 10.34

Figure 6: The result of closed discovery of "Raynaud's disease and fish oils." The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean that more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of "Raynaud's disease with fish oils" discovered by the two methods. Our method not only found the linking term "blood viscosity," also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity," and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; we may then make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting the vascular reactivity (i.e., vasoconstriction, vascular constriction (function)) which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performances of our machine learning based AB and BC models are much better than that of SemRep, which means our method can return more effective linking terms, and therefore it may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery using the SemRep Database (Figure 7), first, the initial terms (A terms) are specified by providing the concept "Migraine Disorder," which is used as a keyword to retrieve data from the SemRep Database; we then obtain all the sentences which contain both the initial term (Migraine Disorder) and linking terms (other entities). Second, we filter all the sentences obtained from step one; the specific rules are the same as those mentioned for closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list). The third step is similar to the first: all the linking terms (extracted from every sentence in the filtering step) are used as keywords to retrieve data from the SemRep Database, and we obtain all the sentences which contain both linking terms (B terms) and target terms (C terms). At last, we rank all the target terms and output the results; the scoring rules of [24] are applied to rank the target terms.

Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils: blood viscosity; adverse event associated with vascular; mental concentration; component of protein; autoimmune; Mediator brand of benfluorex hydrochloride; physiological reperfusion.
Migraine and magnesium: arteriospasm; vasodilation; desiccated thyroid; homovanillate; beta-adrenoceptor activity; regional blood flow; recognition (psychology); container status identified; hydroxide ion; muscular dystrophy.
Alzheimer's disease and indomethacin: metabolic rates; pituitary hormone preparation; desiccated thyroid; cerebrovascular circulation; normal blood pressure; P blood group antibodies; dopamine hydrochloride; HLA-DR antigens; pentose phosphate pathway; toxic epidermal necrolysis.

Figure 7: The procedure of open discovery with the SemRep Database (set the initial terms; retrieve sentences from the SemRep Database; filtering: filter sentences and obtain linking terms; retrieve data from the SemRep Database; ranking: rank the target terms; output the result).

Figure 8: The procedure of open discovery with our method (set the initial terms; retrieve sentences from the Medline Database; classify by the AB model; filtering: filter sentences and obtain linking terms; retrieve data from the Medline Database; classify by the BC model; ranking: rank the target terms; output the result).

10 BioMed Research International

Table 7: Result of open discovery (ranking of the real target term / total number of potential target terms).

Initial terms | SemRep | Our method | Combined result set
Raynaud disease and eicosapentaenoic acid | 68/380 | 230/5762 | 61/246
Migraine disorders and magnesium | 186/535 | 97/5349 | 25/297
Alzheimer's disease and indomethacin | —/650 | —/2639 | —/275
Alzheimer's disease and indoprofen | 373/650 | 250/2639 | 38/275

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), our method obtains the data in two steps: first, we obtain raw sentences from the Medline Database; second, all the sentences are classified by either the AB model or the BC model, and then we collect the set of positive examples as experimental data. The other procedures (e.g., filtering and ranking) are the same as in the SemRep method.

The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms we obtained, and the number to the left of the slash is the ranking of the target term we really want to obtain. For example, from the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, and the ranking of magnesium is 186, while 5349 target terms are discovered with our method, and the ranking of magnesium is 97. As can be seen from the above results, our method not only can obtain more potential target terms but also can get a higher ranking of the real target terms.

The higher the ranking of the real target terms is, the more valuable the results become. And more potential target terms mean more clues about the pathophysiology of disease. For example, from the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither of the two methods finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method finds

Table 8: More target terms about Alzheimer's disease.

Target terms | Alzheimer's disease
Aspirin | 1370/2639
Thromboxane | 1455/2639
Carprofen | 2291/2639
Prostaglandins | 2488/2639

more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

The results of the experiment with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by the SemRep method and our method, respectively. Then we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the terms are ranked by this score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have higher rankings than both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms will not be eliminated, since such terms usually have a real association with the initial term and are therefore retained by both methods.
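The combination step (intersection of the two result sets, with summed scores) is straightforward to sketch; the example score dictionaries below are made up purely for illustration:

```python
def combine_results(semrep_scores, our_scores):
    """Keep only terms found by both methods; each term's score is the sum
    of its scores in the two original result sets."""
    common = semrep_scores.keys() & our_scores.keys()
    combined = {term: semrep_scores[term] + our_scores[term] for term in common}
    # Rank the intersection by the summed score (higher is better).
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical scores: the real target term is supported by both methods,
# while noise terms tend to appear in only one result set.
semrep = {"fish oil": 3.0, "dextran": 5.0}
ours = {"fish oil": 4.0, "aspirin": 6.0}
print(combine_results(semrep, ours))  # ['fish oil']
```

The noise terms "dextran" and "aspirin" drop out because each appears in only one result set, which is exactly why the intersection improves the ranking of the real target term.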

6 Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses through mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method, respectively. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there


exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, the machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from literature.

The experimental results on the three classic Swanson hypotheses show that our approach can achieve better performance than SemRep. This means our approach can discover more potential correlations between the initial terms and target terms, and therefore it may help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.

[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526–557, 1988.

[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157–186, 1990.

[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548–557, 2001.

[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252–259, 2003.

[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396–413, 2004.

[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600–611, 2006.

[8] D. Cameron, O. Bodenreider, H. Yalamanchili et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.

[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.

[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552–1558, Sierre, Switzerland, March 2010.

[11] D. Hristovski, C. Friedman, T. C. Rindflesch et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349–353, 2006.

[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218–225, IEEE, October 2012.

[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209–220, Maui, Hawaii, USA, January 2007.

[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.

[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.

[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the HLT-NAACL Workshop on Linking Biological Literature, Ontologies and Databases, Detroit, Mich, USA, 2004.

[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics, 2003.

[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.

[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, supplement 11, article S2, 2008.

[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[21] W. Wang and Z. H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 454–465, Springer, Berlin, Germany, 2007.

[22] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249–254, 1996.

[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.

[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.

[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.

[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.

[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.

[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.

[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.

[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.

[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?" Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.



Figure 2: Closed discovery process. The process starts simultaneously from A and C, resulting in overlapping B terms.

Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to identify Metathesaurus concepts referred to in English text. Because every concept in the Metathesaurus has been assigned one or more semantic types, we filter the results of the text-to-concept mapping process by means of the semantic types. In the different stages of the process, we employ different semantic filters. For example, in the stage of selecting intermediate B terms, we choose the terms with functional semantic types such as biologic function; cell function; phenomenon or process; and physiologic function. When selecting dietary factors as A terms, we choose the concepts with functional semantic types such as vitamin; lipid; and element, ion, or isotope. Table 1 provides the functional semantic types that we use to filter the linking terms and target terms in our experiments.

2.3. SemRep and SemMedDB. SemRep, developed at the National Library of Medicine, is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text. For example, from the sentence in Table 2, SemRep extracts four predications (as shown in Table 2). The Semantic Medline Database (SemMedDB) is a repository of semantic predications extracted by SemRep. SemMedDB currently contains approximately 70 million predications from all PubMed citations (more than 23.5 million citations as of April 1, 2014) and forms the backbone of the Semantic Medline application [15].

2.4. General Concepts Filtering. Many general concepts will be generated after the processing of Named Entity Recognition (NER) by MetaMap. General concepts refer to terms which have relationships with many entities but carry little specific meaning, such as "disease." Because the existence of general concepts may reduce the effect of knowledge discovery, they need to be filtered. In our experiment, first we extract all the sentences with relations such as "subject ISA object" from SemMedDB, and then a general concept list is

Table 1: The semantic type filter used in our experiments.

Linking term | Target term
Biologic function | Organic chemical
Cell function | Lipid
Finding | Pharmacologic substance
Molecular function | Vitamin
Organism function | Element, ion, or isotope
Organ or tissue function |
Pathologic function |
Phenomenon or process |
Physiologic function |

constructed by collecting all the objects from the sentences. All the relations used in our experiment are PART_OF, LOCATION_OF, and ISA.
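This filtering step can be sketched as follows. The predication triples and helper names below are illustrative, not taken from the paper's implementation:

```python
# Objects of generalizing relations (ISA, PART_OF, LOCATION_OF) are collected
# into a general concept list; candidate terms on that list are discarded.
GENERALIZING_RELATIONS = {"ISA", "PART_OF", "LOCATION_OF"}

def build_general_concept_list(predications):
    """Collect the objects of generalizing relations, e.g. 'X ISA disease'."""
    return {obj for subj, rel, obj in predications if rel in GENERALIZING_RELATIONS}

def filter_general_concepts(terms, general_concepts):
    """Drop candidate terms that appear on the general concept list."""
    return [t for t in terms if t not in general_concepts]

# Illustrative predication triples in SemMedDB's subject-relation-object style.
predications = [
    ("raynaud disease", "ISA", "disease"),
    ("aorta", "PART_OF", "cardiovascular system"),
    ("fish oil", "TREATS", "raynaud disease"),
]
general = build_general_concept_list(predications)
print(filter_general_concepts(["fish oil", "disease"], general))  # ['fish oil']
```

Because "disease" appears as the object of an ISA predication, it lands on the general concept list and is removed from the candidate terms, while a specific term like "fish oil" survives.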

2.5. Stanford Parser. The Stanford Parser was built by the Stanford NLP (natural language processing) Group in the 1990s. It is a program that works out the grammatical structure of sentences, for instance, which groups of words go together (as "phrases") and which words are the subject or object of a verb. The primary functions of the Stanford Parser are to analyze and extract the syntactic structure of sentences and to perform part-of-speech (POS) tagging. The parser provides Stanford Dependencies output as well as phrase structure trees. In our method, the features of the graph kernel-based method are extracted by the Stanford Parser [17].

3. Method

For knowledge discovery, we split the traditional ABC model into two models: the AB model and the BC model. Both models are constructed by using cotraining methods. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence.

The supervised learning methods used in our experiment are all kernel-based methods, and kernel methods are effective alternatives to explicit feature extraction [18]. They retain the original representation of objects and use an object only via computing a kernel function between a pair of objects. Such a kernel function makes it possible to compute the similarity between objects without enumerating all the features. The features we employ in our experiment are as follows.

3.1. Feature Sets. A kernel can be thought of as a similarity function for pairs of objects. The following features are used in our feature-based kernel.

(i) Neighboring Words. In a sentence, the words surrounding two concepts have a significant impact on the existence of a relationship between the two concepts. In our method, the words surrounding two concepts are considered the


Table 2: A sentence and the semantic predications extracted from it by SemRep.

Sentence: We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.

Extracted predications (subject-relation-object triples):
Hemofiltration-TREATS-Patients
Digoxin overdose-PROCESS_OF-Patients
Hyperkalemia-COMPLICATES-Digoxin overdose
Hemofiltration-TREATS(INFER)-Digoxin overdose

neighboring words feature. This feature consists of all the words that are located between the two concepts and the words surrounding the two protein names, which include three words to the left of the first protein name and three words to the right of the second protein name. We add a prefix to each word to distinguish the different positions of the words. For example, the word "word" is expressed in the above three cases as "m_word," "l_word," and "r_word," respectively. When using the neighboring words feature, a dictionary of the whole corpus is established, and a sentence is represented as a Boolean vector, where "1" means the feature exists in the sentence and "0" means the feature does not exist in the sentence.

(ii) Entity Name Distance. Under the assumption that the shorter the distance (the number of words) between two entity names is, the more likely the two proteins have an interaction relation, the distance is chosen as a feature. The feature value is set as the number of words between the two entity names.

(iii) Relationship Words. Through analyzing the corpus, we make the hypothesis that if there exists a relationship between two entity concepts, there will be a greater probability that some verbs or their variants appear surrounding the two concepts, such as "activation," "induce," and "modulate." These words are used as relationship words in our method. We build a relationship words list of about 500 words, and the list is used to determine whether there is a relationship between the two entity concepts in the sentences. Boolean "1" means a relationship word exists in the sentence, and "0" means it does not exist.

(iv) Negative Words. Some negative words, such as "not," "neither," and "no," exist in some sentences, and these negative words express that there is no relationship between the two entity concepts. If a negative word and a relationship word cooccur in a sentence, it is difficult to judge whether there exists a relationship in the sentence based only on relationship words, so the negative words feature was introduced to improve this situation. Boolean "1" means a negative word exists in the sentence, and "0" means it does not exist.

For example, in sentence A, "Quercetin, one of the most representative C0596577 compounds, is involved in antiradical C0003402 and prooxidant biological processes," all the features extracted from it are shown in Table 3 (we preprocessed sentence A with MetaMap, and the two target entities are represented by their CUIs). Finally, the sentence will be represented by a feature vector.

Table 3: Features extracted from example sentence A.

Feature names | Feature values
Left words | l_the, l_most, l_representative
Words between two entity names | m_compounds, m_is, m_involved, m_in, m_antiradical
Right words | r_and, r_prooxidant, r_biological
Relationship words | involved
Protein name distance | 5
Negative word | —

3.2. Graph Kernel. In our experiment, all the sentences are parsed by the Stanford Parser to generate the output of the dependency path and the POS path. A graph kernel calculates the similarity between two input graphs by comparing the relations between common vertices (nodes). The graph kernel used in our method is the all-paths graph kernel proposed by Airola et al. [19]. The kernel represents the target pair using graph matrices based on two subgraphs, where the graph features include all nonzero elements in the graph matrices. The two subgraphs are a parse structure subgraph (PSS) and a linear order subgraph (LOS), as shown in Figure 3. PSS represents the parse structure of a sentence and includes word or link vertices. A word vertex contains its lemma and its POS, while a link vertex contains its link. Additionally, both types of vertices contain their positions relative to the shortest path. LOS represents the word sequence in the sentence and thus has word vertices, each of which contains its lemma, its relative position to the target pair, and its POS.

For the calculation, two types of matrices are used: a label matrix L and an edge matrix A. The label matrix is a (sparse) N × L matrix, where N is the number of vertices (nodes) and L is the number of labels. It represents the correspondence between labels and vertices, where L_ij is equal to 1 if the ith vertex corresponds to the jth label and 0 otherwise. The edge matrix is a (sparse) N × N matrix and represents the relation between pairs of vertices, where A_ij is a weight w_ij if the ith vertex is connected to the jth vertex and 0 otherwise. The weight is a predefined constant, whereby the edges on the shortest paths are assigned a weight of 0.9 and other edges receive a weight of 0.3. Using the Neumann series, a graph matrix G is calculated as

G = L^T (Σ_{n=1}^∞ A^n) L = L^T ((I − A)^{-1} − I) L. (1)

This matrix represents the sums of all the path weights between any pair of vertices, resulting in entries representing


Figure 3: Graph representation generated from an example sentence. The candidate interaction pair is marked as PROT1 and PROT2; the third protein is marked as PROT. The shortest path between the proteins is shown in bold. In the dependency-based subgraph, all nodes in a shortest path are specialized using a post-tag (IP). In the linear order subgraph, possible tags are before (B), middle (M), and after (A). For the other two candidate pairs in the sentence, graphs with the same structure but different weights and labels would be generated.

the strength of the relation between each pair of vertices. Using two input graph matrices G and G', the graph kernel k(G, G') is the sum of the products of the common relations' weights, given by

k(G, G') = Σ_{i=1}^L Σ_{j=1}^L G_ij G'_ij. (2)

3.3. Cotraining Algorithm. Cotraining is a semisupervised learning algorithm used when there are only small amounts of labeled data and large amounts of unlabeled data, and it requires two views of the data [20]. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Cotraining first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data. In our experiment, the two different feature sets used to describe a sentence are word features and grammatical features (graph kernel features extracted by the Stanford Parser), respectively.
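A minimal single-round sketch of this loop follows. A toy centroid-based scorer stands in for the SVM classifiers used in the paper, the data are synthetic, and the confident labels from both views are pooled into one labeled set for brevity:

```python
import numpy as np

class CentroidClassifier:
    """Toy stand-in classifier: scores examples by which class centroid is closer."""
    def fit(self, X, y):
        self.pos = X[y == 1].mean(axis=0)
        self.neg = X[y == 0].mean(axis=0)
        return self

    def score_samples(self, X):
        # Higher score = closer to the positive centroid than to the negative one.
        return np.linalg.norm(X - self.neg, axis=1) - np.linalg.norm(X - self.pos, axis=1)

def cotrain_round(views, y, lab, pool, k):
    """One cotraining round: for each view, train on the labeled set, then move
    the top-k pool examples (as positives) and bottom-k (as negatives) into the
    labeled set."""
    newly = {}
    for X in views:
        clf = CentroidClassifier().fit(X[lab], y[lab])
        order = np.argsort(clf.score_samples(X[pool]))
        for i in pool[order[-k:]]:
            newly[int(i)] = 1                 # most confident positives
        for i in pool[order[:k]]:
            newly.setdefault(int(i), 0)       # most confident negatives
    for i, label in newly.items():
        y[i] = label
    new_lab = np.concatenate([lab, np.fromiter(newly, dtype=int)])
    new_pool = np.array([i for i in pool if int(i) not in newly], dtype=int)
    return new_lab, new_pool

# Two toy views of six one-dimensional examples; indices 4 and 5 are unlabeled.
X1 = np.array([[2.0], [1.8], [-2.0], [-1.9], [1.5], [-1.6]])
X2 = X1.copy()
y = np.array([1, 1, 0, 0, 0, 0])              # last two entries are placeholders
lab, pool = np.array([0, 1, 2, 3]), np.array([4, 5])
lab, pool = cotrain_round((X1, X2), y, lab, pool, k=1)
print(y[4], y[5], len(pool))                  # 1 0 0
```

Repeating such rounds over a growing unlabeled pool gives the training-set expansion used for the AB and BC models below.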

4. Classification Models

4.1. Evaluation Metrics. In our study, the evaluation metrics precision (P), recall (R), and F-score (F) are employed, which are defined as follows:

P = TP / (TP + FP) × 100%,
R = TP / (TP + FN) × 100%, (3)
F = (2 × P × R) / (P + R).

TP is the number of correctly predicted pieces of data, FP is the number of false positives, and FN is the number of false negatives in the test set. P is used to evaluate the accuracy of a model, R is the recall rate, and F is the harmonic mean of the precision and recall rates.

In addition, we define the effective linking terms as the terms in a sentence which can connect the initial term and the target term. For example, in the case of Raynaud's disease and fish oil, the concepts of platelet aggregation and blood vessel contraction are all effective linking terms. The proportion of the effective linking terms among all linking terms is defined as follows:

S = n / N × 100%, (4)

where n is the number of effective linking terms and N is the number of all the terms which have relationships with initial terms.
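These metrics are simple to compute directly. The sanity check at the end uses the precision and recall reported later for the initial graph-kernel AB classifier (P = 72.88, R = 83.33) and reproduces its F-score:

```python
def precision_recall_f(tp, fp, fn):
    """P = TP/(TP+FP) * 100%, R = TP/(TP+FN) * 100%, F = 2PR/(P+R)."""
    p = tp / (tp + fp) * 100
    r = tp / (tp + fn) * 100
    return p, r, 2 * p * r / (p + r)

def effective_linking_share(n, N):
    """S = n / N * 100%: proportion of effective linking terms."""
    return n / N * 100

# Sanity check: P = 72.88 and R = 83.33 should yield F of about 77.76.
p, r = 72.88, 83.33
print(round(2 * p * r / (p + r), 2))  # 77.76
```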

4.2. Training AB Model. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence.

We obtain all the sentences (the corpus) used in our experiment through searching the Semantic Medline Database with 200 different semantic types related to "disease or syndrome" defined by MeSH (Medical Subject Headings). Then all the sentences are processed with MetaMap for Named Entity Recognition (NER), and, to limit the semantic types of initial terms and linking terms in a sentence, we filter out the sentences which contain concepts of unneeded semantic types. Finally, we obtain a total of 20895 sentences, and we randomly choose only 1000 sentences (about 5% of all the sentences) for manual annotation. We then build two initial labeled data sets: an initial training set T_initial (500 labeled sentences) and a test set (the other 500 labeled sentences). There are two reasons why we randomly choose 1000 sentences for manual annotation. First, manual annotation is very time-consuming and expensive, while unlabeled training data are usually much easier to obtain; this is also why we introduce the cotraining method to improve the performance of our experiment. The other reason is that when cotraining style algorithms are executed, the


Table 4: Annotation values of the two annotators.

Reviewer 2 | Reviewer 1: Positive | Reviewer 1: Negative | Kappa score
Positive | 385 | 22 |
Negative | 43 | 550 |
Ours | | | 0.8664
Light et al. [16] | | | 0.68

number of labeled examples is usually small [21], so we select about 5% of the corpus as the training set and test set, and the other 95% of the corpus is used for extending the training set later.

During the manual annotation, the following criteria are applied: if a sentence contains a relationship belonging to the semantic type list (see Table 1) between two entity concepts, we consider there to be a positive correlation between the two entities and label the sentence as a positive example. In addition, some special relationships, such as "B in A" and "A can change B," are also classified as positive examples, since they mean a physiological phenomenon (B) occurs when someone has the disease (A). If there is no obvious relationship in a sentence, only a cooccurrence relationship, we label it as a negative example. For patterns such as "A is a B" and "A and B," we label them as negative examples, since "A is a B" is an ISA relationship and "A and B" is a coordination relationship, and they are not the relationships we need. When the process was completed, we estimated the level of agreement: the Cohen's kappa [22] score between the annotators, as shown in Table 4, is 0.8664, and content analysis researchers generally regard a Cohen's kappa score of more than 0.8 as good reliability [22]. Comparably, Light et al. achieved their highest kappa value (0.68) in their manual annotation experiments on Medline abstracts [16].
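Cohen's kappa can be computed directly from the 2 × 2 agreement table; the counts from Table 4 reproduce the reported score:

```python
def cohens_kappa(a, b, c, d):
    """Kappa from a 2x2 agreement table: a and d are the agreement counts
    (both positive / both negative), b and c the disagreement counts."""
    total = a + b + c + d
    po = (a + d) / total                                      # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / total**2   # chance agreement
    return (po - pe) / (1 - pe)

# Counts from Table 4: 385 both-positive, 22 and 43 disagreements, 550 both-negative.
print(round(cohens_kappa(385, 22, 43, 550), 4))  # 0.8664
```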

As shown in Figure 4, the process of training the AB model is as follows. At first, two initial SVM classifiers M^0_1 (graph-based kernel) and M^0_2 (feature-based kernel) are trained using the initial training set T_initial, which contains 500 labeled examples. The test results of the two classifiers are as follows: with classifier M^0_1, the values of precision, recall, and F-score are 72.88%, 83.33%, and 77.76%, respectively, and at the same time the results of classifier M^0_2 are 74.35%, 76.33%, and 75.33%, respectively.

In the next step 2000 unlabeled sentences (a section ofthe unlabeled corpus) are predicated by classifiers M0

1 andM0

2 respectively The classifier M01 selects the top 200 scores

(10of 2000 results classified byM01) as the positive examples

and the bottom 200 scores as the negative examples then weput these 400 newly labeled examples into the data set Tinitial(500) as a new training set T1

2 (900 examples) And at the sametime the classifierM0

2 selects the top 200 scores (10 of 2000results classified by M0

2) as the positive examples and thebottom 200 scores as the negative examples thenwe put these400 newly labeled examples into the data set Tinitial (500) as anew training set T1

1 (900 examples) Then two new classifiersM1

1 and M12 are trained from T1

1 and T12 respectively After
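The precision, recall, and F-score figures above follow the standard definitions. The sketch below reproduces the reported initial graph-kernel scores from hypothetical confusion counts (the actual counts are not given in the paper):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts chosen so the scores match M1^0's reported results.
p, r, f = prf(tp=430, fp=160, fn=86)
print(f"{p:.2%} {r:.2%} {f:.2%}")  # 72.88% 83.33% 77.76%
```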

After that, 3000 unlabeled sentences (a section of the unlabeled corpus) are classified by M1^1 and M2^1, respectively. M1^1 selects the top 300 scores (10% of the 3000 results it classified) as positive examples and the bottom 300 scores as negative examples; the 600 newly labeled examples are used to update T2^1, and we obtain a new training set T2^2, while M2^1 also selects 600 newly labeled examples to update T1^1, yielding the new training set T1^2. Such a process is repeated for a preset number of learning rounds.

[Figure 4: The process of training the AB model. A flowchart from the initial training set T_initial and classifiers M1^0 and M2^0 through successive batches of 2000 to 6000 unlabeled sentences; each round's predictions add 400 to 1200 newly labeled examples, forming training sets T1^1/T2^1 through T1^5/T2^5, which train classifiers up to M1^5 and M2^5.]

In our experiment, the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively. The size of the training set thus increases from the initial 500 (T_initial) to 900 (T1^1 and T2^1), 1500 (T1^2 and T2^2), 2300 (T1^3 and T2^3), 3300 (T1^4 and T2^4), and 4500 (T1^5 and T2^5), respectively. There are two reasons why we set five learning rounds. First, after five learning rounds there are no extra unlabeled sentences: 95% of the corpus (20,895 sentences) is used for extending the training set, and after five expansions of the training set no unlabeled data remain. Second, the cotraining process cannot improve the performance further after a number of rounds [21], so cotraining should be terminated at an appropriate round to avoid wasteful learning rounds.
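The round-by-round procedure above can be sketched as a generic cotraining loop (a simplified sketch after Blum and Mitchell [20]; `fit` and `score` stand in for the SVM training and confidence scoring, and the toy demo uses plain numbers rather than sentences):

```python
def cotrain(train1, train2, unlabeled, batch_sizes, fit, score):
    """Each round, each classifier labels a fresh batch of unlabeled items;
    its top/bottom 10% by confidence are added as positive/negative examples
    to the OTHER classifier's training set (cross-feeding the two views)."""
    start = 0
    for size in batch_sizes:                       # e.g. [2000, 3000, 4000, 5000, 6000]
        batch = unlabeled[start:start + size]
        start += size
        m1, m2 = fit(train1), fit(train2)
        k = size // 10
        for model, target in ((m1, train2), (m2, train1)):
            ranked = sorted(batch, key=lambda s: score(model, s), reverse=True)
            target.extend((s, 1) for s in ranked[:k])    # highest scores -> positives
            target.extend((s, 0) for s in ranked[-k:])   # lowest scores -> negatives
    return fit(train1), fit(train2)

# Toy demo: "sentences" are numbers and confidence is the value itself;
# one batch of 10 with k = 1 adds one positive and one negative to each set.
t1, t2 = [(9.0, 1), (-9.0, 0)], [(8.0, 1), (-8.0, 0)]
cotrain(t1, t2, unlabeled=list(range(10)), batch_sizes=[10],
        fit=lambda data: None, score=lambda model, s: s)
print(len(t1), len(t2))  # 4 4
```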

BioMed Research International 7

[Figure 5: The results of training the AB and BC models at every learning round by the cotraining algorithm (F-score versus learning round, Initial through 5th, for the feature-based and graph-based kernels of both models).]

The best testing result is from the second learning round (the classifier is M1^2 (graph kernel) and the training set is T1^2): an F-score of 77.8% is achieved, as shown in Figure 5. After five rounds of cotraining, although there are fluctuations, the F-score of the feature-based kernel achieves a continuous improvement and finally reaches 76.82%.

We can see from Figure 5 that the best results of the two models (feature-based and graph kernel-based) are both better than the results of the models (M1^0 and M2^0) trained only on the initial training set (T_initial). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose M1^2 as the AB model.

4.3. Training the BC Model. The BC model is used to determine whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to the process of training the AB model.

At first, as with the steps in the process of training the AB model, we obtain 20,490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (about 5% of the corpus) to construct the initial training set T_initial (500) and the test set (500). The process of training the BC model is the same as that of the AB model: the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.

The best result is achieved in the fifth cotraining round (the classifier is M2^5 (graph kernel) and the training set is T2^5), and the F-score is 89.71%, as shown in Figure 5. The highest F-score (83.3%) of the feature-based kernel is achieved in the third cotraining round.

Like the results of the AB model, the best results of the two models (feature-based and graph kernel-based) are both better than the results of the models (M1^0 and M2^0) trained only on the initial training set (T_initial), which again proves the validity of the cotraining method. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose the fifth-round graph-kernel classifier M2^5 as the BC model.

In both experiments the best results of the graph kernel outperform those of the feature-based model. The reason is that the features selected by the graph kernel not only include most of the features adopted by the feature-based kernel, but the graph kernel approach also captures the information in unrestricted dependency graphs, which combines syntactic analysis with a representation of the linear order of the sentence and considers all possible paths connecting any two vertices in the resulting graph.

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson hypotheses, that is, "Raynaud's disease and fish oil", "migraine and magnesium", and "Alzheimer's disease and indomethacin". We compare the results generated by our method with the results generated by using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is presented as a benchmark.

5.1. Closed Discovery Experiment. In our study, the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon", and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database, respectively (the time is limited to before 1986, since Swanson found the classic hypothesis in 1986, and from then on the hypothesis is considered public knowledge). Then we obtain all the sentences that contain either the initial terms or the target terms, and all the sentences are processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms, and at the same time we use the BC model to classify all the sentences containing the target terms; finally, we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model and the other is produced by the BC model).
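The intersection step can be sketched as follows (the term sets are illustrative, not actual model output):

```python
def effective_linking_terms(ab_positive_terms, bc_positive_terms):
    """Closed discovery: keep only the linking terms that appear both in the
    AB model's positive sentences (initial-term side) and in the BC model's
    positive sentences (target-term side)."""
    return set(ab_positive_terms) & set(bc_positive_terms)

# Illustrative linking terms extracted around the initial and target terms.
ab_terms = {"blood viscosity", "platelet aggregation", "mental concentration"}
bc_terms = {"blood viscosity", "platelet aggregation", "serum lipids"}
print(sorted(effective_linking_terms(ab_terms, bc_terms)))
# ['blood viscosity', 'platelet aggregation']
```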

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms retrieved from the SemRep Database instead of using the AB and BC models: one set is retrieved with the initial terms and the other with the target terms. Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson


Table 5: The number and scale of the useful linking concepts.

Hypothesis | Method | Effective linking terms / linking terms | Ratio
Raynaud's disease and fish oils | Our method | 10/106 | 9.43%
Raynaud's disease and fish oils | SemRep | 3/132 | 2.2%
Migraine and magnesium | Our method | 67/297 | 22.56%
Migraine and magnesium | SemRep | 10/187 | 5.35%
Alzheimer's disease and indomethacin | Our method | 47/251 | 18.72%
Alzheimer's disease and indomethacin | SemRep | 9/87 | 10.34%

[Figure 6: The result of closed discovery of "Raynaud's disease and fish oils": a network linking Raynaud disease and fish oils through linking terms such as blood viscosity, physiological reperfusion, Mediator brand of benfluorex hydrochloride, autoimmune, mental concentration, adverse event associated with vascular, antimicrobial susceptibility, diabetic, encounter due to therapy, and equilibrium. The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods.]

hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method; the results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean more possibly useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of Raynaud's disease with fish oils discovered by the two methods. It shows that our method not only found the linking term "blood viscosity", also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity", and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; we may then make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting vascular reactivity (i.e., vasoconstriction, vascular constriction (function)), which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performance of our machine learning based AB and BC models is much better than that of SemRep, which means our method can return more effective linking terms and therefore may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery using the SemRep Database (Figure 7), first the initial terms (A terms) are specified by providing the concept "Migraine Disorder", which is used as a keyword to retrieve data from the SemRep Database. Then we obtain all the sentences which contain both the initial term (Migraine


Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils: blood viscosity; adverse event associated with vascular; mental concentration; component of protein; autoimmune; Mediator brand of benfluorex hydrochloride; physiological reperfusion.
Migraine and magnesium: arteriospasm; vasodilation; desiccated thyroid; homovanillate; beta-adrenoceptor activity; regional blood flow; recognition (psychology); container status identified; hydroxide ion; muscular dystrophy.
Alzheimer's disease and indomethacin: metabolic rates; pituitary hormone preparation; desiccated thyroid; cerebrovascular circulation; normal blood pressure; P blood group antibodies; dopamine hydrochloride; HLA-DR antigens; pentose phosphate pathway; toxic epidermal necrolysis.

[Figure 7: The procedure of open discovery with the SemRep Database: set the initial terms; retrieve sentences from the SemRep Database; filter the sentences and obtain linking terms; retrieve data from the SemRep Database; rank the target terms; output the result.]

Disorder) and linking terms (other entities). Second, we filter all the sentences obtained from step one; the specific rules are the same as those mentioned for closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first step: all the linking terms (extracted from every sentence in the filtering step) are used as keywords to retrieve data from the SemRep Database, and we then get all the sentences which contain both the linking terms (B terms) and target terms (C

[Figure 8: The procedure of open discovery with our method: set the initial terms; retrieve sentences from the Medline Database; classify them with the AB model; filter the sentences and obtain linking terms; retrieve data from the Medline Database; classify the sentences with the BC model; rank the target terms; output the result.]

terms). At last, we rank all the target terms and output the results; the scoring rules of [24] are applied to rank the target terms.
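The ranking step can be sketched with a simple stand-in scorer (the actual scoring rules are those of Wren et al. [24]; here, purely for illustration, each target term is scored by how many distinct linking terms connect it to the initial term):

```python
from collections import Counter

def rank_target_terms(bc_pairs):
    """bc_pairs: (linking_term, target_term) pairs extracted from the BC-side
    sentences; a target term's score is its number of distinct linking terms."""
    links = Counter()
    for linking, target in set(bc_pairs):          # count distinct pairs only
        links[target] += 1
    return [term for term, _ in links.most_common()]

# Hypothetical (linking term, target term) pairs for illustration.
pairs = [("vasodilation", "magnesium"), ("arteriospasm", "magnesium"),
         ("vasodilation", "calcium"), ("vasodilation", "magnesium")]
print(rank_target_terms(pairs))  # ['magnesium', 'calcium']
```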

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database, and in the second step all the sentences are classified by either the AB model or the BC model; we then collect the set of positive examples as experimental data. The other procedures (filtering and ranking) are the same as in the SemRep method.

Table 7: Result of open discovery.

Initial terms | SemRep | Our method | Combined result set
Raynaud disease and eicosapentaenoic acid | 68/380 | 230/5762 | 61/246
Migraine disorders and magnesium | 186/535 | 97/5349 | 25/297
Alzheimer's disease and indomethacin | —/650 | —/2639 | —/275
Alzheimer's disease and indoprofen | 373/650 | 250/2639 | 38/275

The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms we obtained, and the number to the left of the slash is the ranking of the target term we really want to obtain. For example, from the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method and the ranking of magnesium is 186, while 5349 target terms are discovered with our method and the ranking of magnesium is 97. As can be seen from the above results, our method not only obtains more potential target terms but also gives a higher ranking to the real target terms.

The higher the ranking of the real target term is, the more valuable the results become, and more potential target terms mean more clues about the pathophysiology of a disease. For example, from the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither of the two methods finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25-27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method finds

Table 8: More target terms about Alzheimer's disease.

Target terms | Alzheimer's disease
Aspirin | 1370/2639
Thromboxane | 1455/2639
Carprofen | 2291/2639
Prostaglandins | 2488/2639

more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28-30] and thromboxane [31]. The detailed results are shown in Table 8.

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by SemRep and our method, respectively; then we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the terms are ranked by this score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have a higher ranking than both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms are not, since such terms usually have a real association with the initial term and are therefore retained by both methods.
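A sketch of this combination step, assuming each method returns a score per candidate target term (the scores below are made up for illustration):

```python
def combine_results(semrep_scores, our_scores):
    """Keep only terms found by both methods; a term's combined score is the
    sum of its two original scores, and terms are ranked by that sum."""
    common = set(semrep_scores) & set(our_scores)
    combined = {term: semrep_scores[term] + our_scores[term] for term in common}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical target-term scores from the two open-discovery runs.
semrep = {"magnesium": 9.1, "calcium": 4.0, "serotonin": 6.5}
ours = {"magnesium": 8.7, "serotonin": 2.1, "histamine": 5.0}
print(combine_results(semrep, ours))  # ['magnesium', 'serotonin']
```

Terms found by only one method ("calcium", "histamine") drop out, which is why the combined set is smaller but ranks the real target term higher.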

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses by mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method, respectively. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there


exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, by combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature.

The experimental results on the three classic Swanson hypotheses show that our approach can achieve better performance than SemRep. This means our approach can discover more potential correlations between the initial terms and target terms, and therefore it may help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.

[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526–557, 1988.

[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157–186, 1990.

[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548–557, 2001.

[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252–259, 2003.

[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396–413, 2004.

[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600–611, 2006.

[8] D. Cameron, O. Bodenreider, H. Yalamanchili et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.

[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.

[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552–1558, Sierre, Switzerland, March 2010.

[11] D. Hristovski, C. Friedman, T. C. Rindflesch et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349–353, 2006.

[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218–225, IEEE, October 2012.

[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209–220, Maui, Hawaii, USA, January 2007.

[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.

[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.

[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the HLT-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and Databases, Detroit, Mich, USA, 2004.

[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics, 2003.

[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.

[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, supplement 11, article S2, 2008.

[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[21] W. Wang and Z. H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 454–465, Springer, Berlin, Germany, 2007.

[22] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249–254, 1996.

[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.

[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.

[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.

[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.

[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.

[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.

[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.

[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.

[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?" Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.


Table 2: The semantic type filter used in our experiments.

Sentence: We used hemofiltration to treat a patient with digoxin overdose that was complicated by refractory hyperkalemia.

Extracted predications (subject-relation-object triples):
Hemofiltration-TREATS-Patients
Digoxin overdose-PROCESS OF-Patients
Hyperkalemia-COMPLICATES-Digoxin overdose
Hemofiltration-TREATS(INFER)-Digoxin overdose

(i) Neighboring Words. This feature consists of all the words located between the two concepts, together with the words surrounding the two protein names: three words to the left of the first protein name and three words to the right of the second protein name. We add a prefix to each word to distinguish the different positions; for example, the word "word" is expressed in the above three cases as "m_word", "l_word", and "r_word", respectively. When using the neighboring word feature, a dictionary of the whole corpus is established, and a sentence is represented as a Boolean vector, where "1" means the feature exists in the sentence and "0" means it does not.

(ii) Entity Name Distance. Under the assumption that the shorter the distance (the number of words) between two entity names is, the more likely the two proteins have an interaction relation, the distance is chosen as a feature. The feature value is set to the number of words between the two entity names.

(iii) Relationship Words. Through analyzing the corpus, we make the hypothesis that if there exists a relationship between two entity concepts, there is a greater probability that certain verbs or their variants appear around the two concepts, such as "activation", "induce", and "modulate". These words are used as relationship words in our method. We build a relationship-word list of about 500 words, and the list is used to determine whether there is a relationship between the two entity concepts in a sentence. Boolean "1" means a relationship word exists in the sentence and "0" means it does not.

(iv) Negative Words. Some negative words, such as "not", "neither", and "no", exist in some sentences, and these negative words express that there is no relationship between the two entity concepts. If a negative word and a relationship word cooccur in a sentence, it is difficult to judge whether there exists a relationship based only on relationship words, so the negative-word feature was introduced to improve the situation. Boolean "1" means a negative word exists in the sentence and "0" means it does not.

For example, for sentence A, "Quercetin, one of the most representative C0596577 compounds, is involved in antiradical C0003402 and prooxidant biological processes", all the features extracted from it are shown in Table 3 (we preprocessed sentence A with MetaMap, and the two target entities are represented by their CUIs). Finally, the sentence is represented by a feature vector.

Table 3: Features extracted from example sentence A.

Feature names                  | Feature values
Left words                     | l_the, l_most, l_representative
Words between two entity names | m_compounds, m_is, m_involved, m_in, m_antiradical
Right words                    | r_and, r_prooxidant, r_biological
Relationship words             | involved
Protein name distance          | 5
Negative word                  | —

3.2. Graph Kernel. In our experiment, all the sentences are parsed by the Stanford Parser to generate the dependency path and POS path output. A graph kernel calculates the similarity between two input graphs by comparing the relations between common vertices (nodes). The graph kernel used in our method is the all-paths graph kernel proposed by Airola et al. [19]. The kernel represents the target pair using graph matrices based on two subgraphs, where the graph features include all nonzero elements in the graph matrices. The two subgraphs are a parse structure subgraph (PSS) and a linear order subgraph (LOS), as shown in Figure 3. The PSS represents the parse structure of a sentence and includes word or link vertices: a word vertex contains its lemma and its POS, while a link vertex contains its link; additionally, both types of vertices contain their positions relative to the shortest path. The LOS represents the word sequence in the sentence and thus has word vertices, each of which contains its lemma, its relative position to the target pair, and its POS.

For the calculation, two types of matrices are used: a label matrix L and an edge matrix A. The label matrix is a (sparse) N × L matrix, where N is the number of vertices (nodes) and L is the number of labels. It represents the correspondence between labels and vertices, where L_{ij} is equal to 1 if the i-th vertex corresponds to the j-th label and 0 otherwise. The edge matrix is a (sparse) N × N matrix and represents the relation between pairs of vertices, where A_{ij} is a weight w_{ij} if the i-th vertex is connected to the j-th vertex and 0 otherwise. The weight is a predefined constant, whereby the edges on the shortest paths are assigned a weight of 0.9 and the other edges receive a weight of 0.3. Using the Neumann series, a graph matrix G is calculated as

G = L^T (Σ_{n=1}^{∞} A^n) L = L^T ((I − A)^{−1} − I) L.  (1)

This matrix represents the sums of all the path weights between any pair of vertices, resulting in entries representing

BioMed Research International 5


Figure 3: Graph representation generated from an example sentence. The candidate interaction pair is marked as PROT1 and PROT2; the third protein is marked as PROT. The shortest path between the proteins is shown in bold. In the dependency-based subgraph, all nodes on the shortest path are specialized using a post-tag (IP). In the linear order subgraph, the possible tags are before (B), middle (M), and after (A). For the other two candidate pairs in the sentence, graphs with the same structure but different weights and labels would be generated.

the strength of the relation between each pair of vertices. Using two input graph matrices G and G′, the graph kernel k(G, G′) is the sum of the products of the common relations' weights, given by

k(G, G′) = Σ_{i=1}^{L} Σ_{j=1}^{L} G_{ij} G′_{ij}.  (2)
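Equations (1) and (2) can be checked numerically on a toy graph. This is an illustrative sketch, not the paper's code: the label and edge matrices below are made up (N = 3 vertices, L = 2 labels), and it assumes NumPy and a spectral radius of A below 1 so that the Neumann series converges.

```python
import numpy as np

# Toy label matrix (N = 3 vertices, L = 2 labels) and edge matrix with the
# 0.9 (shortest-path edge) / 0.3 (other edge) weights from the text.
L_mat = np.array([[1.0, 0.0],   # vertex 0 carries label 0
                  [0.0, 1.0],   # vertex 1 carries label 1
                  [1.0, 0.0]])  # vertex 2 carries label 0
A = np.array([[0.0, 0.9, 0.0],
              [0.9, 0.0, 0.3],
              [0.0, 0.3, 0.0]])

I = np.eye(3)
# Equation (1): G = L^T (sum_{n>=1} A^n) L = L^T ((I - A)^{-1} - I) L.
# The closed form is valid because the spectral radius of A is < 1 here.
G = L_mat.T @ (np.linalg.inv(I - A) - I) @ L_mat

def graph_kernel(G1, G2):
    """Equation (2): sum of products of corresponding matrix entries."""
    return float(np.sum(G1 * G2))

print(graph_kernel(G, G))  # kernel of the toy graph with itself
```

The resulting G is an L × L matrix whose entries sum the weights of all paths between label pairs, and the kernel is simply the entrywise product sum of two such matrices.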

3.3. Cotraining Algorithm. Cotraining is a semisupervised learning algorithm used when there are only small amounts of labeled data and large amounts of unlabeled data, and it requires two views of the data [20]. It assumes that each example is described by two different feature sets that provide different, complementary information about the instance. Cotraining first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data. In our experiment, the two different feature sets used to describe a sentence are word features and grammatical features (the graph kernel extracted with the Stanford Parser), respectively.
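The cotraining loop just described can be sketched schematically as follows. The names (`cotrain`, `_MeanScore`) are ours, and the toy "classifier" merely thresholds on a mean; in the paper's setting each view would be an SVM over word or graph-kernel features, with each view's most confident predictions extending the other view's training set.

```python
# Schematic cotraining sketch (illustrative only, not the paper's code).
def cotrain(labeled, unlabeled_batches, train_view1, train_view2, top_frac=0.1):
    """labeled: list of (example, label); unlabeled_batches: successive
    pools of unlabeled examples (e.g. 2000, 3000, ... sentences)."""
    t1, t2 = list(labeled), list(labeled)
    for batch in unlabeled_batches:
        m1, m2 = train_view1(t1), train_view2(t2)
        k = int(top_frac * len(batch))  # top/bottom 10% as in the text
        r1 = sorted(batch, key=m1.score, reverse=True)
        r2 = sorted(batch, key=m2.score, reverse=True)
        # Each view's most confident positives/negatives extend the OTHER view.
        t2 += [(x, 1) for x in r1[:k]] + [(x, 0) for x in r1[-k:]]
        t1 += [(x, 1) for x in r2[:k]] + [(x, 0) for x in r2[-k:]]
    return train_view1(t1), train_view2(t2)

class _MeanScore:
    """Toy stand-in classifier: score = distance above the training mean."""
    def __init__(self, data):
        self.mean = sum(x for x, _ in data) / len(data)
    def score(self, x):
        return x - self.mean

m1, m2 = cotrain([(0.0, 0), (1.0, 1)], [[0.1 * i for i in range(20)]],
                 _MeanScore, _MeanScore)
```

Each round retrains both views on their enlarged training sets, mirroring the batch schedule (2000, 3000, ..., 6000 sentences) used in Section 4.2.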

4. Classification Models

4.1. Evaluation Metrics. In our study, the evaluation metrics precision (P), recall (R), and F-score (F) are employed, which are defined as follows:

P = TP/(TP + FP) × 100%,
R = TP/(TP + FN) × 100%,
F = (2 × P × R)/(P + R) × 100%.  (3)

TP is the number of correctly predicted pieces of data, FP is the number of false positives, and FN is the number of false negatives in the test set. P is used to evaluate the accuracy of a model, R is the recall rate, and F is the harmonic mean of the precision and recall rates.
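The three metrics above translate directly into code; the counts in this small sketch are made-up illustrative numbers, not results from the paper.

```python
# Precision, recall, and F-score as defined in equation (3), in percent.
def prf(tp, fp, fn):
    p = tp / (tp + fp) * 100        # precision
    r = tp / (tp + fn) * 100        # recall
    f = 2 * p * r / (p + r)         # harmonic mean of P and R (already in %)
    return p, r, f

p, r, f = prf(90, 10, 30)           # hypothetical confusion counts
print(p, r, round(f, 2))            # 90.0 75.0 81.82
```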

In addition, we define the effective linking terms as the terms in a sentence which can connect the initial term and the target term. For example, in the case of Raynaud's disease and fish oil, the concepts of platelet aggregation and blood vessel contraction are both effective linking terms. The proportion of the effective linking terms in all linking terms is defined as

S = n/N × 100%,  (4)

where n is the number of effective linking terms and N is the number of all the terms which have relationships with the initial terms.

4.2. Training AB Model. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence.

We obtain all the sentences (the corpus) used in our experiment by searching the Semantic Medline Database with 200 different semantic types about "disease or syndrome" defined by MeSH (Medical Subject Headings). Then all the sentences are processed with MetaMap for Named Entity Recognition (NER), and, to limit the semantic types of the initial terms and linking terms in a sentence, we filter out the sentences which contain concepts of the unneeded semantic types. Finally, we obtain a total of 20895 sentences, and we randomly choose only 1000 sentences (about 5% of all the sentences) for manual annotation; we then build two initial labeled data sets, an initial training set Tinitial (500 labeled sentences) and a test set (the other 500 labeled sentences), respectively. There are two reasons why we randomly choose only 1000 sentences for manual annotation. First, manual annotation is very time consuming and expensive, while unlabeled training data are usually much easier to obtain; this is also why we introduce the cotraining method to improve the performance of our experiment. The other reason is that, when cotraining style algorithms are executed, the


Table 4: Annotation values of the two annotators.

                    Reviewer 1
Reviewer 2    Positive    Negative
Positive      385         22
Negative      43          550

Cohen's kappa score: ours, 0.8664; Light et al. [16], 0.68.

number of labeled examples is usually small [21]. So we select about 5% of the corpus as the training set and test set, and the other 95% of the corpus is used for extending the training set later.

During the manual annotation, the following criteria are applied: if a sentence contains a relationship belonging to the semantic type list (see Table 1) between two entity concepts, we consider that there is a positive correlation between the two entities and label the sentence as a positive example. In addition, some special relationships, such as "B in A" and "A can change B", are also classified as positive examples, since they mean that a physiological phenomenon (B) occurs when someone has the disease (A). If there is no obvious relationship in a sentence but only a cooccurrence relationship, we label it as a negative example. Patterns such as "A is a B" and "A and B" are labeled as negative examples, since "A is a B" is an "IS A" relationship and "A and B" is a coordination relationship, and these are not the relationships we need. When the process was completed, we estimated the level of agreement: the Cohen's kappa [22] score between the two annotators, as shown in Table 4, is 0.8664, and content analysis researchers generally regard a Cohen's kappa score of more than 0.8 as good reliability [22]. Comparably, Light et al. achieved their highest kappa value (0.68) in their manual annotation experiments on Medline abstracts [16].

As shown in Figure 4, the process of training the AB model is as follows. At first, two initial SVM classifiers M^0_1 (graph-based kernel) and M^0_2 (feature-based kernel) are trained using the initial training set Tinitial, which contains 500 labeled examples. The test results of the two classifiers are as follows: with classifier M^0_1, the values of precision, recall, and F-score are 72.88%, 83.33%, and 77.76%, respectively; the corresponding results of classifier M^0_2 are 74.35%, 76.33%, and 75.33%.

In the next step, 2000 unlabeled sentences (a section of the unlabeled corpus) are predicted by classifiers M^0_1 and M^0_2, respectively. The classifier M^0_1 selects the top 200 scores (10% of the 2000 results classified by M^0_1) as positive examples and the bottom 200 scores as negative examples; we then add these 400 newly labeled examples to the data set Tinitial (500) to form a new training set T^1_2 (900 examples). At the same time, the classifier M^0_2 selects the top 200 scores (10% of the 2000 results classified by M^0_2) as positive examples and the bottom 200 scores as negative examples; we then add these 400 newly labeled examples to the data set Tinitial (500) to form a new training set T^1_1 (900 examples). Then two new classifiers M^1_1 and M^1_2 are trained from T^1_1 and T^1_2, respectively. After that, 3000 unlabeled sentences (a section of the unlabeled

Figure 4: The process of training the AB model.

corpus) are classified by M^1_1 and M^1_2, respectively. M^1_1 selects the top 300 scores (10% of the 3000 results classified by M^1_1) as positive examples and the bottom 300 scores as negative examples. We use these 600 newly labeled examples to update T^1_2 and obtain a new training set T^2_2, while M^1_2 also selects 600 newly labeled examples to update T^1_1, giving the new training set T^2_1. Such a process is repeated for a preset number of learning rounds.

In our experiment, the preset number of learning rounds

is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively. The size of the training set thus increases from the initial 500 (Tinitial) to 900 (T^1_1 and T^1_2), 1500 (T^2_1 and T^2_2), 2300 (T^3_1 and T^3_2), 3300 (T^4_1 and T^4_2), and 4500 (T^5_1 and T^5_2), respectively. There are two reasons why we set five learning rounds. First, after five learning rounds there are no extra unlabeled sentences: at first, 95% of the corpus (of 20895 sentences) is used for extending the training set, and after five expansions of the training set there are no extra unlabeled data. Second, the cotraining process cannot improve the performance further after a number of rounds [21], so we should terminate cotraining at an appropriate round to avoid wasteful learning rounds.

The best testing result is from the second learning round (the classifier is M^2_1 (graph kernel) and the training set is


Figure 5: The F-scores of the feature-based and graph-based kernels of the AB and BC models in every learning round of the cotraining algorithm.

T^2_1); an F-score of 77.8% is achieved, as shown in Figure 5. After five rounds of cotraining learning, although there are fluctuations, the F-score of the feature-based kernel achieves a continuous improvement and finally reaches 76.82%.

We can see from Figure 5 that the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M^0_1 and M^0_2) trained only on the initial training set (Tinitial). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose M^2_1 as the AB model.

4.3. Training BC Model. The BC model is used to determine whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to the process of training the AB model.

At first, as with the steps in the process of training the AB model, we obtain 20490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (about 5% of the corpus) to construct the initial training set Tinitial (500) and the test set (500). The process of training the BC model is the same as that of training the AB model: the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.

The best result is achieved in the fifth cotraining learning round (the classifier is M^5_2 (graph kernel) and the training set is T^5_2), and the value of the F-score is 89.71%, as shown in Figure 5. The highest F-score (83.3%) of the feature-based kernel is achieved in the third cotraining learning round.

Like the results of the AB model, the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M^0_1 and M^0_2) trained only on the initial training set (Tinitial), which proves the validity of the cotraining method again. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose M^5_1 as the BC model.

In both experiments, the best results of the graph kernel outperform those of the feature-based model. The reason is that not only do the features selected by the graph kernel include most of the features adopted by the feature kernel, but the graph kernel approach also captures the information in unrestricted dependency graphs, which combines syntactic analysis with a representation of the linear order of the sentence and considers all possible paths connecting any two vertices in the resulting graph.

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson hypotheses, that is, "Raynaud's disease and fish oil", "migraine and magnesium", and "Alzheimer's disease and indomethacin". We compare the results generated by our method with the results generated by using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is presented as a benchmark.

5.1. Closed Discovery Experiment. In our study, the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon", and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database, respectively (the time is limited to before 1986, since Swanson found the classic hypothesis in 1986 and from then on the hypothesis is considered public knowledge). Then we obtain all the sentences that contain either the initial terms or the target terms, and all the sentences are processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms, and at the same time we use the BC model to classify all the sentences containing the target terms; finally, we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model and the other is produced by the BC model).
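The intersection step just described can be sketched in a few lines. This is a schematic illustration with made-up sentences and a trivial term extractor; in the real pipeline the B concepts come from MetaMap CUIs and the two sets are the models' positive classifications.

```python
# Schematic closed-discovery sketch (names and data are illustrative).
def closed_discovery(ab_positive_sentences, bc_positive_sentences, linking_term_of):
    """linking_term_of maps a sentence classified positive to its B concept."""
    b_from_a = {linking_term_of(s) for s in ab_positive_sentences}  # AB positives
    b_from_c = {linking_term_of(s) for s in bc_positive_sentences}  # BC positives
    return b_from_a & b_from_c  # effective linking terms

ab = ["raynaud raises blood_viscosity", "raynaud involves platelet_aggregation"]
bc = ["fish_oil lowers blood_viscosity", "fish_oil affects lipids"]
links = closed_discovery(ab, bc, lambda s: s.split()[-1])
print(links)  # {'blood_viscosity'}
```

Only a B term supported from both the A side and the C side survives, which is exactly how "blood viscosity" emerges as the bridge in the Raynaud's disease / fish oil case.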

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms (the two sets of linking terms are both retrieved from SemRep instead of by using the AB model and BC model: one set is retrieved with the initial terms and the other with the target terms). Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson


Table 5: The number and proportion of the effective linking terms.

Hypothesis                           | Method     | Effective linking terms/linking terms | Ratio
Raynaud's disease and fish oils      | Our method | 10/106                                | 9.43%
                                     | SemRep     | 3/132                                 | 2.2%
Migraine and magnesium               | Our method | 67/297                                | 22.56%
                                     | SemRep     | 10/187                                | 5.35%
Alzheimer's disease and indomethacin | Our method | 47/251                                | 18.72%
                                     | SemRep     | 9/87                                  | 10.34%


Figure 6: The result of closed discovery of "Raynaud's disease and fish oils". The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the term found by both methods.

hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method; the results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean that more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of Raynaud's disease with fish oils discovered by the two methods. The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the term found by both methods. Figure 6 shows that our method not only found the linking term "blood viscosity", also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity", and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; we may then make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting vascular reactivity (i.e., vasoconstriction, vascular constriction (function)), which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performance of our machine learning based AB and BC models is much better than that of SemRep, which means our method can return more effective linking terms, and therefore it may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery by using the SemRep Database (Figure 7), first the initial terms (A terms) are specified by providing the concept "Migraine Disorder", which is used as a keyword to retrieve data from the SemRep Database. Then we obtain all the sentences which contain both the initial terms (Migraine


Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils            | Migraine and magnesium      | Alzheimer's disease and indomethacin
Blood viscosity                            | Arteriospasm                | Metabolic rates
Adverse event associated with vascular     | Vasodilation                | Pituitary hormone preparation
Mental concentration                       | Desiccated thyroid          | Desiccated thyroid
Component of protein                       | Homovanillate               | Cerebrovascular circulation
Autoimmune                                 | Beta-adrenoceptor activity  | Normal blood pressure
Mediator brand of benfluorex hydrochloride | Regional blood flow         | P blood group antibodies
Physiological reperfusion                  | Recognition (psychology)    | Dopamine hydrochloride
                                           | Container status identified | HLA-DR antigens
                                           | Hydroxide ion               | Pentose phosphate pathway
                                           | Muscular dystrophy          | Toxic epidermal necrolysis

Figure 7: The procedure of open discovery with the SemRep Database (set the initial terms; retrieve sentences from the SemRep Database; filter sentences and obtain linking terms; retrieve data from the SemRep Database; rank the target terms; output the result).

Disorder) and the linking terms (other entities). Second, we filter all the sentences obtained in step one; the specific rules are the same as those mentioned in closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first step: all the linking terms (extracted from every sentence in the filter step) are used as keywords to retrieve data from the SemRep Database; we then get all the sentences which contain both the linking terms (B terms) and the target terms (C

Figure 8: The procedure of open discovery with our method (set the initial terms; retrieve sentences from the Medline Database; classify with the AB model; filter sentences and obtain linking terms; retrieve data from the Medline Database; classify with the BC model; rank the target terms; output the result).

terms). At last, we rank all the target terms and output the results; scoring rules [24] are applied to rank the target terms.

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two


Table 7: Results of open discovery.

Initial terms                             | SemRep  | Our method | Combined result set
Raynaud disease and eicosapentaenoic acid | 68/380  | 230/5762   | 61/246
Migraine disorders and magnesium          | 186/535 | 97/5349    | 25/297
Alzheimer's disease and indomethacin      | —/650   | —/2639     | —/275
Alzheimer's disease and indoprofen        | 373/650 | 250/2639   | 38/275

processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database, and in the second step all the sentences are classified by either the AB model or the BC model; we then collect the set of positive examples as the experimental data. The other processing (e.g., filtering and ranking) is the same as that of the SemRep method.


The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms we obtained, and the number to the left of the slash is the ranking of the target term we really want to obtain. For example, from the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, and the ranking of magnesium is 186, while 5349 target terms are discovered with our method, and the ranking of magnesium is 97. As can be seen from the above results, our method not only obtains more potential target terms but also achieves a higher ranking of the real target terms.

The higher the ranking of the real target term is, the more valuable the results become, and more potential target terms mean more clues about the pathophysiology of a disease. For example, from the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither of the two methods finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method finds

Table 8: More target terms about Alzheimer's disease.

Target terms   | Alzheimer's disease
Aspirin        | 1370/2639
Thromboxane    | 1455/2639
Carprofen      | 2291/2639
Prostaglandins | 2488/2639

more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by the SemRep method and our method, respectively; then we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the terms are ranked by this score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have a higher ranking than both original methods. The reason is that, in the intersection of the two result sets, many potential target terms are eliminated, while the real target terms are not, since such terms usually have a real association with the initial term and are therefore retained by both methods.
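The combination step described above can be sketched as follows; the term names and scores in this example are made up for illustration, and the paper's actual scoring rules come from [24].

```python
# Schematic sketch of combining two ranked result sets (illustrative only).
def combine(semrep_scores, our_scores):
    """Each argument maps target term -> score; higher scores rank higher."""
    common = semrep_scores.keys() & our_scores.keys()   # keep the intersection
    summed = {t: semrep_scores[t] + our_scores[t] for t in common}
    return sorted(summed, key=summed.get, reverse=True)  # re-rank by summed score

semrep = {"magnesium": 7.0, "calcium": 3.0, "zinc": 1.0}
ours = {"magnesium": 9.0, "calcium": 2.0, "serotonin": 8.0}
print(combine(semrep, ours))  # ['magnesium', 'calcium']
```

Terms returned by only one method (here "zinc" and "serotonin") are dropped, which shrinks the candidate list while a real target term, supported by both methods, keeps its place near the top.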

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses through mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method, respectively. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there


exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, the machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature.

The experimental results on the three classic Swanson hypotheses show that our approach can achieve better performance than SemRep. This means our approach can discover more potential correlations between initial terms and target terms, and therefore it may help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.

[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526–557, 1988.

[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157–186, 1990.

[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548–557, 2001.

[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252–259, 2003.

[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396–413, 2004.

[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600–611, 2006.

[8] D. Cameron, O. Bodenreider, H. Yalamanchili et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.

[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.

[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552–1558, Sierre, Switzerland, March 2010.

[11] D. Hristovski, C. Friedman, T. C. Rindflesch et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349–353, 2006.

[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218–225, IEEE, October 2012.

[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209–220, Maui, Hawaii, USA, January 2007.

[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.

[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.

[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the HLT-NAACL Workshop on Linking Biological Literature, Ontologies and Databases, Detroit, Mich, USA, 2004.

[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics, 2003.

[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.

[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, supplement 11, article S2, 2008.

[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[21] W Wang and Z H Zhou ldquoAnalyzing co-training style algo-rithmsrdquo in Machine Learning ECML 2007 vol 4701 of LectureNotes in Computer Science pp 454ndash465 Springer BerlinGermany 2007

[22] J Carletta ldquoAssessing agreement on classification tasks thekappa statisticrdquo Computational Linguistics vol 22 no 2 pp249ndash254 1996

12 BioMed Research International

[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.
[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.
[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.
[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.
[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.
[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.
[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.
[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.
[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development," Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.


Figure 3: Graph representation generated from an example sentence. The candidate interaction pair is marked as PROT1 and PROT2; the third protein is marked as PROT. The shortest path between the proteins is shown in bold. In the dependency-based subgraph, all nodes on a shortest path are specialized using a post-tag (IP). In the linear order subgraph, possible tags are before (B), middle (M), and after (A). For the other two candidate pairs in the sentence, graphs with the same structure but different weights and labels would be generated.

the strength of the relation between each pair of vertices. Using two input graph matrices G and G′, the graph kernel k(G, G′) is the sum of the products of the common relations' weights, given by

k(G, G') = \sum_{i=1}^{L} \sum_{j=1}^{L} G_{ij} G'_{ij}.  (2)
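Equation (2) is just the sum of elementwise products of the two graph weight matrices. As a minimal illustration (the matrices below are toy values, not taken from the paper):

```python
import numpy as np

def graph_kernel(G, G_prime):
    """k(G, G') of Eq. (2): sum over i, j of G_ij * G'_ij."""
    G, G_prime = np.asarray(G), np.asarray(G_prime)
    assert G.shape == G_prime.shape
    return float((G * G_prime).sum())

# Toy 3x3 weight matrices: 0.9 for shortest-path edges, 0.3 for other edges
G  = [[0.0, 0.9, 0.3],
      [0.0, 0.0, 0.9],
      [0.0, 0.0, 0.0]]
Gp = [[0.0, 0.9, 0.0],
      [0.0, 0.0, 0.3],
      [0.0, 0.0, 0.0]]
print(graph_kernel(G, Gp))  # 0.9*0.9 + 0.9*0.3 ≈ 1.08
```

Only relations present in both graphs contribute, so the kernel value grows with the amount of shared (weighted) structure.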

3.3. Cotraining Algorithm. Cotraining is a semisupervised learning algorithm used when there are only small amounts of labeled data and large amounts of unlabeled data, and it requires two views of the data [20]. It assumes that each example is described using two different feature sets that provide different, complementary information about the instance. Cotraining first learns a separate classifier for each view using the labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data. In our experiment, the two feature sets used to describe a sentence are word features and grammatical features (the graph kernel extracted with the Stanford Parser), respectively.
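The loop described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the two SVMs are replaced by a tiny nearest-centroid classifier so the example stays self-contained, and the data, chunk sizes, and `top_frac` selection rate are all made up.

```python
import numpy as np

class CentroidClassifier:
    """Tiny stand-in for the paper's per-view SVM classifiers."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self
    def decision_function(self, X):
        # Positive score = closer to the positive-class centroid
        return np.linalg.norm(X - self.c0, axis=1) - np.linalg.norm(X - self.c1, axis=1)
    def predict(self, X):
        return (self.decision_function(X) > 0).astype(int)

def cotrain(X1, X2, y, labeled_idx, chunks, top_frac=0.1):
    """Each view's classifier pseudo-labels its most confident positives and
    negatives in every unlabeled chunk, extending the *other* view's pool."""
    pool1 = [(i, y[i]) for i in labeled_idx]  # (index, label) pairs for view 1
    pool2 = [(i, y[i]) for i in labeled_idx]  # ... and for view 2
    m1 = m2 = None
    for chunk in chunks:
        m1 = CentroidClassifier().fit(X1[[i for i, _ in pool1]],
                                      np.array([l for _, l in pool1]))
        m2 = CentroidClassifier().fit(X2[[i for i, _ in pool2]],
                                      np.array([l for _, l in pool2]))
        k = max(1, int(top_frac * len(chunk)))
        # m1's confident predictions extend pool2, and vice versa
        for model, view, pool in ((m1, X1, pool2), (m2, X2, pool1)):
            order = np.argsort(model.decision_function(view[chunk]))
            pool += [(int(chunk[j]), 1) for j in order[-k:]]  # top scores -> positive
            pool += [(int(chunk[j]), 0) for j in order[:k]]   # bottom scores -> negative
    return m1, m2

# Synthetic two-view data: class-1 examples are shifted by +2 in every feature
rng = np.random.default_rng(0)
n = 200
y = np.arange(n) % 2
X1 = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(n, 3))
X2 = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(n, 3))
m1, m2 = cotrain(X1, X2, y, labeled_idx=range(10),
                 chunks=[np.arange(10, 100), np.arange(100, 200)])
accuracy = float((m1.predict(X1) == y).mean())
print(accuracy)  # well above the 0.5 chance level on this toy data
```

The key design point, as in [20], is that the two views disagree in informative ways: each classifier teaches the other with examples it is confident about but the other view may find hard.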

4. Classification Models

4.1. Evaluation Metrics. In our study, the evaluation metrics precision (P), recall (R), and F-score (F) are employed, which are defined as follows:

P = \frac{TP}{TP + FP} \times 100\%,

R = \frac{TP}{TP + FN} \times 100\%,

F = \frac{2 \times P \times R}{P + R}.  (3)

TP is the number of correctly predicted positive examples, FP is the number of false positives, and FN is the number of false negatives in the test set. P is used to evaluate the accuracy of a model, R is the recall rate, and F is the harmonic mean of the precision and recall rates.
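For a concrete check of Eq. (3), the three metrics follow directly from the confusion counts (the counts below are hypothetical):

```python
def precision_recall_f(tp, fp, fn):
    """Eq. (3): precision and recall as percentages; F is their harmonic mean."""
    p = tp / (tp + fp) * 100
    r = tp / (tp + fn) * 100
    f = 2 * p * r / (p + r)
    return p, r, f

p, r, f = precision_recall_f(tp=85, fp=15, fn=25)  # hypothetical test-set counts
print(round(p, 2), round(r, 2), round(f, 2))  # 85.0 77.27 80.95
```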

In addition, we define the effective linking terms as the terms in a sentence which can connect the initial term and the target term. For example, in the case of Raynaud's disease and fish oil, the concepts of platelet aggregation and blood vessel contraction are all effective linking terms. The proportion of the effective linking terms among all linking terms is defined as

S = \frac{n}{N} \times 100\%,  (4)

where n is the number of effective linking terms and N is the number of all the terms which have relationships with the initial terms.

4.2. Training AB Model. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence.

We obtain all the sentences (the corpus) used in our experiment by searching the Semantic Medline Database for 200 different semantic types related to "disease or syndrome" defined by MeSH (Medical Subject Headings). Then all the sentences are processed with MetaMap for Named Entity Recognition (NER), and, to limit the semantic types of the initial terms and linking terms in a sentence, we filter out the sentences which contain concepts of unneeded semantic types. Finally, we obtain a total of 20,895 sentences, and we randomly choose only 1000 sentences (about 5% of all the sentences) for manual annotation; from these we build two initial labeled data sets, an initial training set Tinitial (500 labeled sentences) and a test set (the other 500 labeled sentences). There are two reasons why we randomly choose only 1000 sentences for manual annotation. First, manual annotation is very time-consuming and expensive, while unlabeled training data are usually much easier to obtain; this is also why we introduce the cotraining method to improve the performance of our experiment. Second, when cotraining-style algorithms are executed, the


Table 4: Annotation values of the two annotators.

Reviewer 2 \ Reviewer 1    Positive    Negative
Positive                   385         22
Negative                   43          550

Kappa score: ours, 0.8664; Light et al. [16], 0.68.

number of labeled examples is usually small [21], so we select about 5% of the corpus as the training set and test set, and the other 95% of the corpus is used for extending the training set later.

During the manual annotation, the following criteria are applied. If a sentence contains a relationship belonging to the semantic type list (see Table 1) between two entity concepts, we consider there to be a positive correlation between the two entities and label the sentence as a positive example. In addition, some special relationships, such as "B in A" and "A can change B," are also classified as positive examples, since they mean that a physiological phenomenon (B) occurs when someone has the disease (A). If there is no obvious relationship in a sentence but only a cooccurrence relationship, we label it as a negative example. Patterns such as "A is a B" and "A and B" are also labeled as negative examples, since "A is a B" is an "IS A" relationship and "A and B" is a coordination relationship, and these are not the relationships we need. When the process was completed, we estimated the level of agreement: the Cohen's kappa [22] score between the two annotators, as shown in Table 4, is 0.8664, and content analysis researchers generally regard a Cohen's kappa score of more than 0.8 as good reliability [22]. Comparably, Light et al. achieved their highest kappa value (0.68) in their manual annotation experiments on Medline abstracts [16].
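The reported kappa can be reproduced directly from the 2×2 agreement counts in Table 4. A quick check:

```python
def cohens_kappa(both_pos, r2pos_r1neg, r2neg_r1pos, both_neg):
    """Cohen's kappa [22] for a 2x2 annotator-agreement table."""
    n = both_pos + r2pos_r1neg + r2neg_r1pos + both_neg
    p_o = (both_pos + both_neg) / n  # observed agreement
    # Chance agreement from the two annotators' marginal distributions
    p_e = ((both_pos + r2pos_r1neg) * (both_pos + r2neg_r1pos)
           + (both_neg + r2neg_r1pos) * (both_neg + r2pos_r1neg)) / n ** 2
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa(385, 22, 43, 550), 4))  # 0.8664, matching Table 4
```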

As shown in Figure 4, the process of training the AB model is as follows. At first, two initial SVM classifiers M⁰₁ (graph-based kernel) and M⁰₂ (feature-based kernel) are trained using the initial training set Tinitial, which contains 500 labeled examples. The test results of the two classifiers are as follows: with classifier M⁰₁, the values of precision, recall, and F-score are 72.88%, 83.33%, and 77.76%, respectively; with classifier M⁰₂, the results are 74.35%, 76.33%, and 75.33%, respectively.

In the next step, 2000 unlabeled sentences (a section of the unlabeled corpus) are predicted by classifiers M⁰₁ and M⁰₂, respectively. The classifier M⁰₁ selects the top 200 scores (10% of the 2000 results classified by M⁰₁) as positive examples and the bottom 200 scores as negative examples; then we put these 400 newly labeled examples into the data set Tinitial (500) to form a new training set T¹₂ (900 examples). At the same time, the classifier M⁰₂ selects the top 200 scores (10% of the 2000 results classified by M⁰₂) as positive examples and the bottom 200 scores as negative examples; then we put these 400 newly labeled examples into the data set Tinitial (500) to form a new training set T¹₁ (900 examples). Then two new classifiers M¹₁ and M¹₂ are trained from T¹₁ and T¹₂, respectively. After that, 3000 unlabeled sentences (a section of the unlabeled

Figure 4: The process of training the AB model.

corpus) are classified by M¹₁ and M¹₂, respectively. M¹₁ selects the top 300 scores (10% of the 3000 results classified by M¹₁) as positive examples and the bottom 300 scores as negative examples. We use these 600 newly labeled examples to update T¹₂ and obtain a new training set T²₂, while M¹₂ also selects 600 newly labeled examples to update T¹₁, yielding the new training set T²₁. Such a process is repeated for a preset number of learning rounds.

In our experiment, the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively. The size of the training set thus increases from the initial 500 (Tinitial) to 900 (T¹₁ and T¹₂), 1500 (T²₁ and T²₂), 2300 (T³₁ and T³₂), 3300 (T⁴₁ and T⁴₂), and 4500 (T⁵₁ and T⁵₂), respectively. There are two reasons why we set five learning rounds. First, after five learning rounds there are no extra unlabeled sentences: 95% of the corpus (of 20,895 sentences) is used for extending the training set, and after five expansions of the training set there are no extra unlabeled data. Second, the cotraining process cannot improve the performance further after a number of rounds [21], and we should terminate cotraining at an appropriate round to avoid wasteful learning rounds.

The best testing result is from the second learning round (the classifier is M²₁ (graph kernel) and the training set is


Figure 5: The results of training the AB and BC models in every learning round by the cotraining algorithm (F-score versus learning round for the feature-based and graph-based kernels of the AB and BC models).

T²₁); an F-score of 77.8% is achieved, as shown in Figure 5. After five rounds of cotraining learning, although there are fluctuations, the F-score of the feature-based kernel achieves a continuous improvement and finally reaches 76.82%.

We can see from both figures that the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M⁰₁ and M⁰₂) trained only on the initial training set (Tinitial). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose M²₁ as the AB model.

4.3. Training BC Model. The BC model is used to determine whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to the process of training the AB model.

At first, as with the steps in the process of training the AB model, we obtain 20,490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (about 5% of the corpus) to construct the initial training set Tinitial (500) and the test set (500). The training procedure is the same as that of the AB model: the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.

The best result is achieved in the fifth cotraining learning round (the classifier is M⁵₂ (graph kernel) and the training set is T⁵₂), with an F-score of 89.71%, as shown in Figure 5. The highest F-score of the feature-based kernel (83.3%) is achieved in the third cotraining learning round.

Like the results of the AB model, the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M⁰₁ and M⁰₂) trained only on the initial training set (Tinitial), which again proves the validity of the cotraining method. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose M⁵₂ as the BC model.

In both experiments, the best results of the graph kernel outperform those of the feature-based model. The reason is that not only do the features selected by the graph kernel include most of the features adopted by the feature kernel, but the graph kernel approach also captures the information in unrestricted dependency graphs, combining syntactic analysis with a representation of the linear order of the sentence and considering all possible paths connecting any two vertices in the resulting graph.

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson hypotheses, that is, "Raynaud's disease and fish oil," "migraine and magnesium," and "Alzheimer's disease and indomethacin." We compare the results generated by our method with the results generated by using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is presented as a benchmark.

5.1. Closed Discovery Experiment. In our study, the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon," and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database (the time is limited to before 1986, since Swanson found the classic hypothesis in 1986 and from then on the hypothesis is considered public knowledge). We thus obtain all the sentences that contain either the initial terms or the target terms, and all the sentences are then processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms, and at the same time we use the BC model to classify all the sentences containing the target terms; finally, we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model and the other is produced by the BC model).
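Computationally, this final step is a plain set intersection over the two models' positive predictions. The toy decisions below are illustrative; only "blood viscosity" is borrowed from the paper's Raynaud's example:

```python
# Hypothetical positive/negative decisions from the two classifiers
ab_predictions = {"blood viscosity": True,       # disease -> phenomenon sentences
                  "platelet aggregation": True,
                  "equilibrium": False}
bc_predictions = {"blood viscosity": True,       # substance -> phenomenon sentences
                  "vasodilation": True}

ab_positive = {t for t, pos in ab_predictions.items() if pos}
bc_positive = {t for t, pos in bc_predictions.items() if pos}

# Effective linking terms = intersection of the two positive sets
effective_linking_terms = ab_positive & bc_positive
print(sorted(effective_linking_terms))  # ['blood viscosity']
```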

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms retrieved from SemRep instead of using the AB and BC models (one set is retrieved with the initial terms and the other with the target terms). Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson


Table 5: The number and scale of the useful linking concepts.

Hypothesis                              Method       Effective linking terms/linking terms   Ratio
Raynaud's disease and fish oils         Our method   10/106                                  9.43%
                                        SemRep       3/132                                   2.2%
Migraine and magnesium                  Our method   67/297                                  22.56%
                                        SemRep       10/187                                  5.35%
Alzheimer's disease and indomethacin    Our method   47/251                                  18.72%
                                        SemRep       9/87                                    10.34%

Figure 6: The result of closed discovery of "Raynaud's disease and fish oils." The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods.

hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method; the results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean that more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of "Raynaud's disease with fish oils" discovered by the two methods. The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods. Figure 6 shows that our method not only found the linking term "blood viscosity," which was also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity," and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; then we may make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting vascular reactivity (i.e., vasoconstriction, vascular constriction (function)), which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performance of our machine learning based AB and BC models is much better than that of SemRep, which means our method can return more effective linking terms, and therefore it may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery using the SemRep Database (Figure 7), first the initial terms (A terms) are specified by providing the concept "Migraine Disorder," which is used as a keyword to retrieve data from the SemRep Database. We thus obtain all the sentences which contain both the initial terms (Migraine


Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils: blood viscosity; adverse event associated with vascular; mental concentration; component of protein; autoimmune; Mediator brand of benfluorex hydrochloride; physiological reperfusion.

Migraine and magnesium: arteriospasm; vasodilation; desiccated thyroid; homovanillate; beta-adrenoceptor activity; regional blood flow; recognition (psychology); container status identified; hydroxide ion; muscular dystrophy.

Alzheimer's disease and indomethacin: metabolic rates; pituitary hormone preparation; desiccated thyroid; cerebrovascular circulation; normal blood pressure; P blood group antibodies; dopamine hydrochloride; HLA-DR antigens; pentose phosphate pathway; toxic epidermal necrolysis.

Figure 7: The procedure of open discovery with the SemRep Database (set the initial terms; retrieve sentences from the SemRep Database; filter the sentences and obtain linking terms; retrieve data from the SemRep Database again with the linking terms; rank the target terms; output the result).

Disorder) and the linking terms (other entities). Second, we filter all the sentences obtained in step one; the specific rules are the same as those mentioned in closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first step: all the linking terms (extracted from every sentence in the filtering step) are used as keywords to retrieve data from the SemRep Database, and we get all the sentences which contain both the linking terms (B terms) and the target terms (C

Figure 8: The procedure of open discovery with our method (set the initial terms; retrieve sentences from the Medline Database; classify them with the AB model; filter the sentences and obtain linking terms; retrieve data from the Medline Database again; classify with the BC model; rank the target terms; output the result).

terms). At last, we rank all the target terms and output the results; the scoring rules of [24] are applied to rank the target terms.

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two


Table 7: Result of open discovery (rank of the real target term/total number of potential target terms).

Initial terms                                  SemRep    Our method   Combined result set
Raynaud disease and eicosapentaenoic acid      68/380    230/5762     61/246
Migraine disorders and magnesium               186/535   97/5349      25/297
Alzheimer's disease and indomethacin           —/650     —/2639       —/275
Alzheimer's disease and indoprofen             373/650   250/2639     38/275

processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database, and in the second step all the sentences are classified by either the AB model or the BC model; we then collect the set of positive examples as experimental data. The other processing (e.g., filtering and ranking) is the same as that of the SemRep method.


The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms we obtained, and the number to the left of the slash is the ranking of the target term we actually want to obtain. For example, in the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, and the ranking of magnesium is 186, while 5349 target terms are discovered with our method, and the ranking of magnesium is 97. As can be seen from the above results, our method not only obtains more potential target terms but also achieves a higher ranking of the real target terms.

The higher the ranking of the real target terms is, the more valuable the results become, and more potential target terms mean more clues about the pathophysiology of disease. For example, in the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither of the two methods finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method finds

Table 8: More target terms about Alzheimer's disease.

Target terms     Alzheimer's disease
Aspirin          1370/2639
Thromboxane      1455/2639
Carprofen        2291/2639
Prostaglandins   2488/2639

more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by the SemRep method and our method, respectively; then we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the term is ranked by this combined score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have a higher ranking than both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms are not, since such terms usually have a real association with the initial term and are therefore retained by both methods.
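The combination rule above (intersect the two result sets, sum the scores, re-rank) can be sketched as follows; all scores and most terms here are hypothetical:

```python
def combine_results(scores_a, scores_b):
    """Keep only target terms returned by both methods; a term's combined score
    is the sum of its scores in the two original result sets."""
    common = scores_a.keys() & scores_b.keys()
    combined = {term: scores_a[term] + scores_b[term] for term in common}
    # Rank by combined score, best first
    return sorted(combined, key=combined.get, reverse=True)

semrep_scores = {"eicosapentaenoic acid": 0.41, "ketanserin": 0.73, "nifedipine": 0.55}
our_scores    = {"eicosapentaenoic acid": 0.88, "prostacyclin": 0.91, "nifedipine": 0.12}
print(combine_results(semrep_scores, our_scores))  # ['eicosapentaenoic acid', 'nifedipine']
```

Terms found by only one method ("ketanserin", "prostacyclin") are dropped by the intersection, which is exactly how the combined result set shrinks while the real target terms survive.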

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses through mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method, respectively. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there

BioMed Research International 11

exists an entity (target term) having physiological effects(linking term) on human beings in a sentence Comparedwith the concept cooccurrence and grammar engineering-based approaches like SemRep machine learning based ABand BC models can achieve better performance in miningassociation between bioentities from texts

In addition the cotraining algorithm is introduced toimprove the performance of the two models Then throughcombining the two models the approach reconstructs theABC model and generates biomedical hypotheses from lit-erature

The experimental results on the three classic Swansonrsquoshypotheses show that our approach can achieve better perfor-mance than SemRep This means our approach can discovermore potential correlations between the initial terms andtarget terms and therefore it may help find more potentialprinciples for the treatment of certain diseases

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the grants from the NaturalScience Foundation of China (nos 61070098 61272373 and61340020) Trans-Century Training Program Foundation forthe Talents by theMinistry of Education of China (NCET-13-0084) and the Fundamental Research Funds for the CentralUniversities (nos DUT13JB09 and DUT14YQ213)




6 BioMed Research International

Table 4: Annotation values of the two annotators.

Reviewer 2 positive / Reviewer 1 positive: 385. Reviewer 2 positive / Reviewer 1 negative: 22.
Reviewer 2 negative / Reviewer 1 positive: 43. Reviewer 2 negative / Reviewer 1 negative: 550.
Kappa score: ours 0.8664; Light et al. [16] 0.68.

number of labeled examples is usually small [21], so we select about 5% of the corpus as the training set and test set, and the other 95% of the corpus is used for extending the training set later.

During the manual annotation the following criteria are met: if a sentence contains a relationship belonging to the semantic type list (see Table 1) between two entity concepts, we consider there to be a positive correlation between the two entities and label the sentence as a positive example. In addition, some special relationships, such as "B in A" and "A can change B," are also classified as positive examples, since they mean that a physiological phenomenon (B) occurs when someone has the disease (A). If there is no obvious relationship in a sentence, only a cooccurrence relationship, we label it as a negative example. Patterns such as "A is a B" and "A and B" are also labeled as negative examples, since "A is a B" is an "IS-A" relationship and "A and B" is a coordination relationship, and these are not the relationships we need. When the process was completed, we estimated the level of agreement: the Cohen's kappa [22] score between the two annotators, shown in Table 4, is 0.8664, and content analysis researchers generally regard a Cohen's kappa score of more than 0.8 as good reliability [22]. Comparably, Light et al. achieved their highest kappa value (0.68) in their manual annotation experiments on Medline abstracts [16].
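As a check, Cohen's kappa can be computed directly from the 2x2 agreement counts in Table 4; the sketch below uses exactly those counts and reproduces the reported value:

```python
# Cohen's kappa from the Table 4 agreement counts between the two annotators.
# Rows: Reviewer 2 (positive, negative); columns: Reviewer 1 (positive, negative).
counts = [[385, 22],
          [43, 550]]

total = sum(sum(row) for row in counts)              # 1000 annotated sentences
p_observed = (counts[0][0] + counts[1][1]) / total   # fraction on which both agree

# Marginal probabilities of the "positive" label for each annotator.
r2_pos = (counts[0][0] + counts[0][1]) / total
r1_pos = (counts[0][0] + counts[1][0]) / total
p_expected = r2_pos * r1_pos + (1 - r2_pos) * (1 - r1_pos)  # chance agreement

kappa = (p_observed - p_expected) / (1 - p_expected)
print(round(kappa, 4))  # 0.8664, matching Table 4
```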

As shown in Figure 4, the process of training the AB model is as follows. At first, two initial SVM classifiers M_1^0 (graph-based kernel) and M_2^0 (feature-based kernel) are trained using the initial training set T_initial, which contains 500 labeled examples. The test results of the two classifiers are as follows: with classifier M_1^0, the values of precision, recall, and F-score are 72.88%, 83.33%, and 77.76%, respectively; the corresponding results of classifier M_2^0 are 74.35%, 76.33%, and 75.33%.

Figure 4: The process of training the AB model.

In the next step, 2000 unlabeled sentences (a section of the unlabeled corpus) are predicted by classifiers M_1^0 and M_2^0, respectively. The classifier M_1^0 selects the top 200 scores (10% of the 2000 results classified by M_1^0) as positive examples and the bottom 200 scores as negative examples; we then put these 400 newly labeled examples into the data set T_initial (500) to form a new training set T_2^1 (900 examples). At the same time, the classifier M_2^0 selects the top 200 scores (10% of the 2000 results classified by M_2^0) as positive examples and the bottom 200 scores as negative examples; these 400 newly labeled examples are likewise added to T_initial (500) to form a new training set T_1^1 (900 examples). Then two new classifiers M_1^1 and M_2^1 are trained from T_1^1 and T_2^1, respectively. After that, 3000 unlabeled sentences (another section of the unlabeled corpus) are classified by M_1^1 and M_2^1, respectively. M_1^1 selects the top 300 scores (10% of the 3000 results classified by M_1^1) as positive examples and the bottom 300 scores as negative examples; the 600 newly labeled examples are used to update T_2^1, giving a new training set T_2^2, while M_2^1 also selects 600 newly labeled examples to update T_1^1, giving the new training set T_1^2. Such a process is repeated for a preset number of learning rounds.

In our experiment the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively. The size of the training set thus increases from the initial 500 (T_initial) to 900 (T_1^1 and T_2^1), 1500 (T_1^2 and T_2^2), 2300 (T_1^3 and T_2^3), 3300 (T_1^4 and T_2^4), and 4500 (T_1^5 and T_2^5), respectively. There are two reasons why we set five learning rounds. First, after five learning rounds there are no extra unlabeled sentences: at the start, 95% of the corpus (20895 sentences) is available for extending the training set, and after five expansions of the training set no unlabeled data remain. Second, the cotraining process cannot improve the performance further after a number of rounds [21], so we should terminate cotraining at an appropriate round to avoid wasteful learning rounds.
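The selection rule and the training-set bookkeeping above can be sketched as follows. This is a schematic sketch, not the paper's implementation: the scores would come from the SVM classifiers, and `select_confident` / `cotrain_sizes` are illustrative helper names.

```python
def select_confident(scores, fraction=0.10):
    """Indices of the most confident positives (top fraction of scores)
    and the most confident negatives (bottom fraction)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, int(len(scores) * fraction))
    return ranked[:k], ranked[-k:]

def cotrain_sizes(initial_size=500, batches=(2000, 3000, 4000, 5000, 6000)):
    """Growth of each training set: every round, each view labels the top 10%
    (positives) and bottom 10% (negatives) of the batch for the other view."""
    size = initial_size
    history = [size]
    for batch in batches:
        size += 2 * (batch // 10)  # e.g. 400 new examples for a batch of 2000
        history.append(size)
    return history

print(cotrain_sizes())  # [500, 900, 1500, 2300, 3300, 4500]
```

The printed sequence reproduces the training-set sizes reported in the text (500, 900, 1500, 2300, 3300, 4500).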

The best testing result is from the second learning round (the classifier is M_1^2 (graph kernel) and the training set is T_1^2): an F-score of 77.8% is achieved, as shown in Figure 5. After five rounds of cotraining, although there are fluctuations, the F-score of the feature-based kernel improves continuously and finally reaches 76.82%.

Figure 5: The results of training the AB and BC models in every learning round by the cotraining algorithm (feature-based and graph-based kernels of both models).

We can see from Figure 5 that the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M_1^0 and M_2^0) trained only on the initial training set (T_initial). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose M_1^2 as the AB model.

4.3. Training the BC Model. The BC model is used to determine whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to that of training the AB model.

At first, as with the steps in training the AB model, we obtain 20490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (about 5% of the corpus) to construct the initial training set T_initial (500) and the test set (500). The preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.

The best result is achieved in the fifth cotraining round (the classifier is M_2^5 (graph kernel) and the training set is T_2^5), with an F-score of 89.71%, as shown in Figure 5. The highest F-score of the feature-based kernel (83.3%) is achieved in the third cotraining round.

Like the results of the AB model, the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M_1^0 and M_2^0) trained only on the initial training set (T_initial), which again confirms the validity of the cotraining method. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose M_1^5 as the BC model.

In both experiments the best results of the graph kernel outperform those of the feature-based model. The reason is twofold: the features selected by the graph kernel include most of the features adopted by the feature kernel, and, in addition, the graph kernel approach captures the information in unrestricted dependency graphs, which combines syntactic analysis with a representation of the linear order of the sentence and considers all possible paths connecting any two vertices in the resulting graph.
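The "all paths between any two vertices" idea can be illustrated by counting paths with adjacency-matrix powers. This is a toy sketch of the underlying intuition only, not the actual all-paths graph kernel of Airola et al. [19], which additionally weights paths and works on weighted dependency graphs:

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def path_counts(adj, max_len):
    """Total number of paths of length 1..max_len between every vertex pair:
    the (i, j) entry of A + A^2 + ... + A^max_len."""
    n = len(adj)
    total = [[0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # A^1
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        power = matmul(power, adj)   # next power of A
    return total

# Tiny dependency graph: edges 0-1 and 1-2 (undirected).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
print(path_counts(adj, 2)[0][2])  # 1: one path of length 2 connects vertices 0 and 2
```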

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson hypotheses, that is, "Raynaud's disease and fish oil," "migraine and magnesium," and "Alzheimer's disease and indomethacin." We compare the results generated by our method with those generated using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is used as a benchmark.

5.1. Closed Discovery Experiment. In our study the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon," and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database (the time is limited to before 1986, since Swanson found the classic hypothesis in 1986, and from then on the hypothesis is considered public knowledge). We thus obtain all the sentences that contain either the initial terms or the target terms, and all the sentences are processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms, and at the same time the BC model is used to classify all the sentences containing the target terms; finally, we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model and the other is produced by the BC model).
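The final step reduces to a set intersection over the terms extracted from each model's positive sentences. A minimal sketch (the term lists here are illustrative, not the paper's actual model output):

```python
# Linking terms extracted from sentences the AB model judged positive
# (disease -> physiological phenomenon).
ab_positive = {"blood viscosity", "vasodilation", "platelet aggregation"}

# Linking terms extracted from sentences the BC model judged positive
# (substance -> physiological phenomenon).
bc_positive = {"blood viscosity", "vasodilation", "serum lipids"}

# Effective linking terms are those connecting the initial term (A)
# to the target term (C), i.e. appearing in both positive sets.
effective_linking_terms = ab_positive & bc_positive
print(sorted(effective_linking_terms))  # ['blood viscosity', 'vasodilation']
```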

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms both retrieved from SemRep instead of using the AB and BC models (one set is retrieved with the initial terms and the other with the target terms). Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method; the results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

Table 5: The number and scale of the useful linking concepts (effective linking terms/linking terms, and their ratio).
Raynaud's disease and fish oils: our method 10/106 (9.43%); SemRep 3/132 (2.2%).
Migraine and magnesium: our method 67/297 (22.56%); SemRep 10/187 (5.35%).
Alzheimer's disease and indomethacin: our method 47/251 (18.72%); SemRep 9/87 (10.34%).

Figure 6: The result of closed discovery of "Raynaud's disease and fish oils." The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean that more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of "Raynaud's disease with fish oils" discovered by the two methods. Our method not only found the linking term "blood viscosity," also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity," and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; we may then form the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting the vascular reactivity (i.e., vasoconstriction, vascular constriction (function)) which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models can usually achieve better performance in information extraction from texts. In our experiments the performance of our machine learning based AB and BC models is much better than that of SemRep, which means our method can return more effective linking terms, and therefore it may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery using the SemRep Database (Figure 7), first, the initial terms (A terms) are specified by providing the concept "Migraine Disorder," which is used as a keyword to retrieve data from the SemRep Database; we thereby obtain all the sentences which contain both the initial term (Migraine Disorder) and linking terms (other entities). Second, we filter all the sentences obtained in step one; the specific rules are the same as those mentioned for closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first: all the linking terms (extracted from every sentence in the filtering step) are used as keywords to retrieve data from the SemRep Database, and we get all the sentences which contain both linking terms (B terms) and target terms (C terms). At last, we rank all the target terms and output the results; the scoring rules of [24] are applied to rank the target terms.

Table 6: Part of the effective linking terms discovered with our method.
Raynaud's disease and fish oils: blood viscosity; adverse event associated with vascular; mental concentration; component of protein; autoimmune; Mediator brand of benfluorex hydrochloride; physiological reperfusion.
Migraine and magnesium: arteriospasm; vasodilation; desiccated thyroid; homovanillate; beta-adrenoceptor activity; regional blood flow; recognition (psychology); container status identified; hydroxide ion; muscular dystrophy.
Alzheimer's disease and indomethacin: metabolic rates; pituitary hormone preparation; desiccated thyroid; cerebrovascular circulation; normal blood pressure; P blood group antibodies; dopamine hydrochloride; HLA-DR antigens; pentose phosphate pathway; toxic epidermal necrolysis.

Figure 7: The procedure of open discovery with the SemRep Database (set the initial terms; retrieve sentences from the SemRep Database; filter sentences and obtain linking terms; retrieve data from the SemRep Database; rank the target terms; output the result).

Figure 8: The procedure of open discovery with our method (set the initial terms; retrieve sentences from the Medline Database; classify with the AB model; filter sentences and obtain linking terms; retrieve data from the Medline Database; classify with the BC model; rank the target terms; output the result).



Table 7: Results of open discovery (ranking of the real target term / total number of potential target terms).
Raynaud disease and eicosapentaenoic acid: SemRep 68/380; our method 230/5762; combined result set 61/246.
Migraine disorders and magnesium: SemRep 186/535; our method 97/5349; combined result set 25/297.
Alzheimer's disease and indomethacin: SemRep —/650; our method —/2639; combined result set —/275.
Alzheimer's disease and indoprofen: SemRep 373/650; our method 250/2639; combined result set 38/275.

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database, and in the second step all the sentences are classified by either the AB model or the BC model; we then collect the set of positive examples as experimental data. The other procedures (filtering and ranking) are the same as in the SemRep method.

The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms obtained, and the number to the left of the slash is the ranking of the target term we actually want to obtain. For example, in the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, with magnesium ranked 186, while 5349 target terms are discovered with our method, with magnesium ranked 97. As can be seen from these results, our method not only obtains more potential target terms but also gives a higher ranking to the real target terms.

The higher the ranking of the real target term, the more valuable the result becomes, and more potential target terms mean more clues about the pathophysiology of a disease. For example, in the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither method finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method provides more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

Table 8: More target terms about Alzheimer's disease (ranking/total).
Aspirin: 1370/2639. Thromboxane: 1455/2639. Carprofen: 2291/2639. Prostaglandins: 2488/2639.

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by SemRep and by our method, respectively; then we retain the intersection of the two result sets. For each term in the intersection, its score is set to the sum of its scores in the original result sets, and the terms are ranked by this combined score. The detailed results are also shown in Table 7, and it can be seen that the rankings from the "combined result set" are higher than those of both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms are not, since such terms usually have a real association with the initial term and are therefore retained by both methods.
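The combination rule above can be sketched as follows. The per-term scores here are hypothetical placeholders (the paper does not list score values), so only the intersect-sum-rerank logic is meaningful:

```python
def combine_result_sets(scores_a, scores_b):
    """Keep only target terms returned by both methods; for each, sum the two
    scores and rank the surviving terms by the combined score (descending)."""
    common = scores_a.keys() & scores_b.keys()
    combined = {term: scores_a[term] + scores_b[term] for term in common}
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical target-term scores from each method.
semrep_scores = {"magnesium": 4.0, "calcium": 6.0, "histamine": 1.0}
our_scores = {"magnesium": 5.5, "calcium": 2.0, "serotonin": 7.0}

print(combine_result_sets(semrep_scores, our_scores))  # ['magnesium', 'calcium']
```

Terms found by only one method ("histamine," "serotonin") are dropped, which is exactly why the intersection prunes spurious candidates while real target terms, found by both methods, survive.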

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses by mining this literature.

In this paper we present a supervised learning based approach to generating hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with concept cooccurrence and grammar engineering-based approaches like SemRep, the machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, by combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature.

The experimental results on the three classic Swansonrsquoshypotheses show that our approach can achieve better perfor-mance than SemRep This means our approach can discovermore potential correlations between the initial terms andtarget terms and therefore it may help find more potentialprinciples for the treatment of certain diseases

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the grants from the NaturalScience Foundation of China (nos 61070098 61272373 and61340020) Trans-Century Training Program Foundation forthe Talents by theMinistry of Education of China (NCET-13-0084) and the Fundamental Research Funds for the CentralUniversities (nos DUT13JB09 and DUT14YQ213)

References

[1] D R Swanson ldquoFish oil Raynaudrsquos syndrome and undiscov-ered public knowledgerdquo Perspectives in Biology ampMedicine vol30 no 1 pp 7ndash18 1986

[2] D R Swanson ldquoMigraine and magnesium eleven neglectedconnectionsrdquo Perspectives in Biology and Medicine vol 31 no4 pp 526ndash557 1988

[3] D R Swanson ldquoSomatomedin C and arginine implicit con-nections between mutually isolated literaturesrdquo Perspectives inBiology and Medicine vol 33 no 2 pp 157ndash186 1990

[4] M Weeber H Klein L T W de Jong-van den Berg and RVos ldquoUsing concepts in literature-based discovery simulatingSwansonrsquos Raynaud-fish oil andmigraine-magnesium discover-iesrdquo Journal of the American Society for Information Science andTechnology vol 52 no 7 pp 548ndash557 2001

[5] M Weeber R Vos H Klein L T W de Jong-van den BergA R Aronson and G Molema ldquoGenerating hypotheses bydiscovering implicit associations in the literature a case reportof a search for new potential therapeutic uses for thalidomiderdquoJournal of the AmericanMedical Informatics Association vol 10no 3 pp 252ndash259 2003

[6] P Srinivasan ldquoText mining generating hypotheses fromMED-LINErdquo Journal of the American Society for Information Scienceand Technology vol 55 no 5 pp 396ndash413 2004

[7] M Yetisgen-Yildiz and W Pratt ldquoUsing statistical andknowledge-based approaches for literature-based discoveryrdquo

Journal of Biomedical Informatics vol 39 no 6 pp 600ndash6112006

[8] D Cameron O Bodenreider H Yalamanchili et al ldquoA graph-based recovery and decomposition of Swansonrsquos hypothesisusing semantic predicationsrdquo Journal of Biomedical Informaticsvol 46 no 2 pp 238ndash251 2013

[9] X Hu X Zhang I Yoo X Wang and J Feng ldquoMininghidden connections among biomedical concepts from disjointbiomedical literature sets through semantic-based associationrulerdquo International Journal of Intelligent Systems vol 25 no 2pp 207ndash223 2010

[10] T Miyanishi K Seki and K Uehara ldquoHypothesis generationand ranking based on event similaritiesrdquo in Proceedings of the25th Annual ACM Symposium on Applied Computing (SACrsquo 10)pp 1552ndash1558 Sierre Switzerland March 2010

[11] D Hristovski C Friedman T C Rindflesch et al ldquoExploit-ing semantic relations for literature-based discoveryrdquo AMIAAnnual Symposium Proceedings vol 2006 pp 349ndash353 2006

[12] T Cohen D Widdows R W Schvaneveldt and T C Rind-flesch ldquoDiscovery at a distance farther journeys in predicationspacerdquo in Proceedings of the IEEE International Conference onBioinformatics and Biomedicine Workshops (BIBMW rsquo12) pp218ndash225 IEEE October 2012

[13] C B Ahlers M Fiszman D Demner-Fushman F Langand T C Rindflesch ldquoExtracting semantic predications fromMEDLINE citations for pharmacogenomicsrdquo in Proceedings ofthe Pacific Symposium on Biocomputing pp 209ndash220 MauiHawaii USA January 2007

[14] M Krallinger F Leitner C Rodriguez-Penagos and A Valen-cia ldquoOverview of the protein-protein interaction annotationextraction task of BioCreative IIrdquo Genome Biology vol 9supplement 2 article S4 2008

[15] H Kilicoglu D Shin M Fiszman G Rosemblat and TC Rindflesch ldquoSemMedDB a PubMed-scale repository ofbiomedical semantic predicationsrdquo Bioinformatics vol 28 no23 pp 3158ndash3160 2012

[16] M Light X Y Qiu and P Srinivasan ldquoThe language ofbioscience facts speculations and statements in betweenrdquo inProceedings of the Workshop on Linking Biological LiteratureOntologies and Databases HLT-NAACL Ed Detroit MichUSA 2004

[17] D Klein and C D Manning ldquoAccurate unlexicalized parsingrdquoin Proceedings of the 41st Annual Meeting on Association forComputational Linguistics vol 1 pp 423ndash430 Association forComputational Linguistics 2003

[18] N Cristianini and J Shawe-Taylor An Introduction to SupportVector Machines and Other Kernel-Based Learning MethodsCambridge University Press New York NY USA 2000

[19] A Airola S Pyysalo J Bjorne T Pahikkala F Ginter and TSalakoski ldquoAll-paths graph kernel for protein-protein interac-tion extraction with evaluation of cross-corpus learningrdquo BMCBioinformatics vol 9 no 11 article 52 2008

[20] A Blum and T Mitchell ldquoCombining labeled and unlabeleddata with co-trainingrdquo in Proceedings of the 11th Annual Con-ference on Computational Learning Theory pp 92ndash100 1998

[21] W Wang and Z H Zhou ldquoAnalyzing co-training style algo-rithmsrdquo in Machine Learning ECML 2007 vol 4701 of LectureNotes in Computer Science pp 454ndash465 Springer BerlinGermany 2007

[22] J Carletta ldquoAssessing agreement on classification tasks thekappa statisticrdquo Computational Linguistics vol 22 no 2 pp249ndash254 1996

12 BioMed Research International

[23] R A DiGiacomo J M Kremer and D M Shah ldquoFish-oil dietary supplementation in patients with Raynaudrsquos phe-nomenon a double-blind controlled prospective studyrdquo TheAmerican Journal of Medicine vol 86 no 2 pp 158ndash164 1989

[24] J D Wren R Bekeredjian J A Stewart R V Shohet and H RGarner ldquoKnowledge discovery by automated identification andranking of implicit relationshipsrdquo Bioinformatics vol 20 no 3pp 389ndash398 2004

[25] P L McGeer and E G McGeer ldquoNSAIDs and Alzheimerdisease epidemiological animal model and clinical studiesrdquoNeurobiology of Aging vol 28 no 5 pp 639ndash647 2007

[26] J L Eriksen S A Sagi T E Smith et al ldquoNSAIDs and enanti-omers of flurbiprofen target 120574-secretase and lower A12057342 invivordquo Journal of Clinical Investigation vol 112 no 3 pp 440ndash449 2003

[27] SWeggen J L Eriksen P Das et al ldquoA subset of NSAIDs loweramyloidogenicA12057342 independently of cyclooxygenase activityrdquoNature vol 414 no 6860 pp 212ndash216 2001

[28] Y Avramovich T Amit and M B H Youdim ldquoNon-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor proteinrdquo The Journal of BiologicalChemistry vol 277 no 35 pp 31466ndash31473 2002

[29] D L Simmons R M Botting and T Hla ldquoCyclooxygenaseisozymes the biology of prostaglandin synthesis and inhibi-tionrdquo Pharmacological Reviews vol 56 no 3 pp 387ndash437 2004

[30] J Rogers L C Kirby S R Hempelman et al ldquoClinical trial ofindomethacin in Alzheimerrsquos diseaserdquo Neurology vol 43 no 8pp 1609ndash1611 1993

[31] S Weggen M Rogers and J Eriksen ldquoNSAIDs smallmolecules for prevention of Alzheimerrsquos disease or precursorsfor future drug developmentrdquo Trends in Pharmacological Sci-ences vol 28 no 10 pp 536ndash543 2007


BioMed Research International 7

Figure 5: The results of training the AB and BC models in every learning round of the cotraining algorithm. The plot shows the F-score (roughly 0.75 to 0.90) against the learning round (initial, 1st to 5th) for four curves: the feature-based kernel of the AB model, the feature-based kernel of the BC model, the graph-based kernel of the AB model, and the graph-based kernel of the BC model.

T₁²), an F-score of 77.8 is achieved, as shown in Figure 5. After five rounds of cotraining learning, although there are fluctuations, the F-score of the feature-based kernel improves overall and finally reaches 76.82.

We can see from both figures that the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M₁⁰ and M₂⁰) trained only on the initial training set (Tinitial). The results show that it is feasible to adopt a cotraining algorithm to improve the effectiveness of extracting relationships from a sentence. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our AB model in the following experiments; specifically, we choose M₁² as the AB model.

4.3. Training BC Model. The BC model is used to determine whether there exists an entity (target term) that has physiological effects (linking term) on human beings in a sentence. The process of training the BC model is similar to the process of training the AB model.

At first, as with the steps in the process of training the AB model, we obtain 20490 unlabeled sentences from SemMedDB as the corpus, and we randomly choose 1000 sentences (approximately 5% of the corpus) to construct the initial training set Tinitial (500 sentences) and the test set (500 sentences). The process of training the BC model is the same as that of training the AB model: the preset number of learning rounds is five, and the sizes of the corresponding extended sets are 2000, 3000, 4000, 5000, and 6000, respectively.
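The round-by-round procedure described above follows the cotraining scheme of Blum and Mitchell [20]. A minimal sketch of one way to implement such a loop, assuming two feature views per sentence and caller-supplied learners (the function names, data shapes, and round sizes here are illustrative, not the paper's actual code):

```python
def cotrain(labeled, unlabeled, train_a, train_b, rounds=5, add_per_round=1000):
    """Cotraining sketch: two classifiers, one per feature view, label
    the unlabeled pool each round; the most confident predictions are
    promoted into a shared (extended) training set.

    labeled:   list of ((view_a, view_b), label) seed examples
    unlabeled: list of (view_a, view_b) examples without labels
    train_a/b: learner for each view; maps [(view, label), ...] to a
               classifier view -> (label, confidence)
    Returns the two classifiers trained in the final round.
    """
    pool, extended = list(unlabeled), list(labeled)
    clf_a = clf_b = None
    for _ in range(rounds):
        # Retrain one classifier per feature view on the current set.
        clf_a = train_a([(va, y) for (va, _), y in extended])
        clf_b = train_b([(vb, y) for (_, vb), y in extended])
        # Score every pooled example with both classifiers.
        scored = []
        for ex in pool:
            for label, conf in (clf_a(ex[0]), clf_b(ex[1])):
                scored.append((conf, ex, label))
        # Promote the most confident predictions to training examples.
        scored.sort(key=lambda t: t[0], reverse=True)
        for _, ex, label in scored[:add_per_round]:
            if ex in pool:
                pool.remove(ex)
                extended.append((ex, label))
    return clf_a, clf_b
```

The key design point mirrored from [20] is that confidence on the unlabeled pool, not gold labels, drives the growth of the training set, so each view bootstraps the other.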

The best result is achieved in the fifth cotraining learning round (the classifier is M₂⁵ (graph kernel) and the training set is T₂⁵), and the F-score is 89.71, as shown in Figure 5. The highest F-score of the feature-based kernel (83.3) is achieved in the third cotraining learning round.

Like the results of the AB model, the best results of the two models (the feature-based model and the graph kernel-based model) are both better than the results of the models (M₁⁰ and M₂⁰) trained only on the initial training set (Tinitial), which proves the validity of the cotraining method again. Meanwhile, since the model based on the graph kernel outperforms the feature-based model, we choose the former as our BC model in the following experiments; specifically, we choose M₂⁵ as the BC model.

In both experiments, the best results of the graph kernel outperform those of the feature-based model. The reason is that not only do the features selected by the graph kernel include most of the features adopted by the feature-based kernel, but the graph kernel approach also captures the information in unrestricted dependency graphs: it combines syntactic analysis with a representation of the linear order of the sentence and considers all possible paths connecting any two vertices in the resulting graph.
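The "all possible paths" intuition can be illustrated with a heavily simplified, all-paths-style kernel. This is a toy sketch inspired by [19], not the actual kernel used in the paper: path weights between every vertex pair are accumulated with a truncated Neumann series, and two graphs are compared through matching (start, end) vertex-label pairs.

```python
def all_paths_weights(adj, max_len=10):
    """Approximate the Neumann series sum_k A^k: total weight of all
    paths (up to max_len edges) between every vertex pair. Edge weights
    should be < 1 so longer paths contribute less."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # current power A^k, starting at A^1
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        # Next power: power @ adj (plain-Python matrix product).
        power = [[sum(power[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return total

def graph_kernel(adj1, adj2, labels1, labels2):
    """Toy all-paths-style kernel: sum products of path weights over
    vertex pairs whose (start, end) labels match across the graphs."""
    w1, w2 = all_paths_weights(adj1), all_paths_weights(adj2)
    k = 0.0
    for i1, a in enumerate(labels1):
        for j1, b in enumerate(labels1):
            for i2, c in enumerate(labels2):
                for j2, d in enumerate(labels2):
                    if (a, b) == (c, d):
                        k += w1[i1][j1] * w2[i2][j2]
    return k
```

In the real kernel of [19], the graphs are dependency graphs of sentences with richer vertex labels, and the series is computed in closed form; the sketch only shows why downweighted path sums reward structurally similar sentences.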

5. Hypothesis Generation and Discussion

After the AB and BC models have been built, we combine them to reconstruct the ABC model and validate the three classic Swanson's hypotheses, that is, "Raynaud's disease and fish oil," "migraine and magnesium," and "Alzheimer's disease and indomethacin." We compare the results generated by our method with the results generated by using the SemRep Database in the closed discovery and open discovery processes, respectively. In our study, the result of hypothesis discovery achieved by SemRep is presented as a benchmark.

5.1. Closed Discovery Experiment. In our study, the hypothesis Raynaud's disease and fish oil is used as an example to verify the effectiveness of our method. First, the initial terms are set to "Raynaud's disease" and "Raynaud's phenomenon," and the target terms are set to "fish oil" and "eicosapentaenoic acid" (an important active ingredient of fish oil). The initial terms are used as keywords to retrieve all the sentences from the Medline Database, respectively (the time is limited to before 1986, since Swanson found the classic hypothesis in 1986, and from then on the hypothesis is considered public knowledge). Then we obtain all the sentences that contain either the initial terms or the target terms, and all the sentences are processed by MetaMap.

Secondly, we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the general concept list. Then the AB model is adopted to classify all the sentences containing the initial terms, and at the same time we use the BC model to classify all the sentences containing the target terms; finally, we obtain the effective linking terms by taking the intersection of the two sets of positive examples (one set is from the AB model, and the other is produced by the BC model).
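The linking step just described amounts to a set intersection over the two models' positive outputs. A minimal sketch (the function name and data shapes are illustrative, not taken from the paper's implementation):

```python
def effective_linking_terms(ab_positive, bc_positive):
    """Closed-discovery linking step (sketch).

    ab_positive: dict mapping each sentence judged positive by the AB
                 model to the candidate linking terms it contains
    bc_positive: the same, for sentences judged positive by the BC model
    Returns the effective linking terms: B terms supported both from the
    initial-term (A) side and from the target-term (C) side.
    """
    from_a = set().union(*ab_positive.values()) if ab_positive else set()
    from_c = set().union(*bc_positive.values()) if bc_positive else set()
    return from_a & from_c
```

A term such as "blood viscosity" would survive only if at least one AB-positive sentence and at least one BC-positive sentence both mention it.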

The only difference between the SemRep method and our method is that its effective linking terms are obtained by taking the intersection of two sets of linking terms (the two sets of linking terms are both retrieved from SemRep instead of being produced by the AB model and the BC model: one set is retrieved with the initial terms, and the other is retrieved with the target terms). Figure 6 shows the results obtained by the two methods. In addition, the other two classic Swanson's


Table 5: The number and proportion of the useful linking concepts.

Hypothesis | Method | Effective linking terms/linking terms | Ratio
Raynaud's disease and fish oils | Our method | 10/106 | 9.43%
Raynaud's disease and fish oils | SemRep | 3/132 | 2.2%
Migraine and magnesium | Our method | 67/297 | 22.56%
Migraine and magnesium | SemRep | 10/187 | 5.35%
Alzheimer's disease and indomethacin | Our method | 47/251 | 18.72%
Alzheimer's disease and indomethacin | SemRep | 9/87 | 10.34%

Figure 6: The result of closed discovery of "Raynaud's disease and fish oils." The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods.

hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method, and the results are shown in Table 5; part of the effective linking terms discovered by our method is shown in Table 6.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean that more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of "Raynaud's disease with fish oils" discovered by the two methods. The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods. Figure 6 shows that our method not only found the linking term "blood viscosity," which was also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity," and at the same time from [PMID 6982164] we extracted "vasodilation inhibits Raynaud syndrome"; then we may make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting vascular reactivity (i.e., vasoconstriction, vascular constriction (function)), which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performances of our machine learning based AB and BC models are much better than that of SemRep, which means our method can return more effective linking terms and therefore may help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery by using the SemRep Database (Figure 7), first, the initial terms (A terms) are specified by providing the concept "Migraine Disorder," which is used as a keyword to retrieve data from the SemRep Database. Then we obtain all the sentences which contain both the initial terms (Migraine


Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils: blood viscosity; adverse event associated with vascular; mental concentration; component of protein; autoimmune; Mediator brand of benfluorex hydrochloride; physiological reperfusion.

Migraine and magnesium: arteriospasm; vasodilation; desiccated thyroid; homovanillate; beta-adrenoceptor activity; regional blood flow; recognition (psychology); container status identified; hydroxide ion; muscular dystrophy.

Alzheimer's disease and indomethacin: metabolic rates; pituitary hormone preparation; desiccated thyroid; cerebrovascular circulation; normal blood pressure; P blood group antibodies; dopamine hydrochloride; HLA-DR antigens; pentose phosphate pathway; toxic epidermal necrolysis.

Figure 7: The procedure of open discovery with the SemRep Database. The steps shown are: set the initial terms; retrieve sentences from the SemRep Database; filter sentences and obtain linking terms; retrieve data from the SemRep Database; rank the target terms; output the result.

Disorder) and linking terms (other entities). Second, we filter all the sentences obtained from step one; the specific rules are the same as those mentioned in the closed discovery experiment (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first step: all the linking terms (extracted from every sentence in the filtering step) are used as keywords to retrieve data from the SemRep Database, and then we get all the sentences which contain both linking terms (B terms) and target terms (C terms). At last, we rank all the target terms and output the results. The scoring rules of [24] are applied to rank the target terms.

Figure 8: The procedure of open discovery with our method. The steps shown are: set the initial terms; retrieve sentences from the Medline Database; classify the sentences with the AB model; filter sentences and obtain linking terms; retrieve data from the Medline Database; classify the sentences with the BC model; rank the target terms; output the result.

Table 7: Results of open discovery. Each cell shows the ranking of the real target term/the total number of potential target terms.

Initial terms | SemRep | Our method | Combined result set
Raynaud disease and eicosapentaenoic acid | 68/380 | 230/5762 | 61/246
Migraine disorders and magnesium | 186/535 | 97/5349 | 25/297
Alzheimer's disease and indomethacin | —/650 | —/2639 | —/275
Alzheimer's disease and indoprofen | 373/650 | 250/2639 | 38/275

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database; in the second step, all the sentences are classified by either the AB model or the BC model, and we collect the set of positive examples as experimental data. The other procedures (filtering and ranking) are the same as in the SemRep method.
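The two-stage classify-and-rank pipeline of Figure 8 can be summarized in a structural sketch. Here the retrieval function, the two models, and the scoring rule are caller-supplied stand-ins (the paper uses the Medline Database, the trained AB/BC classifiers, and the scoring rules of [24]); none of the names below come from the original implementation.

```python
def open_discovery(initial_term, retrieve, ab_model, bc_model, score):
    """Open-discovery pipeline sketch: A term -> B terms -> ranked C terms.

    retrieve(term) yields (sentence, terms_in_sentence) pairs;
    ab_model/bc_model judge a sentence positive or negative;
    score(sentence, b, c) assigns a relevance score to a candidate.
    """
    # Stage 1: sentences mentioning the initial term (A), kept only if
    # the AB model judges the relation positive; collect linking terms.
    linking_terms = set()
    for sentence, terms in retrieve(initial_term):
        if ab_model(sentence):
            linking_terms.update(terms)
    # Stage 2: for every linking term (B), collect candidate target
    # terms (C) from BC-positive sentences and accumulate their scores.
    scores = {}
    for b in linking_terms:
        for sentence, terms in retrieve(b):
            if bc_model(sentence):
                for c in terms:
                    scores[c] = scores.get(c, 0.0) + score(sentence, b, c)
    # Stage 3: rank the candidate target terms by accumulated score.
    return sorted(scores, key=scores.get, reverse=True)
```

Swapping `retrieve` between a SemRep-backed and a Medline-backed source, and the two classifiers between SemRep predications and the AB/BC models, is exactly the difference between Figures 7 and 8.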

The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms we obtained, and the number to the left of the slash is the ranking of the target term we really want to obtain. For example, in the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, and the ranking of magnesium is 186, while 5349 target terms are discovered with our method, and the ranking of magnesium is 97. As can be seen from these results, our method not only obtains more potential target terms but also achieves a higher ranking of the real target terms.

The higher the ranking of the real target term is, the more valuable the result becomes, and more potential target terms mean more clues about the pathophysiology of a disease. For example, in the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither of the two methods finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and our method also finds more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

Table 8: More target terms about Alzheimer's disease (ranking/total number of potential target terms).

Target terms | Alzheimer's disease
Aspirin | 1370/2639
Thromboxane | 1455/2639
Carprofen | 2291/2639
Prostaglandins | 2488/2639

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by SemRep and by our method, respectively; then, we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the terms are ranked by this combined score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have a higher ranking than those of both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms are not eliminated, since such terms usually have a real association with the initial term and are therefore retained by both methods.
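Under the stated setup, where each method yields a scored result set, the combination step can be sketched as follows (names are illustrative):

```python
def combine_rankings(scores_a, scores_b):
    """Combine two methods' result sets (sketch of the procedure above).

    scores_a/scores_b: dicts mapping target terms to the score each
    method assigned. Terms outside the intersection are dropped; a
    surviving term's score is the sum of its two original scores.
    Returns the intersection ranked by combined score, best first.
    """
    common = scores_a.keys() & scores_b.keys()
    combined = {t: scores_a[t] + scores_b[t] for t in common}
    return sorted(combined, key=combined.get, reverse=True)
```

Restricting to the intersection is what prunes spurious candidates: a term scored highly by only one method disappears, while a real target term, supported by both, survives with an increased score.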

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses by mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into an AB model and a BC model and constructs the two models with a supervised learning method. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there exists an entity (target term) that has physiological effects (linking term) on human beings in a sentence. Compared with concept cooccurrence and grammar engineering-based approaches like SemRep, the machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, by combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from the literature.

The experimental results on the three classic Swanson's hypotheses show that our approach can achieve better performance than SemRep. This means our approach can discover more potential correlations between initial terms and target terms, and therefore it may help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.

[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526–557, 1988.

[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157–186, 1990.

[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud–fish oil and migraine–magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548–557, 2001.

[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252–259, 2003.

[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396–413, 2004.

[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600–611, 2006.

[8] D. Cameron, O. Bodenreider, H. Yalamanchili et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.

[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.

[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552–1558, Sierre, Switzerland, March 2010.

[11] D. Hristovski, C. Friedman, T. C. Rindflesch et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349–353, 2006.

[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218–225, IEEE, October 2012.

[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209–220, Maui, Hawaii, USA, January 2007.

[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.

[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.

[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the HLT-NAACL 2004 Workshop on Linking Biological Literature, Ontologies and Databases, Detroit, Mich, USA, 2004.

[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics, 2003.

[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.

[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, supplement 11, article S2, 2008.

[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[21] W. Wang and Z. H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 454–465, Springer, Berlin, Germany, 2007.

[22] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249–254, 1996.

[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.

[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.

[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.

[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.

[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.

[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.

[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.

[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.

[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?" Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.


8 BioMed Research International

Table 5: The number and scale of the useful linking concepts.

Hypothesis                            | Method     | Effective linking terms / linking terms | Ratio
Raynaud's disease and fish oils       | Our method | 10/106  | 9.43%
                                      | SemRep     | 3/132   | 2.2%
Migraine and magnesium                | Our method | 67/297  | 22.56%
                                      | SemRep     | 10/187  | 5.35%
Alzheimer's disease and indomethacin  | Our method | 47/251  | 18.72%
                                      | SemRep     | 9/87    | 10.34%

[Figure 6: concept network omitted; nodes include Raynaud disease, fish oils, blood viscosity, physiological reperfusion, Mediator brand of benfluorex hydrochloride, autoimmune, diabetic, encounter due to therapy, mental concentration, adverse event associated with vascular, antimicrobial susceptibility, and equilibrium.]

Figure 6: The result of closed discovery of "Raynaud's disease and fish oils." The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods.

hypotheses, migraine and magnesium and Alzheimer's disease and indomethacin, are also verified with our method. The results are shown in Table 5, and part of the effective linking terms discovered by our method are shown in Table 6.

As can be seen from Table 5, compared with the results obtained by the SemRep method, our method significantly improves the number and proportion of effective linking terms. The ratios of our method are much higher than those of the SemRep method, and more effective linking terms mean more potentially useful hypotheses are generated. For example, Figure 6 shows the potential principles in the treatment of "Raynaud's disease with fish oils" discovered by the two methods. The purple nodes in the figure are the linking terms discovered only by our method, the yellow nodes represent the linking terms found only by the SemRep Database, and the blue node represents the terms found by both methods. Figure 6 shows that our method not only found the linking term "blood viscosity," also found by SemRep, but also found more effective linking terms. For example, the intermediate purple node "adverse event associated with vascular" was found by our approach from the abstracts [PMID 2340647] and [PMID 6982164]. From [PMID 2340647] our approach extracted "dietary fish oil inhibits vascular reactivity," and at the same time, from [PMID 6982164], we extracted "vasodilation inhibits Raynaud syndrome." We may then make the hypothesis that dietary fish oil treats Raynaud's syndrome by inhibiting vascular reactivity (i.e., vasoconstriction, vascular constriction (function)), which causes Raynaud's disease, and this hypothesis has been verified in medical experiments [23].
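The closed-discovery step described above can be sketched as a simple join: B terms extracted on the disease side (AB model output) are intersected with B terms extracted on the substance side (BC model output), and the shared B terms are the candidate linking terms. The term strings and pairs below are toy illustrative data, not the paper's actual extractions.

```python
# Toy sketch of the closed-discovery join over AB and BC extractions.
ab_pairs = {  # (A term, B term): disease causes/affects phenomenon
    ("raynaud disease", "blood viscosity"),
    ("raynaud disease", "vascular reactivity"),
}
bc_pairs = {  # (B term, C term): substance affects phenomenon
    ("blood viscosity", "fish oil"),
    ("vascular reactivity", "fish oil"),
    ("triglycerides", "fish oil"),
}

def linking_terms(ab, bc, a_term, c_term):
    """Return the B terms that bridge the fixed A term and C term."""
    b_from_a = {b for a, b in ab if a == a_term}
    b_from_c = {b for b, c in bc if c == c_term}
    return b_from_a & b_from_c

print(sorted(linking_terms(ab_pairs, bc_pairs, "raynaud disease", "fish oil")))
# -> ['blood viscosity', 'vascular reactivity']
```

Any B term surviving the intersection plays the role that "blood viscosity" played in Swanson's original Raynaud's/fish-oil discovery.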

The reason why our method can discover more effective linking terms than SemRep is that, compared with grammar engineering-based approaches like SemRep, machine learning based models usually achieve better performance in information extraction from texts. In our experiments, the performance of our machine learning based AB and BC models is much better than that of SemRep, which means our method can return more effective linking terms, and it may therefore help find more potential principles in hypothesis generation.

5.2. Open Discovery Experiment. In this paper, the process of finding the hypothesis migraine and magnesium is used as an illustration to explain the procedure of open discovery.

The processes of open discovery with the different methods are shown in Figures 7 and 8, respectively. In the process of open discovery using the SemRep Database (Figure 7), first the initial terms (A terms) are specified by providing the concept "Migraine Disorder," which is used as a keyword to retrieve data from the SemRep Database. Then we obtain all the sentences which contain both initial terms (Migraine


Table 6: Part of the effective linking terms discovered with our method.

Raynaud's disease and fish oils: blood viscosity; adverse event associated with vascular; mental concentration; component of protein; autoimmune; Mediator brand of benfluorex hydrochloride; physiological reperfusion.

Migraine and magnesium: arteriospasm; vasodilation; desiccated thyroid; homovanillate; beta-adrenoceptor activity; regional blood flow; recognition (psychology); container status identified; hydroxide ion; muscular dystrophy.

Alzheimer's disease and indomethacin: metabolic rates; pituitary hormone preparation; desiccated thyroid; cerebrovascular circulation; normal blood pressure; P blood group antibodies; dopamine hydrochloride; HLA-DR antigens; pentose phosphate pathway; toxic epidermal necrolysis.

[Figure 7: flowchart omitted; steps: set the initial terms → retrieve sentences from the SemRep Database → filtering (filter sentences and obtain linking terms) → retrieve data from the SemRep Database → ranking (rank the target terms) → output the result.]

Figure 7: The procedure of open discovery with the SemRep Database.

Disorder) and linking terms (other entities). Second, we filter all the sentences obtained from step one; the specific rules are the same as those mentioned in closed discovery (we filter out the sentences which do not contain any concept belonging to the semantic type list or which contain concepts in the broad concept list). The third step is similar to the first step: all the linking terms (extracted from every sentence from the filter step) are used as keywords to retrieve data from the SemRep Database, and then we get all the sentences which contain both linking terms (B terms) and target terms (C

[Figure 8: flowchart omitted; steps: set the initial terms → retrieve sentences from the Medline Database → classified by AB model → filtering (filter sentences and obtain linking terms) → retrieve data from the Medline Database → classified by BC model → ranking (rank the target terms) → output the result.]

Figure 8: The procedure of open discovery with our method.

terms). At last, we rank all the target terms and output the results. The scoring rules [24] are applied to rank the target terms.
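The loop just described (A term → filtered B terms → filtered C terms → ranked output) can be sketched with toy data. `PREDICATIONS` and `ALLOWED_TYPES` below are illustrative stand-ins for the SemRep Database and the paper's semantic-type filter, not the actual resources, and the final ranking here is a simple frequency count rather than the scoring rules of [24].

```python
# Hedged sketch of the open-discovery loop with toy stand-in data.
from collections import Counter

# subject term -> list of (object term, object semantic-type code)
PREDICATIONS = {
    "migraine disorder": [("vasospasm", "patf"), ("history", "inpr")],
    "vasospasm": [("magnesium", "elii"), ("calcium", "elii")],
}
ALLOWED_TYPES = {"patf", "elii"}  # the semantic-type filter of step two

def open_discovery(a_term):
    # steps one/two: retrieve and filter the B (linking) terms
    b_terms = {obj for obj, t in PREDICATIONS.get(a_term, ())
               if t in ALLOWED_TYPES}
    # step three: use each B term to retrieve candidate C (target) terms
    c_counts = Counter()
    for b in b_terms:
        for obj, t in PREDICATIONS.get(b, ()):
            if t in ALLOWED_TYPES:
                c_counts[obj] += 1
    # final step: rank the target terms (here simply by frequency)
    return c_counts.most_common()

print(open_discovery("migraine disorder"))
```

In the toy run above, "vasospasm" is the only B term that survives filtering, and "magnesium" and "calcium" emerge as ranked C candidates.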

The process of open discovery with the AB-BC model is shown in Figure 8. The differences between the two


Table 7: Result of open discovery.

Initial terms                              | SemRep  | Our method | Combined result set
Raynaud disease and eicosapentaenoic acid  | 68/380  | 230/5762   | 61/246
Migraine disorders and magnesium           | 186/535 | 97/5349    | 25/297
Alzheimer's disease and indomethacin       | –/650   | –/2639     | –/275
Alzheimer's disease and indoprofen         | 373/650 | 250/2639   | 38/275

processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep Database (as shown in the blue dashed boxes in Figure 7), two steps are applied in our method to obtain the data: first, we obtain raw sentences from the Medline Database, and in the second step all the sentences are classified by either the AB model or the BC model; then we collect the set of positive examples as experimental data. The other processing steps (e.g., filtering and ranking) are the same as those of the SemRep method.

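The yellow-box steps of Figure 8, retrieve raw sentences and keep only those the trained model labels positive, can be sketched as below. The keyword-cue stub is a placeholder for the paper's supervised AB model, and the sentences are toy data.

```python
# Illustrative sketch of classifier-based sentence filtering (Figure 8).
def stub_ab_classifier(sentence):
    # placeholder for the AB model: does the disease cause the phenomenon?
    cues = ("causes", "induces", "leads to")
    return any(cue in sentence.lower() for cue in cues)

def collect_positives(sentences, classifier):
    """Step two: keep only the sentences the model classifies as positive."""
    return [s for s in sentences if classifier(s)]

retrieved = [  # step one: raw sentences from the Medline Database (toy data)
    "Migraine induces cerebral vasospasm in some patients.",
    "Migraine was first described many centuries ago.",
]
print(collect_positives(retrieved, stub_ab_classifier))
# only the first sentence survives the classification step
```

In the real pipeline the stub would be replaced by the trained SVM-based AB or BC model, and the surviving positives feed the filtering and ranking stages shared with the SemRep procedure.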

The results of open discovery are shown in Table 7. In each cell, the number to the right of the slash is the total number of potential target terms obtained, and the number to the left of the slash is the rank of the target term we actually want to obtain. For example, in the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method and the rank of magnesium is 186, while 5349 target terms are discovered with our method and the rank of magnesium is 97. As can be seen from these results, our method not only obtains more potential target terms but also ranks the real target terms higher.

The higher the ranking of the real target term is, the more valuable the results become, and more potential target terms mean more clues about the pathophysiology of disease. For example, in the rediscovery of the Alzheimer's disease and indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither method finds the real target term indomethacin (indomethacin is a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs have been found by our method, such as carprofen, Proxen, and aspirin, and the result of our method finds

Table 8: More target terms about Alzheimer's disease.

Target terms    | Alzheimer's disease
Aspirin         | 1370/2639
Thromboxane     | 1455/2639
Carprofen       | 2291/2639
Prostaglandins  | 2488/2639

more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

The results of the experiments with our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve the performance. For example, in the rediscovery of the physiopathological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 potential target terms with our method. Although our method finds many more potential target terms than the SemRep method, sometimes the result of the SemRep method has a higher rank (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by the SemRep method and our method, respectively; then we retain the intersection of the two result sets. For a term in the intersection, its score is set as the sum of its scores in the original result sets, and the terms are ranked by this score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have higher ranks than both original methods. The reason is that in the intersection of the two result sets many potential target terms are eliminated, while the real target terms are not eliminated, since such terms usually have a real association with the initial term and are therefore retained by both methods.
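The combination step just described can be sketched directly: keep only the terms returned by both methods and re-rank them by the sum of their scores. The scores below are toy values (higher assumed better), not the paper's actual ranking scores.

```python
# Minimal sketch of the combined-result-set procedure.
semrep_scores = {"indoprofen": 4.0, "aspirin": 1.5, "spurious-a": 0.5}
ab_bc_scores = {"indoprofen": 3.0, "prostaglandins": 2.0, "aspirin": 2.5}

def combine(scores1, scores2):
    shared = scores1.keys() & scores2.keys()         # intersection of result sets
    summed = {t: scores1[t] + scores2[t] for t in shared}  # sum of original scores
    return sorted(summed.items(), key=lambda kv: kv[1], reverse=True)

print(combine(semrep_scores, ab_bc_scores))
# -> [('indoprofen', 7.0), ('aspirin', 4.0)]
```

Note how the term returned by only one method ("spurious-a") is dropped by the intersection, which is exactly why the combined set shrinks while real target terms, returned by both methods, survive with an improved rank.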

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses through mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into AB and BC models and constructs the two models with a supervised learning method, respectively. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there


exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with the concept cooccurrence and grammar engineering-based approaches like SemRep, the machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the co-training algorithm is introduced to improve the performance of the two models. Then, through combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from literature.

The experimental results on the three classic Swanson hypotheses show that our approach can achieve better performance than SemRep. This means our approach can discover more potential correlations between the initial terms and target terms, and therefore it may help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.

[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526–557, 1988.

[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157–186, 1990.

[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548–557, 2001.

[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252–259, 2003.

[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396–413, 2004.

[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600–611, 2006.

[8] D. Cameron, O. Bodenreider, H. Yalamanchili et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.

[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.

[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552–1558, Sierre, Switzerland, March 2010.

[11] D. Hristovski, C. Friedman, T. C. Rindflesch et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349–353, 2006.

[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218–225, IEEE, October 2012.

[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209–220, Maui, Hawaii, USA, January 2007.

[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.

[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.

[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the Workshop on Linking Biological Literature, Ontologies and Databases, HLT-NAACL, Detroit, Mich, USA, 2004.

[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics, 2003.

[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.

[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, no. 11, article S2, 2008.

[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100, 1998.

[21] W. Wang and Z. H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 454–465, Springer, Berlin, Germany, 2007.

[22] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249–254, 1996.

[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.

[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.

[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.

[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.

[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.

[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.

[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.

[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.

[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?" Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.


[2] D R Swanson ldquoMigraine and magnesium eleven neglectedconnectionsrdquo Perspectives in Biology and Medicine vol 31 no4 pp 526ndash557 1988

[3] D R Swanson ldquoSomatomedin C and arginine implicit con-nections between mutually isolated literaturesrdquo Perspectives inBiology and Medicine vol 33 no 2 pp 157ndash186 1990

[4] M Weeber H Klein L T W de Jong-van den Berg and RVos ldquoUsing concepts in literature-based discovery simulatingSwansonrsquos Raynaud-fish oil andmigraine-magnesium discover-iesrdquo Journal of the American Society for Information Science andTechnology vol 52 no 7 pp 548ndash557 2001

[5] M Weeber R Vos H Klein L T W de Jong-van den BergA R Aronson and G Molema ldquoGenerating hypotheses bydiscovering implicit associations in the literature a case reportof a search for new potential therapeutic uses for thalidomiderdquoJournal of the AmericanMedical Informatics Association vol 10no 3 pp 252ndash259 2003

[6] P Srinivasan ldquoText mining generating hypotheses fromMED-LINErdquo Journal of the American Society for Information Scienceand Technology vol 55 no 5 pp 396ndash413 2004

[7] M Yetisgen-Yildiz and W Pratt ldquoUsing statistical andknowledge-based approaches for literature-based discoveryrdquo

Journal of Biomedical Informatics vol 39 no 6 pp 600ndash6112006

[8] D Cameron O Bodenreider H Yalamanchili et al ldquoA graph-based recovery and decomposition of Swansonrsquos hypothesisusing semantic predicationsrdquo Journal of Biomedical Informaticsvol 46 no 2 pp 238ndash251 2013

[9] X Hu X Zhang I Yoo X Wang and J Feng ldquoMininghidden connections among biomedical concepts from disjointbiomedical literature sets through semantic-based associationrulerdquo International Journal of Intelligent Systems vol 25 no 2pp 207ndash223 2010

[10] T Miyanishi K Seki and K Uehara ldquoHypothesis generationand ranking based on event similaritiesrdquo in Proceedings of the25th Annual ACM Symposium on Applied Computing (SACrsquo 10)pp 1552ndash1558 Sierre Switzerland March 2010

[11] D Hristovski C Friedman T C Rindflesch et al ldquoExploit-ing semantic relations for literature-based discoveryrdquo AMIAAnnual Symposium Proceedings vol 2006 pp 349ndash353 2006

[12] T Cohen D Widdows R W Schvaneveldt and T C Rind-flesch ldquoDiscovery at a distance farther journeys in predicationspacerdquo in Proceedings of the IEEE International Conference onBioinformatics and Biomedicine Workshops (BIBMW rsquo12) pp218ndash225 IEEE October 2012

[13] C B Ahlers M Fiszman D Demner-Fushman F Langand T C Rindflesch ldquoExtracting semantic predications fromMEDLINE citations for pharmacogenomicsrdquo in Proceedings ofthe Pacific Symposium on Biocomputing pp 209ndash220 MauiHawaii USA January 2007

[14] M Krallinger F Leitner C Rodriguez-Penagos and A Valen-cia ldquoOverview of the protein-protein interaction annotationextraction task of BioCreative IIrdquo Genome Biology vol 9supplement 2 article S4 2008

[15] H Kilicoglu D Shin M Fiszman G Rosemblat and TC Rindflesch ldquoSemMedDB a PubMed-scale repository ofbiomedical semantic predicationsrdquo Bioinformatics vol 28 no23 pp 3158ndash3160 2012

[16] M Light X Y Qiu and P Srinivasan ldquoThe language ofbioscience facts speculations and statements in betweenrdquo inProceedings of the Workshop on Linking Biological LiteratureOntologies and Databases HLT-NAACL Ed Detroit MichUSA 2004

[17] D Klein and C D Manning ldquoAccurate unlexicalized parsingrdquoin Proceedings of the 41st Annual Meeting on Association forComputational Linguistics vol 1 pp 423ndash430 Association forComputational Linguistics 2003

[18] N Cristianini and J Shawe-Taylor An Introduction to SupportVector Machines and Other Kernel-Based Learning MethodsCambridge University Press New York NY USA 2000

[19] A Airola S Pyysalo J Bjorne T Pahikkala F Ginter and TSalakoski ldquoAll-paths graph kernel for protein-protein interac-tion extraction with evaluation of cross-corpus learningrdquo BMCBioinformatics vol 9 no 11 article 52 2008

[20] A Blum and T Mitchell ldquoCombining labeled and unlabeleddata with co-trainingrdquo in Proceedings of the 11th Annual Con-ference on Computational Learning Theory pp 92ndash100 1998

[21] W Wang and Z H Zhou ldquoAnalyzing co-training style algo-rithmsrdquo in Machine Learning ECML 2007 vol 4701 of LectureNotes in Computer Science pp 454ndash465 Springer BerlinGermany 2007

[22] J Carletta ldquoAssessing agreement on classification tasks thekappa statisticrdquo Computational Linguistics vol 22 no 2 pp249ndash254 1996

12 BioMed Research International

[23] R A DiGiacomo J M Kremer and D M Shah ldquoFish-oil dietary supplementation in patients with Raynaudrsquos phe-nomenon a double-blind controlled prospective studyrdquo TheAmerican Journal of Medicine vol 86 no 2 pp 158ndash164 1989

[24] J D Wren R Bekeredjian J A Stewart R V Shohet and H RGarner ldquoKnowledge discovery by automated identification andranking of implicit relationshipsrdquo Bioinformatics vol 20 no 3pp 389ndash398 2004

[25] P L McGeer and E G McGeer ldquoNSAIDs and Alzheimerdisease epidemiological animal model and clinical studiesrdquoNeurobiology of Aging vol 28 no 5 pp 639ndash647 2007

[26] J L Eriksen S A Sagi T E Smith et al ldquoNSAIDs and enanti-omers of flurbiprofen target 120574-secretase and lower A12057342 invivordquo Journal of Clinical Investigation vol 112 no 3 pp 440ndash449 2003

[27] SWeggen J L Eriksen P Das et al ldquoA subset of NSAIDs loweramyloidogenicA12057342 independently of cyclooxygenase activityrdquoNature vol 414 no 6860 pp 212ndash216 2001

[28] Y Avramovich T Amit and M B H Youdim ldquoNon-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor proteinrdquo The Journal of BiologicalChemistry vol 277 no 35 pp 31466ndash31473 2002

[29] D L Simmons R M Botting and T Hla ldquoCyclooxygenaseisozymes the biology of prostaglandin synthesis and inhibi-tionrdquo Pharmacological Reviews vol 56 no 3 pp 387ndash437 2004

[30] J Rogers L C Kirby S R Hempelman et al ldquoClinical trial ofindomethacin in Alzheimerrsquos diseaserdquo Neurology vol 43 no 8pp 1609ndash1611 1993

[31] S Weggen M Rogers and J Eriksen ldquoNSAIDs smallmolecules for prevention of Alzheimerrsquos disease or precursorsfor future drug developmentrdquo Trends in Pharmacological Sci-ences vol 28 no 10 pp 536ndash543 2007


10 BioMed Research International

Table 7: Result of open discovery.

Initial terms                                 SemRep     Our method    Combined result set
Raynaud disease and eicosapentaenoic acid     68/380     230/5762      61/246
Migraine disorders and magnesium              186/535    97/5349       25/297
Alzheimer's disease and indomethacin          —/650      —/2639        —/275
Alzheimer's disease and indoprofen            373/650    250/2639      38/275


The processing of open discovery with the AB-BC model is shown in Figure 8. The differences between the two processes are shown in the blue dashed boxes (Figure 7) and the yellow dashed boxes (Figure 8). Instead of obtaining all the sentences from the SemRep database (as shown in the blue dashed boxes in Figure 7), our method applies two steps to obtain the data: first, we obtain raw sentences from the Medline database; second, all the sentences are classified by either the AB model or the BC model, and we collect the set of positive examples as experimental data. The other procedures (filtering and ranking) are the same as in the SemRep method.
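The two-step data collection described above can be sketched as follows (an illustrative fragment only: `collect_positive_sentences` is a hypothetical name, and `ab_model`/`bc_model` stand in for the trained sentence classifiers, assumed here to be callables returning True for positive sentences):

```python
def collect_positive_sentences(sentences, ab_model, bc_model):
    """Step 2 of the data collection: run every raw Medline sentence
    through the AB and BC classifiers and keep the positives."""
    ab_positive = [s for s in sentences if ab_model(s)]
    bc_positive = [s for s in sentences if bc_model(s)]
    return ab_positive, bc_positive
```

The two positive sets then feed the same filtering and ranking steps used in the SemRep pipeline.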

The results of open discovery are shown in Table 7. The number to the right of the slash is the total number of potential target terms obtained, and the number to the left of the slash is the ranking of the real target term. For example, in the rediscovery of the migraine-magnesium association, we obtain 535 potential target terms with the SemRep method, with magnesium ranked 186th, while 5349 target terms are discovered with our method, with magnesium ranked 97th. As these results show, our method not only obtains more potential target terms but also ranks the real target terms higher.

The higher the ranking of the real target term, the more valuable the results become, and more potential target terms mean more clues about the pathophysiology of a disease. For example, in the rediscovery of the Alzheimer's disease-indomethacin hypothesis, 650 and 2639 potential target terms are found by the SemRep method and our method, respectively. Although neither method finds the real target term indomethacin (a nonsteroidal anti-inflammatory drug (NSAID) commonly used as a prescription medication to reduce fever, pain, stiffness, and swelling), both methods find another NSAID, indoprofen, and many epidemiological studies have supported the hypothesis that chronic intake of NSAIDs is associated with a reduced risk of Alzheimer's disease [25–27]. Moreover, many other NSAIDs are found by our method, such as carprofen, Proxen, and aspirin, and our method also finds more clues about the pathophysiology of Alzheimer's disease: prostaglandins [28–30] and thromboxane [31]. The detailed results are shown in Table 8.

Table 8: More target terms about Alzheimer's disease.

Target terms      Alzheimer's disease
Aspirin           1370/2639
Thromboxane       1455/2639
Carprofen         2291/2639
Prostaglandins    2488/2639

The results of our method outperform almost all the results of the SemRep method. In addition, we found that combining the results of the two methods can further improve performance. For example, in the rediscovery of the pathophysiological hypothesis of Raynaud's phenomenon, we obtain 380 potential target terms with the SemRep method and 5762 with our method. Although our method finds many more potential target terms than the SemRep method, the SemRep result sometimes has a better ranking (68) than ours (230).

Therefore, we took the following steps to combine the results of the two methods: first, we obtain the result sets returned by the SemRep method and our method, respectively; then we retain the intersection of the two result sets. For each term in the intersection, its score is set to the sum of its scores in the original result sets, and the terms are ranked by this combined score. The detailed results are also shown in Table 7, and it can be seen that the results from the "combined result set" have higher rankings than either original method. The reason is that the intersection eliminates many spurious potential target terms while keeping the real target terms, since such terms usually have a real association with the initial term and are therefore retained by both methods.
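The combination procedure above can be sketched as follows (a minimal illustration under the assumption that each method assigns a numeric relevance score to every candidate target term, with higher scores ranked first; the function name and score dictionaries are hypothetical):

```python
def combine_result_sets(semrep_scores, our_scores):
    """Keep only terms found by BOTH methods, score each term by the
    sum of its two original scores, and rank by the combined score."""
    shared = set(semrep_scores) & set(our_scores)
    combined = {term: semrep_scores[term] + our_scores[term] for term in shared}
    # Best-scoring terms first.
    return sorted(combined, key=combined.get, reverse=True)
```

Terms proposed by only one method drop out at the intersection step, which is why method-specific noise is pruned while real target terms, found by both methods, survive.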

6. Conclusions

Biomedical literature is growing at an explosive speed, and researchers can form biomedical hypotheses through mining this literature.

In this paper, we present a supervised learning based approach to generate hypotheses from biomedical literature. The approach splits the classic ABC model into an AB model and a BC model and constructs the two models with a supervised learning method. The purpose of the AB model is to determine whether a physiological phenomenon (linking term) is caused by a disease (initial term) in a sentence, and the BC model is used to judge whether there exists an entity (target term) having physiological effects (linking term) on human beings in a sentence. Compared with concept cooccurrence and grammar engineering-based approaches like SemRep, machine learning based AB and BC models can achieve better performance in mining associations between bioentities from texts.

In addition, the cotraining algorithm is introduced to improve the performance of the two models. Then, by combining the two models, the approach reconstructs the ABC model and generates biomedical hypotheses from literature.
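The cotraining idea can be illustrated with a minimal sketch of a Blum-Mitchell-style loop [20] (illustrative only: the `fit`/`predict`/`confidence` methods are an assumed classifier interface, not the paper's actual SVM models, and the two feature views are placeholders):

```python
def co_train(model1, model2, X1, X2, y, U1, U2, rounds=5):
    """Minimal co-training loop: each round, every model labels the
    unlabeled example it is most confident about, and that example
    (with its predicted label) augments the training data of BOTH
    models. X1/X2 are the labeled data in each view, y the shared
    labels, U1/U2 the same unlabeled pool in each view."""
    X1, X2, y = list(X1), list(X2), list(y)
    U1, U2 = list(U1), list(U2)
    for _ in range(rounds):
        if not U1:
            break
        model1.fit(X1, y)
        model2.fit(X2, y)
        for model, view in ((model1, U1), (model2, U2)):
            if not U1:
                break
            # Pick the unlabeled example this model is most confident about.
            scores = [model.confidence(x) for x in view]
            i = scores.index(max(scores))
            x1, x2 = U1.pop(i), U2.pop(i)
            label = model.predict(x1 if model is model1 else x2)
            X1.append(x1)
            X2.append(x2)
            y.append(label)
    model1.fit(X1, y)
    model2.fit(X2, y)
    return model1, model2
```

In the paper's setting, the two views would correspond to different feature representations of the candidate sentences for each of the AB and BC models.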

The experimental results on the three classic Swanson hypotheses show that our approach achieves better performance than SemRep. This means our approach can discover more potential correlations between the initial terms and target terms, and it may therefore help find more potential principles for the treatment of certain diseases.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work is supported by grants from the Natural Science Foundation of China (nos. 61070098, 61272373, and 61340020), the Trans-Century Training Program Foundation for the Talents by the Ministry of Education of China (NCET-13-0084), and the Fundamental Research Funds for the Central Universities (nos. DUT13JB09 and DUT14YQ213).

References

[1] D. R. Swanson, "Fish oil, Raynaud's syndrome, and undiscovered public knowledge," Perspectives in Biology and Medicine, vol. 30, no. 1, pp. 7–18, 1986.
[2] D. R. Swanson, "Migraine and magnesium: eleven neglected connections," Perspectives in Biology and Medicine, vol. 31, no. 4, pp. 526–557, 1988.
[3] D. R. Swanson, "Somatomedin C and arginine: implicit connections between mutually isolated literatures," Perspectives in Biology and Medicine, vol. 33, no. 2, pp. 157–186, 1990.
[4] M. Weeber, H. Klein, L. T. W. de Jong-van den Berg, and R. Vos, "Using concepts in literature-based discovery: simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries," Journal of the American Society for Information Science and Technology, vol. 52, no. 7, pp. 548–557, 2001.
[5] M. Weeber, R. Vos, H. Klein, L. T. W. de Jong-van den Berg, A. R. Aronson, and G. Molema, "Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide," Journal of the American Medical Informatics Association, vol. 10, no. 3, pp. 252–259, 2003.
[6] P. Srinivasan, "Text mining: generating hypotheses from MEDLINE," Journal of the American Society for Information Science and Technology, vol. 55, no. 5, pp. 396–413, 2004.
[7] M. Yetisgen-Yildiz and W. Pratt, "Using statistical and knowledge-based approaches for literature-based discovery," Journal of Biomedical Informatics, vol. 39, no. 6, pp. 600–611, 2006.
[8] D. Cameron, O. Bodenreider, H. Yalamanchili et al., "A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications," Journal of Biomedical Informatics, vol. 46, no. 2, pp. 238–251, 2013.
[9] X. Hu, X. Zhang, I. Yoo, X. Wang, and J. Feng, "Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule," International Journal of Intelligent Systems, vol. 25, no. 2, pp. 207–223, 2010.
[10] T. Miyanishi, K. Seki, and K. Uehara, "Hypothesis generation and ranking based on event similarities," in Proceedings of the 25th Annual ACM Symposium on Applied Computing (SAC '10), pp. 1552–1558, Sierre, Switzerland, March 2010.
[11] D. Hristovski, C. Friedman, T. C. Rindflesch et al., "Exploiting semantic relations for literature-based discovery," AMIA Annual Symposium Proceedings, vol. 2006, pp. 349–353, 2006.
[12] T. Cohen, D. Widdows, R. W. Schvaneveldt, and T. C. Rindflesch, "Discovery at a distance: farther journeys in predication space," in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '12), pp. 218–225, IEEE, October 2012.
[13] C. B. Ahlers, M. Fiszman, D. Demner-Fushman, F. Lang, and T. C. Rindflesch, "Extracting semantic predications from MEDLINE citations for pharmacogenomics," in Proceedings of the Pacific Symposium on Biocomputing, pp. 209–220, Maui, Hawaii, USA, January 2007.
[14] M. Krallinger, F. Leitner, C. Rodriguez-Penagos, and A. Valencia, "Overview of the protein-protein interaction annotation extraction task of BioCreative II," Genome Biology, vol. 9, supplement 2, article S4, 2008.
[15] H. Kilicoglu, D. Shin, M. Fiszman, G. Rosemblat, and T. C. Rindflesch, "SemMedDB: a PubMed-scale repository of biomedical semantic predications," Bioinformatics, vol. 28, no. 23, pp. 3158–3160, 2012.
[16] M. Light, X. Y. Qiu, and P. Srinivasan, "The language of bioscience: facts, speculations, and statements in between," in Proceedings of the HLT-NAACL Workshop on Linking Biological Literature, Ontologies and Databases, Detroit, Mich, USA, 2004.
[17] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430, Association for Computational Linguistics, 2003.
[18] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.
[19] A. Airola, S. Pyysalo, J. Björne, T. Pahikkala, F. Ginter, and T. Salakoski, "All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning," BMC Bioinformatics, vol. 9, no. 11, article S2, 2008.
[20] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92–100, 1998.
[21] W. Wang and Z. H. Zhou, "Analyzing co-training style algorithms," in Machine Learning: ECML 2007, vol. 4701 of Lecture Notes in Computer Science, pp. 454–465, Springer, Berlin, Germany, 2007.
[22] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249–254, 1996.
[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.
[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.
[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.
[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.
[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.
[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.
[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.
[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.
[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?," Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article Supervised Learning Based Hypothesis Generation …downloads.hindawi.com/journals/bmri/2015/698527.pdf · 2019. 7. 31. · rized that the hypothesis-generating approaches

BioMed Research International 11

exists an entity (target term) having physiological effects(linking term) on human beings in a sentence Comparedwith the concept cooccurrence and grammar engineering-based approaches like SemRep machine learning based ABand BC models can achieve better performance in miningassociation between bioentities from texts

In addition the cotraining algorithm is introduced toimprove the performance of the two models Then throughcombining the two models the approach reconstructs theABC model and generates biomedical hypotheses from lit-erature

The experimental results on the three classic Swansonrsquoshypotheses show that our approach can achieve better perfor-mance than SemRep This means our approach can discovermore potential correlations between the initial terms andtarget terms and therefore it may help find more potentialprinciples for the treatment of certain diseases

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

This work is supported by the grants from the NaturalScience Foundation of China (nos 61070098 61272373 and61340020) Trans-Century Training Program Foundation forthe Talents by theMinistry of Education of China (NCET-13-0084) and the Fundamental Research Funds for the CentralUniversities (nos DUT13JB09 and DUT14YQ213)

References

[1] D R Swanson ldquoFish oil Raynaudrsquos syndrome and undiscov-ered public knowledgerdquo Perspectives in Biology ampMedicine vol30 no 1 pp 7ndash18 1986

[2] D R Swanson ldquoMigraine and magnesium eleven neglectedconnectionsrdquo Perspectives in Biology and Medicine vol 31 no4 pp 526ndash557 1988

[3] D R Swanson ldquoSomatomedin C and arginine implicit con-nections between mutually isolated literaturesrdquo Perspectives inBiology and Medicine vol 33 no 2 pp 157ndash186 1990

[4] M Weeber H Klein L T W de Jong-van den Berg and RVos ldquoUsing concepts in literature-based discovery simulatingSwansonrsquos Raynaud-fish oil andmigraine-magnesium discover-iesrdquo Journal of the American Society for Information Science andTechnology vol 52 no 7 pp 548ndash557 2001

[5] M Weeber R Vos H Klein L T W de Jong-van den BergA R Aronson and G Molema ldquoGenerating hypotheses bydiscovering implicit associations in the literature a case reportof a search for new potential therapeutic uses for thalidomiderdquoJournal of the AmericanMedical Informatics Association vol 10no 3 pp 252ndash259 2003

[6] P Srinivasan ldquoText mining generating hypotheses fromMED-LINErdquo Journal of the American Society for Information Scienceand Technology vol 55 no 5 pp 396ndash413 2004

[7] M Yetisgen-Yildiz and W Pratt ldquoUsing statistical andknowledge-based approaches for literature-based discoveryrdquo

Journal of Biomedical Informatics vol 39 no 6 pp 600ndash6112006

[8] D Cameron O Bodenreider H Yalamanchili et al ldquoA graph-based recovery and decomposition of Swansonrsquos hypothesisusing semantic predicationsrdquo Journal of Biomedical Informaticsvol 46 no 2 pp 238ndash251 2013

[9] X Hu X Zhang I Yoo X Wang and J Feng ldquoMininghidden connections among biomedical concepts from disjointbiomedical literature sets through semantic-based associationrulerdquo International Journal of Intelligent Systems vol 25 no 2pp 207ndash223 2010

[10] T Miyanishi K Seki and K Uehara ldquoHypothesis generationand ranking based on event similaritiesrdquo in Proceedings of the25th Annual ACM Symposium on Applied Computing (SACrsquo 10)pp 1552ndash1558 Sierre Switzerland March 2010

[11] D Hristovski C Friedman T C Rindflesch et al ldquoExploit-ing semantic relations for literature-based discoveryrdquo AMIAAnnual Symposium Proceedings vol 2006 pp 349ndash353 2006

[12] T Cohen D Widdows R W Schvaneveldt and T C Rind-flesch ldquoDiscovery at a distance farther journeys in predicationspacerdquo in Proceedings of the IEEE International Conference onBioinformatics and Biomedicine Workshops (BIBMW rsquo12) pp218ndash225 IEEE October 2012

[13] C B Ahlers M Fiszman D Demner-Fushman F Langand T C Rindflesch ldquoExtracting semantic predications fromMEDLINE citations for pharmacogenomicsrdquo in Proceedings ofthe Pacific Symposium on Biocomputing pp 209ndash220 MauiHawaii USA January 2007

[14] M Krallinger F Leitner C Rodriguez-Penagos and A Valen-cia ldquoOverview of the protein-protein interaction annotationextraction task of BioCreative IIrdquo Genome Biology vol 9supplement 2 article S4 2008

[15] H Kilicoglu D Shin M Fiszman G Rosemblat and TC Rindflesch ldquoSemMedDB a PubMed-scale repository ofbiomedical semantic predicationsrdquo Bioinformatics vol 28 no23 pp 3158ndash3160 2012

[16] M Light X Y Qiu and P Srinivasan ldquoThe language ofbioscience facts speculations and statements in betweenrdquo inProceedings of the Workshop on Linking Biological LiteratureOntologies and Databases HLT-NAACL Ed Detroit MichUSA 2004

[17] D Klein and C D Manning ldquoAccurate unlexicalized parsingrdquoin Proceedings of the 41st Annual Meeting on Association forComputational Linguistics vol 1 pp 423ndash430 Association forComputational Linguistics 2003

[18] N Cristianini and J Shawe-Taylor An Introduction to SupportVector Machines and Other Kernel-Based Learning MethodsCambridge University Press New York NY USA 2000

[19] A Airola S Pyysalo J Bjorne T Pahikkala F Ginter and TSalakoski ldquoAll-paths graph kernel for protein-protein interac-tion extraction with evaluation of cross-corpus learningrdquo BMCBioinformatics vol 9 no 11 article 52 2008

[20] A Blum and T Mitchell ldquoCombining labeled and unlabeleddata with co-trainingrdquo in Proceedings of the 11th Annual Con-ference on Computational Learning Theory pp 92ndash100 1998

[21] W Wang and Z H Zhou ldquoAnalyzing co-training style algo-rithmsrdquo in Machine Learning ECML 2007 vol 4701 of LectureNotes in Computer Science pp 454ndash465 Springer BerlinGermany 2007

[22] J Carletta ldquoAssessing agreement on classification tasks thekappa statisticrdquo Computational Linguistics vol 22 no 2 pp249ndash254 1996

12 BioMed Research International

[23] R. A. DiGiacomo, J. M. Kremer, and D. M. Shah, "Fish-oil dietary supplementation in patients with Raynaud's phenomenon: a double-blind, controlled, prospective study," The American Journal of Medicine, vol. 86, no. 2, pp. 158–164, 1989.

[24] J. D. Wren, R. Bekeredjian, J. A. Stewart, R. V. Shohet, and H. R. Garner, "Knowledge discovery by automated identification and ranking of implicit relationships," Bioinformatics, vol. 20, no. 3, pp. 389–398, 2004.

[25] P. L. McGeer and E. G. McGeer, "NSAIDs and Alzheimer disease: epidemiological, animal model, and clinical studies," Neurobiology of Aging, vol. 28, no. 5, pp. 639–647, 2007.

[26] J. L. Eriksen, S. A. Sagi, T. E. Smith et al., "NSAIDs and enantiomers of flurbiprofen target γ-secretase and lower Aβ42 in vivo," Journal of Clinical Investigation, vol. 112, no. 3, pp. 440–449, 2003.

[27] S. Weggen, J. L. Eriksen, P. Das et al., "A subset of NSAIDs lower amyloidogenic Aβ42 independently of cyclooxygenase activity," Nature, vol. 414, no. 6860, pp. 212–216, 2001.

[28] Y. Avramovich, T. Amit, and M. B. H. Youdim, "Non-steroidal anti-inflammatory drugs stimulate secretion of non-amyloidogenic precursor protein," The Journal of Biological Chemistry, vol. 277, no. 35, pp. 31466–31473, 2002.

[29] D. L. Simmons, R. M. Botting, and T. Hla, "Cyclooxygenase isozymes: the biology of prostaglandin synthesis and inhibition," Pharmacological Reviews, vol. 56, no. 3, pp. 387–437, 2004.

[30] J. Rogers, L. C. Kirby, S. R. Hempelman et al., "Clinical trial of indomethacin in Alzheimer's disease," Neurology, vol. 43, no. 8, pp. 1609–1611, 1993.

[31] S. Weggen, M. Rogers, and J. Eriksen, "NSAIDs: small molecules for prevention of Alzheimer's disease or precursors for future drug development?" Trends in Pharmacological Sciences, vol. 28, no. 10, pp. 536–543, 2007.
