Proceedings of the 2018 EMNLP Workshop BioASQ: Large-scale Biomedical Semantic Indexing and Question Answering, pages 79–89, Brussels, Belgium, November 1st, 2018. ©2018 Association for Computational Linguistics


Ontology-Based Retrieval & Neural Approaches for BioASQ Ideal Answer Generation

Ashwin Naresh Kumar∗, Harini Kesavamoorthy∗, Madhura Das∗, Pramati Kalwad∗, Khyathi Raghavi Chandu, Teruko Mitamura and Eric Nyberg

Language Technologies Institute, Carnegie Mellon University
{anareshk,hkesavam,madhurad,pkalwad,kchandu,teruko,ehn}@cs.cmu.edu

Abstract

The ever-increasing magnitude of biomedical information sources makes it difficult and time-consuming for a human researcher to find the most relevant documents and pinpointed answers for a specific question or topic when using only a traditional search engine. Biomedical Question Answering systems automatically identify the most relevant documents and pinpointed answers, given an information need expressed as a natural language question. Generating a non-redundant, human-readable summary that satisfies the information need of a given biomedical question is the focus of the Ideal Answer Generation task, part of the BioASQ challenge. This paper presents a system for ideal answer generation (using ontology-based retrieval and a neural learning-to-rank approach, combined with extractive and abstractive summarization techniques) which achieved the highest ROUGE score of 0.659 on the BioASQ 5b batch 2 test.

1 Introduction

In this paper, we describe our attempts to address the Ideal Answer Generation task of the sixth edition of the BioASQ challenge,1 which is a large-scale semantic indexing and question answering challenge in the biomedical domain. In particular, the sub-task of Phase B of this annual challenge is to develop a system for query-oriented summarization. Traditionally, there are two classes of summarization techniques, each having its own merits and pitfalls: (1) extractive and (2) abstractive. While extractive techniques patch relevant sentences together, enabling them to generate grammatically robust summaries, they flounder on maintaining coherence and readability. On the contrary, abstractive techniques extract relevant information from the original text,

∗ denotes equal contribution
1 bioasq.org

which is then used to generate a novel natural language summary. While abstractive techniques are more succinct and coherent, automatic text generation is prone to grammatical error. This directly implies that extractive summarization techniques should perform well on automatic evaluation metrics (such as ROUGE), but do less well on human evaluation measures which account for precision, repetition and readability. We explore the hypothesis that a combination of these techniques will provide better overall performance on the ideal answer task, when compared with either approach used in isolation.

The dataset we use for development of the current work is released as a part of the sixth edition of the annual BioASQ challenge (Tsatsaronis et al., 2012). The main categories of answers in this data include summary, factoid, list and yes/no. There are a total of 2,251 questions, each of which is accompanied by a list of relevant documents and a list of relevant snippets extracted from each of these documents. Our model is an extension of the highest ROUGE-scoring model in the final test batch of the fifth edition of the BioASQ challenge (Chandu et al., 2017), which is based on Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998). In addition, we attempted abstractive techniques that are scoped to improve the readability and coherence aspects of the problem. We made 4 submissions to the challenge.

The paper is organized as follows: Section 2 describes our overall system architecture and the implementation details. Experiments and results are discussed in Section 3, followed by conclusion and future work in Section 4.

2 System Architecture

The main components of the QA pipeline are outlined in Figure 1. As illustrated, the first step is pre-processing of the question to enrich it with


Figure 1: System Architecture

features derived from standard NLP techniques such as Part of Speech (POS) tagging, Named Entity Recognition (NER) and Medical Entity Recognition (MER). Subsequently, an ontology-based retrieval system is used to retrieve relevant snippets for the question. The retrieved snippets are combined with the given BioASQ snippets for the question and passed to the ranking module. The ranked snippets are then input to the sentence selection module from the existing OAQA pipeline (Chandu et al., 2017), which implements the CoreMMR (Zechner, 2002) and SoftMMR algorithms; these use similarity measures to select the most relevant and least redundant snippets. The selected sentences are then passed to the summarization module, which produces the final summary. Each of these modules is discussed in detail below.

2.1 Ontology-Based Information Retrieval

Although a large amount of biomedical text is available in resources such as NLM (NIH, 2018), it can be difficult to leverage in the absence of supervised or automatic labeling (annotation) of the unstructured text content. Our hypothesis is that an ontology-based retrieval module which utilizes entity and relation extraction techniques to represent and compare the content of questions and candidate answers can improve the recall of answer-bearing documents from unstructured sources.

Our goal is to develop a graphical model that can represent the content of the question and each candidate answer. The nodes in the graph represent medical entities and the edges between them represent the relations between the entities. We extract relations from the text and index them into the graph based on previously-published work

(Abacha and Zweigenbaum, 2015). The base architecture for the Ontology-based Retrieval module is shown in Figure 2.

Figure 2: Graph Generation Pipeline

For every edge present in the graph, we store the ID of the source abstract, along with the ordinal index of the source sentence.

2.1.1 Relation Extraction

Relation Extraction (RE) is a technology used by an ontology-based retrieval system to capture the semantic relations which exist between the named entities mentioned in the text; both the entities and relations are considered instances of a given set of ontological types. In practice, various NLP toolkits are available for related tasks, such as dependency parsing, semantic role labeling, and subject-verb-object relation extraction.

However, common NLP tools are not easily leveraged on biomedical text, due to dramatic differences in the structure and content of the sentences. There exist tools for relation extraction in sub-domains such as bacteria (Duclos et al., 2007) and disease-cause ontologies (Schriml et al., 2011), but these methods rely heavily on the presence of specific words or features at the sentence level, and cannot be easily scaled to general biomedical text. Most neural methods for training relation extractors require a large (O(10^6)) corpus of labelled examples, which is not available for general biomedical text (Yih et al., 2015). In order to explore the use of ontology-based retrieval, we developed a novel


RE approach, which is described below. The base architecture for the RE module is depicted in Figure 3.

The following 4 steps are employed for extracting relations from a sentence:

1. Noun Phrase Chunking: The sentence is parsed using the TreeTagger POS tagger (Schmid, 1995) to obtain all the noun chunks that form the potential nodes of the graph. For our purpose, the nodes of the graph are all medical and named entities. To enforce this, the potential nodes are passed through a Medical Entity Recognizer (GRAM-CNN) (Zhu et al.) and the Stanford NER (Manning et al., 2014), discarding the chunks that are not recognized. As an example, consider the sentence 'Genomic microarrays have been used to assess DNA replication timing in a variety of eukaryotic organisms.', which yields the noun chunks 'Genomic Microarrays', 'DNA Replication Timing' and 'Eukaryotic Organisms'.

2. Relation Extraction: This step comprises two sub-parts.
(2a) RE using Predicate Argument Structures: The Predicate Argument Structure (PAS) for the sentence, obtained using the Enju parser (Miyao et al., 2008), is further parsed to obtain possible relations for the graph. Possible relations are those that contain arguments related through a verb or a preposition.
(2b) RE transformation through transitivity: Transitivity is applied to the relations obtained from the Enju parser in order to ensure that the arguments of the relations represent medical or named entities in the graph. The potential nodes are passed through the NER and MER; nodes that are not tagged or recognized by either undergo a transitive transformation to give way to new relations. For the example above, the following relations are formed after the transitive transformations: 'Microarrays assess Timing', 'Timing in Organisms' and 'Microarrays in Organisms'.

3. Mapping to CUI: As the same medical entity can be represented in many forms, we map entities to Concept Unique Identifiers (CUIs) from the UMLS Metathesaurus (Bodenreider, 2004) using pymetamap (Aronson, 2001), the Python wrapper for MetaMap (a minimal usage sketch is given after Figure 4). For each noun chunk, the UMLS Metathesaurus is queried to check whether a CUI is present. If not, individual CUIs are obtained for every word forming the noun chunk, and rules are employed to form a hierarchical node structure. For the noun chunks in the example, CUIs are directly available for 'DNA Replication Timing' and 'Eukaryotic Organism' but not for 'Genomic Microarray'. To build the tree structure for this node, the CUIs for 'Microarray' and 'Genomic' are obtained individually; since the latter is an adjective, the former becomes the child of the latter. The final CUIs obtained and the CUI node tree structure for 'Genomic Microarrays' are depicted in Table 4 and Figure 6, respectively. For forming relations, the child nodes of all trees are used for connecting edges.

4. Relation Formation: As the arguments in the relations obtained through PAS are base noun forms that do not represent the whole noun chunk, they are expanded to the full noun chunk. For the example, the relations obtained in 2b are expanded using the noun chunks to form the final relations: 'Genomic Microarrays assess DNA Replication Timing', 'DNA Replication Timing in Eukaryotic Organisms' and 'Genomic Microarrays in Eukaryotic Organisms'. The entities in each relation are then mapped to their CUI-based representation to form a complete relation ready for insertion into, or retrieval from, the graph. The mapped relations for the example look as follows: 'C1709016, C0887950 assess C1257780', 'C1257780 in C0684063' and 'C1709016, C0887950 in C0684063'. Here, the root and child nodes are comma-separated. The final graph structure for the relations is depicted in Figure 4.

Figure 4: Graph obtained by relation extraction.
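The CUI mapping in step 3 can be illustrated with pymetamap. The following is a minimal sketch, assuming a local MetaMap installation (the binary path below is a placeholder), and is not the exact code used in our system:

```python
# Minimal CUI-mapping sketch using pymetamap; the MetaMap path is a
# placeholder and requires a local MetaMap installation.
from pymetamap import MetaMap

mm = MetaMap.get_instance('/opt/public_mm/bin/metamap')

noun_chunks = ['Genomic Microarrays', 'DNA Replication Timing',
               'Eukaryotic Organisms']
concepts, error = mm.extract_concepts(sentences=noun_chunks,
                                      ids=list(range(len(noun_chunks))))

# Each concept carries the CUI and preferred UMLS name; chunks with no
# direct CUI would be split into individual words and re-queried to
# build the hierarchical node structure described in step 3.
for c in concepts:
    print(c.index, c.cui, c.preferred_name)
```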

Figure 3: Architecture of Relation Extraction Module

In order to index the graph, the relation extraction process specified above is utilized. The edges of the graph store additional information such as the abstract ID and the sentence offset within the abstract. To form a relation for a new query and retrieve information from the graph, a back-off mechanism is employed as follows (a query sketch is given after this list):

1. Form relations using the same process used for indexing.
2. Query with all medical entities and obtain all abstracts between pairs of them.
3. Query the medical entities with the verb forms associated with them.
4. Query with just the medical entity and obtain all abstracts related to each of the medical entities.
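A minimal sketch of this back-off using the Neo4j Python driver; the node label (Entity), relation property names (cui, pmids) and the connection details are illustrative assumptions, not the exact schema of our graph:

```python
# Back-off retrieval sketch against the Neo4j graph; labels and property
# names are assumptions based on the schema described in this section.
from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'password'))

def retrieve_abstracts(cuis, verb):
    queries = [
        # Step 2: abstracts connecting any pair of the query entities.
        ("MATCH (a:Entity)-[r]->(b:Entity) "
         "WHERE a.cui IN $cuis AND b.cui IN $cuis RETURN r.pmids AS pmids"),
        # Step 3: relations whose type matches the query verb.
        ("MATCH (a:Entity)-[r]->() "
         "WHERE a.cui IN $cuis AND type(r) = $verb RETURN r.pmids AS pmids"),
        # Step 4: any abstract mentioning a single query entity.
        ("MATCH (a:Entity)-[r]-() "
         "WHERE a.cui IN $cuis RETURN r.pmids AS pmids"),
    ]
    with driver.session() as session:
        for cypher in queries:
            records = session.run(cypher, cuis=cuis, verb=verb)
            pmids = [rec['pmids'] for rec in records]
            if pmids:  # back off only when the stricter query is empty
                return pmids
    return []
```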

2.1.2 Graph Creation

Graph Framework: All PubMed abstracts are tokenized and relations are extracted from them. These are added as relations in the graph. We create custom data structures for the nodes and edges (relations) in Neo4j (Webber, 2012). Every relation has attributes which are comma-separated values of the PubMed ID and the location within the abstract. This is stored in order to retrieve the exact sentence that was used to create a particular relation. We hypothesize that this can improve the retrieval of relevant snippets across the abstracts.

Phase I: Ontology Creation with UMLS Concepts. Part-of-speech tagging is the most intuitive way of approaching the problem of extracting relations from a given text. An initial strategy of forming Subject-Verb-Object (SVO) triplets was based on a left-to-right parse of the text, using an off-the-shelf POS tagger (Schmid, 1994). This is not an effective method for creating the graph, as it misses very important clauses and fails to recognize the noun chunks in sentences. To overcome these limitations, we form a UMLS-based mapping of the noun phrases to the medical concept thesaurus in the UMLS Metathesaurus. Once the CUIs are mapped using MetaMap as in Section 2.1.1, they are prepared to be indexed into the graph. These relations are added into the graph following standard database methods, i.e.,

• If the relation is already present in the graph, then the relation attribute is appended with the PubMed ID and offset in the abstract.

• Every node in the relation is first queried from the graph, and the relation is then added between the retrieved nodes. If no node is retrieved, new nodes are created and the relation is created between them.
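A minimal sketch of these insertion rules with the Neo4j Python driver, under the same assumed schema as above; Cypher's MERGE provides the "query first, create if absent" behavior, and the ON MATCH clause appends the new PubMed ID and offset to the relation attribute:

```python
# Sketch of the insertion rules above (assumed schema: Entity nodes keyed
# by CUI; relation property 'sources' holds "PMID,offset" strings).
from neo4j import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687',
                              auth=('neo4j', 'password'))

def add_relation(cui_a, cui_b, verb, pmid, offset):
    with driver.session() as session:
        session.run(
            # MERGE creates the nodes and the relation only if missing...
            "MERGE (a:Entity {cui: $cui_a}) "
            "MERGE (b:Entity {cui: $cui_b}) "
            "MERGE (a)-[r:RELATES {verb: $verb}]->(b) "
            "ON CREATE SET r.sources = [$src] "
            # ...and otherwise appends the new PMID/offset to the attribute.
            "ON MATCH SET r.sources = r.sources + $src",
            cui_a=cui_a, cui_b=cui_b, verb=verb,
            src=f"{pmid},{offset}")
```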

Phase II: Adding Transitivity to Relations. The limitation of Phase I was that, despite accurate relations from the relation extractor, the mapping in the graph for different clauses joined together with different prepositions was not solved. We solve this with the following method. First, a key-value pair is added as an attribute to the relation, where every key is a preposition and the value is the node ID of the noun chunk associated with the preposition. For example, in the sentence "Genomic microarrays have been used to assess DNA replication timing in a variety of eukaryotic organisms", the clause "in a variety of eukaryotic organisms" would be missed in Phase I of ontology creation. In Phase II, we convert the relation such that the verb "assess" has an attribute "{in, node_xyz}", where node_xyz is the node pertaining to the CUI of "eukaryotic organisms".
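Continuing the same sketch, the Phase II preposition attribute could be stored as an additional relation property; the preps list format below is an illustrative choice, not necessarily the representation used in our graph:

```python
# Phase II sketch: attach a prepositional clause to an existing relation,
# e.g. the verb 'assess' gains "in:<CUI of eukaryotic organisms>".
# (Reuses `driver` from the previous sketch.)
def add_preposition(cui_a, cui_b, verb, prep, target_cui):
    with driver.session() as session:
        session.run(
            "MATCH (a:Entity {cui: $cui_a})"
            "-[r:RELATES {verb: $verb}]->(b:Entity {cui: $cui_b}) "
            # append 'preposition:node' to the relation's attribute list
            "SET r.preps = coalesce(r.preps, []) + ($prep + ':' + $target_cui)",
            cui_a=cui_a, cui_b=cui_b, verb=verb,
            prep=prep, target_cui=target_cui)
```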

2.2 Ranking

Information Retrieval is one of the essential components of a Question Answering pipeline: it provides relevant information to the pipeline for more accurate answers. Ranking snippets by relevance to the question improves the answer selection process and in turn yields more relevant answers; we employ Learning to Rank (LETOR) (Qin et al., 2010) methods for this purpose. The output of the ontology-based information retrieval is a set of relevant snippets. We combine the given BioASQ snippets with the ontology-retrieved snippets and rank them according to relevance to the question; the ranking algorithm outputs a set of ranked snippets relevant to the question. The ontology-retrieved snippets may include irrelevant ones. To prevent this error from propagating further into the pipeline, we apply a simple BM25 (Robertson and Zaragoza, 2009) score threshold between each snippet and the question.


We discard snippets whose BM25 score falls below a certain threshold. LETOR is highly feature-driven, which necessitates a good amount of feature engineering; the features explored are listed in the next section. In this paper, we explore two LETOR approaches: (a) RankSVM and (b) a listwise neural approach.
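A minimal sketch of this filter using the rank_bm25 package; the threshold value is illustrative, not the tuned one:

```python
# BM25 threshold filter for ontology-retrieved snippets; the threshold
# here is an illustrative value, not the one tuned for our submissions.
from rank_bm25 import BM25Okapi

def filter_snippets(question, snippets, threshold=2.0):
    corpus = [s.lower().split() for s in snippets]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(question.lower().split())
    # keep only snippets scoring at or above the threshold
    return [s for s, sc in zip(snippets, scores) if sc >= threshold]
```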

2.2.1 Feature Engineering

Multiple features were explored as inputs to the LETOR framework. The features can be divided into three major categories: (1) statistical features, (2) semantic features and (3) syntactic features. The statistical features include the length of the snippet, the BM25 score between query and snippet, the dot product of the TF-IDF vectors of query and snippet, the cosine similarity over the TF-IDF vectors (along with a log-space representation), the number of bi-grams in the intersection of query and snippet, and the Jaccard similarity score. The semantic features include averaged word2vec representations across snippets. The syntactic features are the number of medical entities and a bag-of-words representation of medical entities.
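The statistical features can be illustrated with a short sketch using scikit-learn TF-IDF vectors; this re-implements the feature list above for one (query, snippet) pair and is not the exact production code:

```python
# Illustrative re-implementation of the statistical LETOR features
# for a single (query, snippet) pair.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def bigrams(tokens):
    return set(zip(tokens, tokens[1:]))

def statistical_features(query, snippet, vectorizer):
    q_vec = vectorizer.transform([query]).toarray()[0]
    s_vec = vectorizer.transform([snippet]).toarray()[0]
    dot = float(np.dot(q_vec, s_vec))                   # TF-IDF dot product
    norm = np.linalg.norm(q_vec) * np.linalg.norm(s_vec)
    cosine = dot / norm if norm else 0.0                # cosine similarity
    q_tok, s_tok = query.lower().split(), snippet.lower().split()
    jaccard = len(set(q_tok) & set(s_tok)) / len(set(q_tok) | set(s_tok))
    bigram_overlap = len(bigrams(q_tok) & bigrams(s_tok))
    return [len(s_tok), dot, cosine, jaccard, bigram_overlap]

# The vectorizer would be fit on the snippet corpus beforehand:
# vectorizer = TfidfVectorizer().fit(all_snippets)
```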

2.2.2 Quasi Ground Truth Creation

An initial challenge in formulating the LETOR framework is that the ground-truth ranking of the snippets is not provided in the BioASQ training data. The primary purpose of the LETOR model is to rank more relevant snippets higher in order to obtain a higher ROUGE score. Taking this into account, we created the ground truth from the BM25 scores between each snippet and the ideal answer. The scores were calculated according to the formula in Equation 1, and the snippets were ranked according to these scores for each question.

RelevanceScore = score_BM25(ideal answer, snippet)        (1)
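A minimal sketch of this quasi ground truth creation, again with the rank_bm25 package:

```python
# Quasi ground truth: score each of a question's snippets against the
# ideal answer with BM25, per Equation 1, and rank by descending score.
from rank_bm25 import BM25Okapi

def quasi_labels(ideal_answer, snippets):
    bm25 = BM25Okapi([s.lower().split() for s in snippets])
    scores = bm25.get_scores(ideal_answer.lower().split())
    # higher score means more relevant to the ideal answer
    order = sorted(range(len(snippets)), key=lambda i: -scores[i])
    return scores, order
```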

2.2.3 RankSVM

RankSVM (Cao et al., 2006) is a pairwise LETOR approach to ranking documents. For each question, each pair of snippets is labeled -1 if the second snippet is ranked higher and +1 if the second snippet is ranked lower. A pairwise approach carries the overhead of maintaining metadata, since we need to know which pair of snippets is being input to the SVM when validating the model. Consider F(Q, S1), a feature representation of the question Q and snippet S1, and similarly F(Q, S2) for question Q and snippet S2. F(Q, S1) and F(Q, S2) are the inputs to the SVM, and the SVM predicts -1 or +1 according to the relative ranking of S1 and S2.
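RankSVM can be emulated by training a linear SVM on feature differences F(Q, S1) − F(Q, S2); a minimal sketch, assuming the feature vectors and quasi ground truth scores for one question are already computed:

```python
# Pairwise RankSVM sketch: train a linear SVM on feature differences,
# labelled by which snippet the quasi ground truth ranks higher.
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_examples(features, scores):
    # features: (n_snippets, n_features) array for one question
    # scores: quasi ground truth BM25 scores for the same snippets
    X, y = [], []
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            if scores[i] == scores[j]:
                continue  # ties carry no pairwise preference
            X.append(features[i] - features[j])
            y.append(+1 if scores[i] > scores[j] else -1)
    return np.array(X), np.array(y)

# X, y = pairwise_examples(features, scores)
# ranker = LinearSVC().fit(X, y)  # sign of w.x gives the pairwise order
```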

2.2.4 Neural Ranking Approach

The second approach we implemented is a listwise ranking approach inspired by ListNet (Cao et al., 2007). Every data point is a feature representation of the question Q and the n-th snippet Sn. The neural network is trained against the BM25 scores between the snippet and the ideal answer. The architecture of the network is a two-layer MLP with ReLU activations; the final layer is a linear layer of size 1. In an ideal scenario, where we had the ground-truth rankings of the snippets, it would be intuitive to use a probabilistic loss. In our case, since we use proxy gold ranks derived from the BM25 score, it is more intuitive for the model to learn to estimate the scores rather than the relative ranking of the snippets. Hence, we use an RMSE loss, as we want the model to estimate the BM25 scores. The RMSE loss is calculated per question, since we want to learn the distribution of snippets with respect to a single question and not across the complete dataset. The final ranking of the snippets is determined by the scores the model predicts.
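A minimal PyTorch sketch of this ranker; the hidden size and learning rate are illustrative choices:

```python
# Listwise neural ranker sketch: MLP with ReLU activations and a
# size-1 linear output, trained per question with an RMSE loss
# against the BM25 targets.
import torch
import torch.nn as nn

class SnippetScorer(nn.Module):
    def __init__(self, n_features, hidden=64):  # hidden size illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # final linear layer of size 1

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train_step(model, optimizer, question_feats, bm25_targets):
    # question_feats: (n_snippets, n_features) for a single question
    optimizer.zero_grad()
    pred = model(question_feats)
    loss = torch.sqrt(nn.functional.mse_loss(pred, bm25_targets))  # RMSE
    loss.backward()
    optimizer.step()
    return loss.item()

# model = SnippetScorer(n_features=10)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Final ranking: sort a question's snippets by model output, descending.
```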

2.3 Summarization

Summarization is the final stage in the question answering pipeline. The ranked snippet sentences feed into the summarization module, which outputs the ideal answers. For ideal answer generation, two types of summarization techniques can be employed: extractive and abstractive. Extractive summarization works by selecting the most relevant sentences in a document to generate the summary (Allahyari et al., 2017). The summaries generated using this technique generally obtain high ROUGE scores (Lin, 2004) due to the high n-gram overlap between the generated summary and the ideal answer. Abstractive summarization, on the other hand, generates the summary word by word, as opposed to picking whole sentences as in extractive summarization. Recent advances in abstractive summarization using Pointer-Generator Coverage (PGC) networks (See et al., 2017) have shown that neural sequence-to-sequence models can generate abstractive summaries which are readable and have high ROUGE scores. Given that these networks can generate human-readable summaries with high ROUGE scores, we decided to use this model as the ideal answer generation module in our question answering pipeline. We show that the neural sequence model is able to generate ideal answers in the biomedical domain. More concretely, we see that a pretrained PGC model can be transferred to the biomedical domain to generate ideal answers, i.e., the model is able to handle new words (mainly biomedical terms) without generating any unknown tags (<UNK>) in the summaries, which would hinder readability. We also show that fine-tuning the model on the BioASQ data generates better answers in terms of ROUGE scores.
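We do not reproduce the pointer-generator training code here; the sketch below only illustrates how BioASQ questions might be reshaped into the (source, target) pairs used for fine-tuning, with the ranked snippets as the source and the ideal answer as the target. Field names follow the BioASQ JSON format:

```python
# Reshape BioASQ data into (source, target) pairs for fine-tuning an
# abstractive summarizer: source = question + ranked snippets,
# target = ideal answer.
import json

def make_pairs(bioasq_json_path):
    with open(bioasq_json_path) as f:
        data = json.load(f)
    pairs = []
    for q in data['questions']:
        snippets = [s['text'] for s in q.get('snippets', [])]
        ideal = q.get('ideal_answer')
        if not ideal:
            continue
        # ideal_answer may be a list of reference answers or a string
        target = ideal[0] if isinstance(ideal, list) else ideal
        # the snippets here would be in ranked order from the ranking module
        pairs.append((q['body'] + ' ' + ' '.join(snippets), target))
    return pairs
```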

3 Experiments and Results

This section describes the experiments conducted to evaluate each of the components in the question answering pipeline. All the experiments have been conducted on batch 2 of the fifth edition of BioASQ and evaluated using the official Oracle2 developed by the organizers of the task.

3.1 Ontology-Based Information Retrieval

Ontology-based retrieval has been evaluated for providing summary answers to queries with zero snippets provided. For example, the snippets retrieved for the question 'Does metformin interfere thyroxine absorption?' obtain a ROUGE score of 0.2044 against the ideal answer provided for it.

3.2 Ranking

3.2.1 RankSVM Feature Analysis

Ablation studies were carried out with respect to the features to determine which set of features gives the best results. The statistical features contributed the most to the model. Another interesting observation from the feature-contribution plot (Figure 7) is that even though BM25 and log(BM25) were the top contributing features, log(BM25) has the higher weight. This is mainly because the log scale is more stable and therefore helps the model learn better.

3.2.2 Neural Ranking Approach Analysis

The neural approach was evaluated against the RankSVM results. We also added the syntactic features and carried out a comparison study with them.

2 BioASQ Oracle

From Table 1 we see that adding the syntactic features contributed to an overall increase in ROUGE scores for both models. We also notice that the neural model performed better than RankSVM. This is mostly because the neural approach estimates the BM25 scores between the snippet and the ideal answer rather than trying to mimic the quasi ranking. From the discussion of the results above, we can confirm the hypothesis that ranking snippets in order of relevance helps improve the quality of the answers generated by the pipeline.

3.3 Summarization

Ranked snippet sentences from the ranking pipeline are fed into the summarization module. The following experiments were carried out:

1. Using the PGC network pretrained on CNN/Daily Mail to generate the ideal answers

2. Fine-tuning the pretrained PGC network on BioASQ data

Table 1 gives the ROUGE scores obtained by both models on the BioASQ dataset. For fine-tuning, the pretrained model is further trained on the BioASQ 5b training data. We see that the fine-tuned model obtains much higher ROUGE-2 and ROUGE-SU4 scores than the pretrained model, i.e., the fine-tuned model generates better answers in terms of ROUGE score. On closely analyzing the answers generated by the PGC models, we see that neither the pretrained nor the fine-tuned model generates any <UNK>s; the model is also able to effectively copy unknown words from the biomedical source text. A detailed error analysis of the answers generated by the model is given in Subsection 3.3.1.

3.3.1 Error Analysis

This subsection discusses the analysis of the ideal answers generated by both the pretrained and the fine-tuned models for different question types in the BioASQ dataset (Table 2). All readability judgments made in this subsection reflect the subjective judgments of the authors of this paper:

1. Yes/No type:


System | ROUGE-2 | ROUGE-SU4
RankSVM LETOR Framework:
BM25, Cosine Similarity, TF-IDF dot product | 0.6442 | 0.6453
BM25, Cosine Similarity, TF-IDF dot product, word2vec dot product | 0.6439 | 0.6460
BM25, Cosine Similarity, TF-IDF dot product, word2vec dot product, Snippet Length, log(sum(TF-IDF)) | 0.6471 | 0.6481
Neural LETOR Framework:
Neural Approach | 0.6565 | 0.6552
Neural Approach + MER Bag-of-words | 0.6590 | 0.6581
Abstractive Summarization:
Pretrained PGC | 0.29 | 0.30
Fine-tuned PGC | 0.4640 | 0.4659

Table 1: Results on 5b batch 2

Example 1 (Yes/No type). Question: Do proton pump inhibitors affect thyroxine absorption?
• Ideal answer: Proton-pump inhibitors, antacids and a long list of drugs may decrease thyroxine absorption. Patients with hypothyroidism and normal TSH values during LT4 replacement therapy may need additional thyroid function testing after treatment with PPIs and may need adjustment of their LT4 dose.
• Pre-trained model answer: Ppis and a long list of drugs may decrease thyroxine absorption. Many commonly used drugs, such as bile acid sequestrants, ferrous, phosphate binders, raloxifene and proton-pump inhibitors, have also been shown to interfere with the absorption of levothyroxine. Patients with hypothyroidism and normal tsh values during short-term treatment.
• Fine-tuned model answer: Antacids inhibitors, antacids and a long list of drugs may decrease thyroxine absorption. Many commonly used drugs, such as bile acid sequestrants, ferrous sulphate, calcium carbonate, aluminium-containing antacids, phosphate binders, raloxifene and proton-pump inhibitors, have also been shown to interfere with the absorption of levothyroxine. Pantoprazole did not influence endocrine function in healthy male volunteers during short-term treatment.

Example 2 (Factoid type). Question: What is the ubiquitin proteome?
• Ideal answer: The ubiquitin proteome is the entire set ubiquitinated proteins and of their respective ubiquitination sites.
• Pre-trained model answer: Tandem repeated ubiquitin binding entities tubes under non-denaturing conditions followed by mass spectrometry analysis to study global ubiquitylation events that may lead to the identification of potential drug targets. To study the ubiquitin proteome we have established an immunoaffinity purification method for the proteomic analysis of endogenously ubiquitinated protein complexes.
• Fine-tuned model answer: Mass spectrometry now allows high throughput approaches for the identification of the thousands of ubiquitinated proteins and of their ubiquitination sites. We used tandem repeated ubiquitin binding entities tubes under non-denaturing conditions followed by mass spectrometry analysis to study global ubiquitylation events that may lead to the identification of potential drug targets.

Example 3 (List type). Question: What memory problems are reported in the "Gulf war syndrome"?
• Ideal answer: Loss of memory and dysmnesia are memory problems reported in the Gulf war syndrome. Patients suffering from this syndrome often have other nonspecific symptoms such as fatigue, skin rash, headache, muscle and joint pain and sexual dysfunction.
• Pre-trained model answer: The strongest associations were for mood swings or 20.9. The symptoms include fincapacitating fatigue, musculoskeletel and joint pains, headaches, muscle and joint pain. His history revealed several front-line operations in 1990 and 1991 during the gulf war.
• Fine-tuned model answer: The gulf war syndrome gws is the beginning of what was to be identified as the gulf war syndrome. In early 1992, u.s. troops returning from the gulf war began reporting a variety of nonspecific symptoms such as fatigue, skin rash, muscle and joint pain, and loss of memory.

Example 4 (Summary type). Question: What is the treatment of acute pericarditis?
• Ideal answer: A multidisciplinary approach is frequently necessary to treat acute pericarditis; the most frequent treatments are: antiinflammatory steroid and non-steroid drugs, antibiotic therapy, pericardial drainage and, less frequently, intrapericardial irrigation of fibrinolytics; antituberculous chemotherapy in presence of Tuberculous Agent.
• Pre-trained model answer: The incidence of purulent pericarditis has decreased since the antibiotic era. it is typically an acute and potentially lethal disease, necessitating rapid diagnosis and adequate therapy to improve prognosis. standard treatment combines appropriate antibiotic therapy with surgical drainage.
• Fine-tuned model answer: Purulent pericarditis has decreased considerably since the antibiotic era. It is typically an acute and potentially lethal disease, necessitating rapid diagnosis and adequate therapy to improve prognosis. Standard treatment combines appropriate antibiotic therapy with surgical drainage.

Example 5 (Summary type). Question: What causes erucism?
• Ideal answer: Erucism is defined as urtication by Lepidoptera larvae. It is a skin reaction to envenomation from certain poisonous caterpillar bristles. The hair on the dorsum of the last instar larvae of the moth may cause urticarial reactions (erucism) as well as eye problems and temporary blindness.
• Pre-trained model answer: Tropical insects carry infectious diseases, cutaneous injury can occur by other mechanisms. the hair on the dorsum of the moth may cause urticarial reactions. erucism is defined as urtication by lepidoptera larvae. erucism is defined as urtication by lepidoptera larvae.
• Fine-tuned model answer: Tropical insects carry infectious diseases, cutaneous injury can occur by other mechanisms. The hair on the dorsum of the last instar larvae of the moth may cause urticarial reactions. Urticating is defined as urtication by lepidoptera larvae. erucism is defined as urtication by lepidoptera larvae.

Table 2: Examples of error types observed in the qualitative analysis

Here, we see that the model-generated answers address the question and also give extra facts that are not described in the ideal answer but pertain to the question. The answer given by the fine-tuned model seems more complete than that of the pretrained model, as it mentions that antacids (which contain PPIs) decrease thyroxine absorption and also that they interfere with a specific type of thyroxine, namely levothyroxine.

2. Factoid type:

Here, the generated summaries miss the answer, as the exact answer is not present even in the snippets. The fine-tuned model, on the other hand, generates a more readable answer than the pretrained model. For most of the other factoid questions, we saw that the question was answered correctly, but the answer describes the facts very differently and also gives different extra facts compared to the ideal answer. Another observation was that the presence of several acronyms and abbreviations reduced the ROUGE score.

3. List type: Here, we see that the answer generated by the fine-tuned model is more readable, as there is a seamless flow: the answer starts off by explaining what the Gulf War Syndrome is and later goes on to list the problems reported in it. We also notice that the pretrained model misses the symptom 'loss of memory' mentioned in the ideal answer, which is picked up by the fine-tuned model.

4. Summary type: Here, the generated answers partially answer the question, as both of them mention the surgical drainage treatment, which is a superset of the pericardial drainage treatment. Other treatment types are, however, missed by both models in their answers. This is mainly because there is no direct mapping between the ideal answer and the snippets.

5. Summary type: Another error which occasionally surfaces is repetition. In the pretrained model answer, we see that the last sentence is repeated verbatim. In the fine-tuned model answer there is no repetition of an entire sentence: although a majority of the last sentence is repeated, the term 'urtication' is first defined in terms of its own verb form, and urtication is later used to define erucism.

Test Batch | System | ROUGE-2 | ROUGE-SU4
Batch 1 | RankSVM | 0.6372 | 0.6456
Batch 3 | Neural Ranking + Extractive Summarization | 0.5743 | 0.5883
Batch 4 | Neural Ranking + Abstractive Summarization | 0.4183 | 0.4281
Batch 5 | Relation Extraction + Ontology + Neural Ranking + Abstractive Summarization | 0.4573 | 0.4637

Table 3: BioASQ Results for Task 6B

Table 3 comprises our results over the test batches of the sixth edition. We believe the model gave the best ROUGE scores for test batch 3 because, as seen in the previous section, the neural model along with domain-specific features performed the best among all the models.

4 Conclusion and Future Work

This paper discusses our system for summary-type answer generation using a knowledge graph and a neural learning-to-rank approach. The ranked snippets are further used to generate the answers using extractive and abstractive summarization techniques. We also show that we can transfer abstractive summarization knowledge from the CNN/Daily Mail summarization task to the task of biomedical summarization.

From a brief manual inspection of the generated summaries and their relevant documents, we believe that, from an NLP standpoint, the following are some promising directions to explore. Anaphora resolution would help provide better relations. We also plan to use the ontology indexing and retrieval system for factoid and list type questions. Incorporating the question type as contextual information while generating summaries could improve precision. Instead of a dual step of transfer learning, with training and fine-tuning on the PGC network, the abstracts of the PubMed articles and the entire documents could potentially be leveraged to train the end-to-end network.

As an extension, we intend to pursue the following tasks for BioASQ:

• Experimentation with other diversification algorithms, such as xQuAD (Santos et al., 2010) and PM-2 (Dang and Croft, 2012), for sentence selection; the current pipeline uses the MMR algorithm while selecting sentences.

• Exploration of more language-model-based features in the LETOR pipeline, such as Pointwise Mutual Information.

References

Asma Ben Abacha and Pierre Zweigenbaum. 2015. MEANS: A medical question-answering system combining NLP techniques and semantic web technologies. Information Processing & Management, 51(5):570–594.

Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text summarization techniques: A brief survey. arXiv preprint arXiv:1707.02268.

Alan R. Aronson. 2001. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium, page 17. American Medical Informatics Association.

Olivier Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl_1):D267–D270.

Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, and Hsiao-Wuen Hon. 2006. Adapting ranking SVM to document retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 186–193. ACM.

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136. ACM.

Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. ACM.

Khyathi Chandu, Aakanksha Naik, Aditya Chandrasekar, Zi Yang, Niloy Gupta, and Eric Nyberg. 2017. Tackling biomedical text summarization: OAQA at BioASQ 5B. BioNLP 2017, pages 58–66.

Van Dang and W. Bruce Croft. 2012. Diversity by proportionality: An election-based approach to search result diversification. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '12, pages 65–74, New York, NY, USA. ACM.

Catherine Duclos, Jérome Nobécourt, Gian Luigi Cartolano, Anis Ellini, and Alain Venot. 2007. An ontology of bacteria to help physicians to compare antibacterial spectra. In AMIA Annual Symposium Proceedings, volume 2007, page 196. American Medical Informatics Association.

Tadayoshi Hara, Yusuke Miyao, and Jun-ichi Tsujii. 2010. Evaluating the impact of re-training a lexical disambiguation model on domain adaptation of an HPSG parser. In Trends in Parsing Technology, pages 257–275. Springer.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Yusuke Miyao, Rune Sætre, Kenji Sagae, Takuya Matsuzaki, and Jun'ichi Tsujii. 2008. Task-oriented evaluation of syntactic parsers and their representations. Proceedings of ACL-08: HLT, pages 46–54.

NIH. 2018. National Library of Medicine.

Tao Qin, Tie-Yan Liu, Jun Xu, and Hang Li. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346–374.

Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389.

Rodrygo L. T. Santos, Jie Peng, Craig Macdonald, and Iadh Ounis. 2010. Explicit search result diversification through sub-queries. In European Conference on Information Retrieval, pages 87–99. Springer.

Helmut Schmid. 1994. Probabilistic POS tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

Helmut Schmid. 1995. TreeTagger: a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 43:28.

Lynn Marie Schriml, Cesar Arze, Suvarna Nadendla, Yu-Wei Wayne Chang, Mark Mazaitis, Victor Felix, Gang Feng, and Warren Alden Kibbe. 2011. Disease Ontology: a backbone for disease semantic integration. Nucleic Acids Research, 40(D1):D940–D946.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.

George Tsatsaronis, Michael Schroeder, Georgios Paliouras, Yannis Almirantis, Ion Androutsopoulos, Eric Gaussier, Patrick Gallinari, Thierry Artieres, Michael R. Alvers, Matthias Zschunke, et al. 2012. BioASQ: A challenge on large-scale biomedical semantic indexing and question answering. In AAAI Fall Symposium: Information Retrieval and Knowledge Discovery in Biomedical Text.

Jim Webber. 2012. A programmatic introduction to Neo4j. In Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, pages 217–218. ACM.

Scott Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base.

Klaus Zechner. 2002. Automatic summarization of open-domain multiparty dialogues in diverse genres. Computational Linguistics, 28(4):447–485.

Qile Zhu, Xiaolin Li, and Ana Conesa. GRAM-CNN: a deep learning approach with local context for named entity recognition in biomedical text. Bioinformatics, page btx815.


Supplementary Material

For the example provided in Section 2.1.1, the Predicate Argument Structure produced by Enju is shown in Figure 5, which forms the following relations: 'Microarrays assess Timing', 'Assess in variety' and 'variety of organisms'.

Concept | CUI
Microarray | C1709016
Genomic | C0887950
DNA Replication Timing | C1257780
Eukaryotic Organism | C0684063

Table 4: CUI Mapping

Figure 6 shows the node structure that is built from the CUIs for 'Genomic Microarray'.

Figure 6: CUI node structure for Genomic Microarray

The efficacy of the Relation Extraction module depends on the tools it utilizes. The Enju parser trained on the GENIA corpus has an F-score of 90.15 on that corpus (Hara et al., 2010). GRAM-CNN has an F1-score of 87.26% on the Biocreative II dataset, 87.26% on the NCBI dataset and 72.57% on the JNLPBA dataset (Zhu et al.). In addition to affecting the hierarchical node structure, these tool errors can also lead to the existence of incorrect nodes and edges in the ontology. Every node is associated with all abstracts containing a mention of it. This means the retrieval logic may return all abstracts containing a mere mention of the medical/named entity present in a query, rather than only the abstracts relevant to that particular query. The Ranking module filters these abstracts to obtain those most relevant to the query.

We also plotted the top contributing features for our RankSVM model. Figure 7 displays the contribution of the top six features used in the model; the factors by which they contributed were normalized before plotting.

Figure 7: RankSVM Top Contributing Features


Figure 5: Predicate Argument Structure