
Applying Natural Language Inference or Question Entailment for Crowdsourcing More Data

Jingjing Shi

A report submitted for the course COMP8755 Individual Project

Supervised by: Zhenchang Xing

The Australian National University

June 2020

© Jingjing Shi 2020

Except where otherwise indicated, this thesis is my own original work.

Jingjing Shi
12 June 2020

Acknowledgments

• Foremost, I would like to express my sincere gratitude to Dr Zhenchang Xing and Mr Vincent Nguyen for their patient guidance, continuous support and valuable help during the past two semesters. You broadened my knowledge of natural language processing and helped me to achieve a deeper understanding.

• I would also like to thank Prof. Weifa Liang, the convener of this course, for providing significant help throughout this year with the presentations and thesis writing.

• Last but not least, I would like to extend my thanks to my parents and friends. Without your encouragement, patience and support, I would not be the one who finishes this work. You all are the best.


Abstract

Question answering (QA) is widely investigated in the open domain. However, in specific domains such as patient question answering it is less studied. Several challenges exist in the data sets of this domain, such as the lack of high-quality data sets, limited annotated data, and content that is not close to reality. To address some of these issues, we propose a novel way to crowdsource data for this particular domain.

In our approach, the data we use are from Health24, an interactive platform through which patients can ask their queries and experts can answer them accordingly. However, not every question is answered, so our goal is to find answers among the existing answers for these unanswered questions. Basically, we use recognizing question entailment (RQE) to locate similar (answered) questions for the unanswered questions. To find the relation between answered and unanswered questions, intuitively we can pair one unanswered question with all the answered questions; however, this generates a huge number of question pairs. To reduce the running time and increase efficiency, we use three different similarity score methods (TF-IDF, BM25 and Word2Vec) to roughly filter these pairs before using the RQE model. Through this step, some dissimilar pairs are removed. Then we use natural language inference (NLI) to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions. Because we do not have ground truth to evaluate the generated data, we use the idea of pseudo-relevance feedback. The results show that our proposed method is feasible for RQE with TF-IDF and RQE with BM25. However, this method is limited on the NLI task due to the small size of the data and the small value of k in k-fold cross-validation.


Contents

Acknowledgments

Abstract

1 Introduction
    1.1 Motivation
    1.2 Main Contribution
    1.3 Outline

2 Background
    2.1 Background on NLP
        2.1.1 Neural Network Language Model
        2.1.2 RNN and LSTM
        2.1.3 Transformer
    2.2 Word Embedding
    2.3 Static Word Embedding
        2.3.1 Word2Vec
        2.3.2 GloVe
    2.4 Dynamic Word Embedding
        2.4.1 Pre-training
        2.4.2 ELMo
        2.4.3 BERT
    2.5 Tasks
        2.5.1 Recognizing Question Entailment (RQE)
        2.5.2 Natural Language Inference (NLI)

3 Related Work
    3.1 RQE
    3.2 NLI
    3.3 Augment Data Size

4 Construction of Health24 Data set
    4.1 Motivation
    4.2 Collecting Data set
    4.3 Analysis for Data set

5 Methodology
    5.1 Pre-processing
    5.2 Fine-Tuning
    5.3 Method
    5.4 Similarity Score Methods
        5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)
        5.4.2 Best Matching 25 (BM25)
        5.4.3 Word2Vec
    5.5 Evaluation Metrics

6 Discussion on Results

7 Conclusion and Future Work
    7.1 Conclusion
    7.2 Future Work

Bibliography

Appendices

Appendix 1 Project Description and Contract

Appendix 2 Artefact Description

Appendix 3 Readme File

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are input layer, projection layer, hidden layer and output layer [Bengio et al. 2003]

2.2 A single memory cell [Graves et al. 2013]

2.3 The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017]

2.4 Two different models that could be used as part of Word2Vec; the left hand side is CBOW and the right hand side is Skip-gram [Mikolov et al. 2013]

2.5 The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014]

2.6 ELMo is formed by a two-layer bidirectional language model; the word representation is obtained by summing the intermediate word vectors with particular weights [Josht]

5.1 The diagram shows the process of how the RQE and NLI tasks work

1 The tree structure of the health24 file

List of Tables

4.1 The length of questions and answers

5.1 Table for the confusion matrix

6.1 Shows the accuracy of different methods for the RQE task

6.2 Shows the added data number and p-values for the RQE task

6.3 Shows the accuracy of different methods for the NLI task

6.4 Shows the added data number and p-values for the NLI task

6.5 Shows the standard deviation of loss for four methods

Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem caused by small medical data sets. In Section 1.2 the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which the system automatically provides a set of correct answers to the question asked by users in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions are the second most searched thematic area in Google, and the topic took 5% of all searches in 2016 [Cocco et al. 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center¹ showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data sets, limited annotated data, and whether the content is close to reality. There are some existing data sets related to this topic, but they all have limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al. 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], whereas the data consist of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al. 2016] consists of more than 100,000 questions crowdsourced over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific large-scale synthetic data set in the domain of electronic medical records [Pampari et al. 2018]. However, synthetic data sets perform weaker than handcrafted data sets [Nguyen 2019].

¹ Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013/


On the other hand, the size of medical data sets is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away for ethical reasons and obligatory agreements. Cost constraints and the lack of domain experts for annotation are other reasons behind small data sets [Nguyen et al. 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting and runs into the Out-Of-Vocabulary (OOV) problem [Luong et al. 2014]. The OOV problem arises when an unseen token is encountered in a fixed-vocabulary language model but the model cannot handle it appropriately [Nguyen et al. 2019]. In NLP tasks there is a high incidence of unseen words; if the data set is small, the model cannot generalize well to them, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach differs from older models, which apply a task-specific architecture to task-specific labeled data. Transfer learning also outperforms the older models, since pre-training can support better generalization [Erhan et al. 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al. 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide the existing answers automatically if the original questions are similar to answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use a fine-tuned RQE model to check whether the answered questions are similar to the unanswered questions in Health24. Then we use a fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines whether the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using Consumer Health Questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Thresholds set on the output of the fine-tuned RQE and NLI models when using them to locate the similar questions and validate the answers. If the maximum probability is lower than this threshold, the pair is rejected; otherwise we accept it. This helps us to deal with ambiguous cases and reduce the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide some background on different word embedding models and introduce the BERT model.

In Chapter 3, related work on RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. Evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7, we conclude by summarizing the achievements as well as the limitations, and show the future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a basic idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data; in other words, NLP is about how a computer can understand or model human language. NLP is a broad area with many branches. The most commonly researched tasks are syntax, semantics, discourse, speech and dialogue; for this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al. 2003]. The model is composed of four layers: input, projection, hidden and output layers respectively. At the input layer, the N previous words are encoded using one-hot encoding; each word has V dimensions, where V is the size of the vocabulary. The input layer is projected into the projection layer by using a shared projection matrix of size $V \times m$, where m is the dimension of the feature vector associated with each word. After projection, each word is a row vector of dimensionality m. This is shown as $C(w_{t-n+1}), \ldots, C(w_{t-2})$ and $C(w_{t-1})$ in Figure 2.1.


Figure 2.1: NNLM consists of four layers; from bottom to top they are input layer, projection layer, hidden layer and output layer [Bengio et al. 2003]

As for the hidden layer, it is the concatenation of the word features from the previous layer:

$$x = \left( C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}) \right)$$

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

The vector y holds the unnormalized log-probabilities for each output word i:

$$y = b + Wx + U \tanh(d + Hx)$$

where b and d are the biases and H is the hidden layer weight [Bengio et al. 2003].
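To make the forward pass concrete, here is a minimal NumPy sketch of the NNLM computation above; the layer sizes and random weights are illustrative assumptions, not values from Bengio et al. [2003].

```python
# A minimal sketch of the NNLM forward pass, assuming toy sizes.
import numpy as np

V, m, n, h = 1000, 50, 4, 128    # vocab size, feature dim, context length, hidden units (assumed)
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))      # shared projection matrix (word feature vectors)
H = rng.normal(size=(h, n * m))  # hidden layer weights
d = np.zeros(h)                  # hidden layer bias
U = rng.normal(size=(V, h))      # hidden-to-output weights
W = rng.normal(size=(V, n * m))  # direct projection-to-output weights
b = np.zeros(V)                  # output bias

def nnlm_forward(context_ids):
    """P(w_t | w_{t-1}, ..., w_{t-n+1}) over the whole vocabulary."""
    x = C[context_ids].reshape(-1)           # concatenate the n feature vectors
    y = b + W @ x + U @ np.tanh(d + H @ x)   # unnormalized log-probabilities
    e = np.exp(y - y.max())                  # numerically stabilized softmax
    return e / e.sum()

p = nnlm_forward([3, 17, 42, 7])  # hypothetical previous-word indices
print(p.shape, p.sum())           # (1000,) 1.0
```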


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons' outputs at time-step t depend on the output at time-step t−1. In other words, at each step the output of the current time-step depends on the previous one. An RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has a tendency for the gradient to vanish or explode over many steps. The reason is that an RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplication of derivatives, and if the values of those derivatives are too small or too large, the continuous multiplication may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units, called memory cells, to store information for a long time. They connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$

where σ is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate and cell activation vectors, represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so the input from element m can only be received by element m in each gate vector [Graves et al. 2013].

Figure 2.2: A single memory cell [Graves et al. 2013]

Within a memory cell, LSTM can be divided into four steps (a code sketch follows this list):

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, $i_t$ determines what information we will update; then a new candidate $\tilde{c}_t$ is created to be added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$ and the candidate state; namely, computing $c_t = f_t c_{t-1} + i_t \tilde{c}_t$.

• As for the output gate, a sigmoid function determines what part of the cell state will be outputted. Then a tanh function is applied to output the decided part.
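The following is a minimal NumPy sketch of one LSTM time-step implementing the composite functions above; the dimensions and random weights are illustrative assumptions. Because the cell-to-gate matrices are diagonal, the peephole connections are kept as element-wise vectors.

```python
# One LSTM time-step following the Graves et al. [2013] equations (toy sizes assumed).
import numpy as np

d_in, d_h = 10, 20
rng = np.random.default_rng(1)
def mat(r, c): return rng.normal(scale=0.1, size=(r, c))

Wxi, Whi, Wxf, Whf = mat(d_h, d_in), mat(d_h, d_h), mat(d_h, d_in), mat(d_h, d_h)
Wxc, Whc, Wxo, Who = mat(d_h, d_in), mat(d_h, d_h), mat(d_h, d_in), mat(d_h, d_h)
wci, wcf, wco = (rng.normal(size=d_h) for _ in range(3))  # diagonal cell-to-gate weights
bi = bf = bc = bo = np.zeros(d_h)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(Wxi @ x_t + Whi @ h_prev + wci * c_prev + bi)        # input gate
    f_t = sigmoid(Wxf @ x_t + Whf @ h_prev + wcf * c_prev + bf)        # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(Wxc @ x_t + Whc @ h_prev + bc)  # cell update
    o_t = sigmoid(Wxo @ x_t + Who @ h_prev + wco * c_t + bo)           # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c)
```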

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. The decoder has the same structure as the encoder, with a stack of N = 6 identical layers; instead of the two sub-layers in each encoder layer, the decoder has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al. 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017]

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying learned weights with the input embedding. The attention function is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$

$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

with $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. The benefit is that it allows the model to learn subspace representations and helps the model focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model does not use recurrence or convolution.
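Below is a small NumPy sketch of the scaled dot-product attention defined above; the toy shapes and random projections are assumptions for illustration.

```python
# Scaled dot-product attention over one sequence of token embeddings.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relatedness between tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # softmax over the keys
    return w @ V                                     # weighted sum of values

rng = np.random.default_rng(2)
seq_len, d_model = 5, 64                             # assumed toy sizes
x = rng.normal(size=(seq_len, d_model))              # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = attention(x @ Wq, x @ Wk, x @ Wv)              # shape (5, 64)
```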

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters. However, computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation can impact model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by computing word embeddings for their own data sets. Word embeddings can be separated into two groups: static word embeddings and dynamic word embeddings. In the remaining part of this chapter we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

A static word embedding generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two architectures that can be used as part of Word2Vec to learn word embeddings are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words given the current word. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].

However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected into the same position. Also, at the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both models are shown in Figure 2.4.

Figure 2.4: Two different models that could be used as part of Word2Vec. The left hand side is CBOW and the right hand side is Skip-gram [Mikolov et al. 2013]
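As a usage illustration, the sketch below (with an assumed toy corpus) shows how a single flag in Gensim's Word2Vec switches between the two architectures; this is not the embedding training used later in the thesis, which relies on pre-trained biomedical vectors.

```python
# CBOW vs. Skip-gram in Gensim; the two-sentence corpus is a toy assumption.
from gensim.models import Word2Vec

sentences = [["patient", "has", "allergy"],
             ["doctor", "answers", "patient", "question"]]

cbow = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=0)  # CBOW
skip = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)  # Skip-gram

print(cbow.wv["patient"].shape)                 # (200,)
print(skip.wv.most_similar("patient", topn=2))  # nearest neighbours in the toy space
```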

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures global corpus statistics directly.

The main intuition of GloVe comes from the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word j occurs in the context of word i within the specific context windows. $X_i$ is the number of times any word occurs in the context of word i, namely $\sum_k X_{ik}$. Besides, $P_{ij} = P(j \mid i) = X_{ij}/X_i$ is the probability that word j appears in the context of word i.

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014]

In Figure 2.5, suppose i = ice and j = steam. We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model with the cost function

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where V is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i \in \mathbb{R}^d$ is a word vector, $\tilde{w}_j \in \mathbb{R}^d$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another word, namely $X_{ij} = 0$. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max}$ is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
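A minimal Python sketch of the weighting function and a single (i, j) term of the cost, assuming $x_{max}$ = 100 and α = 3/4 as above:

```python
# GloVe weighting function and one term of the weighted least squares cost.
import numpy as np

def weight(x, xmax=100.0, alpha=0.75):
    """f(x): caps the influence of very frequent co-occurrences at 1."""
    return (x / xmax) ** alpha if x < xmax else 1.0

def cost_term(w_i, w_tilde_j, b_i, b_tilde_j, X_ij):
    """f(X_ij) * (w_i^T w~_j + b_i + b~_j - log X_ij)^2 for one word pair."""
    return weight(X_ij) * (w_i @ w_tilde_j + b_i + b_tilde_j - np.log(X_ij)) ** 2

rng = np.random.default_rng(0)
w_i, w_tilde_j = rng.normal(size=50), rng.normal(size=50)  # assumed 50-dim vectors
print(cost_term(w_i, w_tilde_j, 0.0, 0.0, X_ij=42.0))
```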


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. As a result, the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend going from static to dynamic word embedding, which considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we introduce two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly discussed, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers, since complicated functions can represent high-level abstractions [Erhan et al. 2010]. Erhan et al. [2010] also suggest that standard training schemes (random initialization) tend to give the parameters of deep architectures poor generalization, while the improvement from a pre-training strategy is impressive for deep models. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In the case of NLP, the basic idea of pre-training is that the language model learns contextualized text representations from a large corpus of labeled or unlabeled data. The pre-trained model is then fine-tuned on downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: the model needs to consider the complex characteristics of word use, and words have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, achieving up to 20% error reduction compared with other state-of-the-art models.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word; in other words, each token is assigned a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), hence the name "embeddings from language models". The two layers are stacked together, and each of them has two passes, forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word. The backward pass is similar, except that the state contains information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector, which represents the word's meaning but also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, those forward passes and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layer; basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, there are 2L + 1 representations in an L-layer biLM:

$$R_k = \left\{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \;\middle|\; j = 1, \ldots, L \right\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R_k$ into a single vector based on some particular weights:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = \left[ \overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM} \right]$.

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information; the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it computes attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To mitigate this, BERT replaces the selected token with [MASK] 80% of the time, with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains for a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this relationship is hard to capture from language modeling directly without modeling the meaning of the two sentences together. Thanks to NSP, the pre-trained model is very beneficial for these tasks. This pre-trained BERT model is applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
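As an illustration of how a sentence pair is packed into one BERT input sequence with [CLS] and [SEP], the sketch below uses the huggingface tokenizer; the model name and the example questions are assumptions for illustration.

```python
# Encoding a question pair the way BERT expects (assumed model and sentences).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Is allergex safe during pregnancy?",
                "Can I take allergex while pregnant?")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'is', ..., '[SEP]', 'can', ..., '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0s for sentence A, 1s for sentence B
```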

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. In plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set.

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes type 2. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us to find the similar questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification problem with three classes: entailment, neutral and contradiction. Here are three examples from our data set.

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3, we demonstrate some work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; such applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), all targeted at the medical domain. In this section and the next, we discuss some RQE and NLI work from the competition teams.

The initial work on RQE used lexical and semantic features as the input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to find the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight was that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combined ensemble methods and multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] found the results did not compare well with pre-trained language models. They argue one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map the words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also found that among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism built on BERT models. So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap and the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain came when Romanov and Shivade [2018] introduced MedNLI, a new public data set that reduces the gap for specialized and knowledge-intensive domains. They used a feature-based system, a Bag-of-Words (BOW) model, an InferSent model and an ESIM model to establish the baseline; among them, InferSent performed best. In the competition, as with the RQE task, the common approach was to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder, they concatenate the results of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general, one expert is not enough, since single-expert results are sometimes too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data sets using weak supervision methods, and the results show that it plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improves an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected, and we analyze the collected data set from different perspectives, such as the length and the content.

4.1 Motivation

There is little research in the area of patient question answering, and the existing medical query data sets have some limitations. For example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that can reflect real-world biomedical patient question answering, we think the data set should have the following features.

• The data source must be reliable, since it sets the standard in this domain and makes the work easier to convince others. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24¹ is a good choice for construction. This is South Africa's leading consumer health website. It can provide

¹ Health24: https://www.health24.com


world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients that helps them get their questions answered. These questions and answers are both public and are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.

² Scrapy: https://scrapy.org
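A hedged sketch of the kind of Scrapy spider used for such collection is shown below; the start URL and the CSS selectors are hypothetical, since the real page structure of Health24 is not documented in this thesis.

```python
# Hypothetical spider outline for crawling expert Q&A pages (selectors assumed).
import scrapy

class Health24Spider(scrapy.Spider):
    name = "health24_qa"
    start_urls = ["https://www.health24.com/experts/allergies"]  # hypothetical listing page

    def parse(self, response):
        for post in response.css("div.qa-post"):                 # hypothetical selector
            yield {
                "question": post.css("div.question::text").get(),
                "answer": post.css("div.expert-answer::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()     # follow pagination
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```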

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see the average lengths of the three different types are close together, all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we checked the original website, we found some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered, since the experts' information cannot solve the patients' queries.



                        Average Length    Maximum Length
Answered Questions      95.40             5417
Unanswered Questions    85.02             1890
Answers                 92.58             6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences, represented either by abbreviations or by multi-word expressions that stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase, then remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words, and we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a more suitable form for the following tasks.
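A minimal sketch of this pre-processing pipeline, assuming NLTK for tokenization, stop words and stemming (the thesis does not name the exact toolkit):

```python
# Lowercase -> strip punctuation -> tokenize -> remove stop words -> stem.
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download("punkt"); nltk.download("stopwords")  # one-time setup
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question: str) -> list:
    text = question.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("Is it safe to take Allergex while pregnant?"))
# ['safe', 'take', 'allergex', 'pregnant']
```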

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so when doing a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1,395 and 1,422 pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface¹.

¹ Huggingface GitHub: https://github.com/huggingface/transformers
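A condensed sketch of such a fine-tuning loop with the reported hyper-parameters is shown below; the SciBERT checkpoint name, the toy data and the loop structure are assumptions for illustration, and the actual implementation is the modified run_classifier.py.

```python
# Fine-tuning sketch: lr 2e-5 and gradient accumulation 4 as reported above.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 2 labels for RQE, 3 for NLI
optimizer = AdamW(model.parameters(), lr=2e-5)
accumulation_steps = 4

# Toy question pair standing in for the combined CHQs set (hypothetical example).
train_pairs = [("Is allergex safe in pregnancy?",
                "Can I take allergex while pregnant?", 1)]

model.train()
for step, (q1, q2, label) in enumerate(train_pairs):
    batch = tokenizer(q1, q2, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss
    (loss / accumulation_steps).backward()             # accumulate scaled gradients
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```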

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain: under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is plausibly similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions and only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under the following assumption: if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the absence of experts, the rejected data are removed from the data set.
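A small sketch of this rejection rule, assuming the model outputs raw logits:

```python
# Keep a generated label only when the max class probability clears the threshold.
import torch

def label_or_reject(logits: torch.Tensor, threshold: float):
    """Return the predicted class index, or None to reject the pair."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return int(pred) if conf.item() >= threshold else None

print(label_or_reject(torch.tensor([0.2, 0.9]), threshold=0.6))         # accepted: 1
print(label_or_reject(torch.tensor([0.10, 0.15, 0.12]), threshold=0.4)) # rejected: None
```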

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document; the other is inverse document frequency, which measures how much information the word provides (a result close to 0 means it is a common word, and vice versa). These two concepts form TF-IDF; here is the formula:

\[ \mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \]

\[ \mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d)) \]

\[ \mathrm{IDF}(t, D) = \frac{N}{\mathrm{count}(d \in D : t \in d)} \]

where t is the word we want to weight, d is its document, D is the set of documents and N is the total number of documents. TF-IDF only weights a single word; to calculate the similarity between two sentences, we can use cosine similarity:

\[ \mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|} \]

where A and B are both vectors. In this case, each element of vectors A and B represents one word's TF-IDF.
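As a minimal illustration, scikit-learn bundles both steps. Note that its exact TF and IDF weighting differs slightly from the formulas above (it uses smoothed, normalised variants), and the example questions are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "What causes high blood pressure?",
    "Why is my blood pressure so high?",
    "How do I treat a sprained ankle?",
]

# Each row of the matrix is one question; each column is one word's TF-IDF.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(questions)

# Cosine similarity between the first question and all the others.
scores = cosine_similarity(tfidf_matrix[0], tfidf_matrix).flatten()
print(scores)  # the second question scores higher than the third
```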

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query. Here is the formula:

\[ \mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)} \]

where

• D is the document, Q is the query and q_i is the i-th term in the query Q;

• f(q_i, D) is the number of times q_i occurs in the document D;

• |D| is the length of the document;

• d_avg is the average length of a document;

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.
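A short self-contained sketch of this scoring function is given below; since the formula above does not pin down the exact IDF variant, the common 0.5-smoothed IDF is assumed here, and the example questions are invented.

```python
import math
from collections import Counter
from typing import List

def bm25_score(query: List[str], doc: List[str],
               corpus: List[List[str]],
               k1: float = 2.0, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    d_avg = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)  # document frequency
        # A common smoothed idf; an assumption, as the thesis formula
        # leaves the IDF variant unspecified.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

corpus = [q.lower().split() for q in [
    "what causes high blood pressure",
    "why is my blood pressure so high",
    "how do i treat a sprained ankle",
]]
query = "high blood pressure".split()
print([round(bm25_score(query, d, corpus), 3) for d in corpus])
```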

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this model, 2.2 million 200-dimensional word vectors were trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the embedding vectors of all its words together and then use cosine similarity to calculate the similarity between sentences.
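A minimal sketch of this sentence-embedding step with gensim is shown below; the embedding file name is an assumption, and any word2vec-format file would load the same way.

```python
import numpy as np
from gensim.models import KeyedVectors

# The file name for the Chiu et al. [2016] PubMed vectors is an
# assumption; any word2vec-format embedding file can be loaded here.
wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(sentence: str) -> np.ndarray:
    """Sum the vectors of all in-vocabulary words. Out-of-vocabulary
    words are skipped, which is exactly the information loss that
    Chapter 6 discusses for colloquial Health24 questions."""
    vecs = [wv[w] for w in sentence.lower().split() if w in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.dot(a, b) / denom) if denom else 0.0
```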

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP counts the instances correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN counts the instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of all relevant results that are correctly classified; its formula is TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)          TP                     FP
Predicted Negative (0)          FN                     TN

Table 5.1: The confusion matrix
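For illustration, all three measures are available in scikit-learn; note that scikit-learn orients the confusion matrix with rows as actual and columns as predicted, i.e. transposed relative to Table 5.1. The labels below are invented.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Rows = actual class, columns = predicted class (transposed vs Table 5.1).
print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))  # (TP + TN) / total
print(recall_score(y_true, y_pred))    # TP / (TP + FN)
```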


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper pipeline is RQE and the lower one is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potentially similar question) pair; step 4, pair the question pairs with the generated labels; step 5, add the labelled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different test set in each iteration. Below is the formula of the unpaired t-test:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \]

where

\[ s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2} \]
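For illustration, the same test can be computed with SciPy; this is a sketch using the Table 6.1 accuracies, not necessarily the exact script used in the project.

```python
from scipy import stats

# Five-fold accuracies for the original CHQs data and one augmented
# run; the values are copied from Table 6.1.
chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# equal_var=True gives the classic unpaired (Student's) t-test with
# the pooled variance s^2 defined above; note that SciPy reports a
# two-sided p-value.
t_stat, p_value = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
print(t_stat, p_value)
```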

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed, whose source comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions; thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. In order to keep the data balanced, only a small part of them can be selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of all relevant results that are correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task

           Added data count   p-value
TF-IDF     8084               0.0380
BM25       7390               0.0499
Word2Vec   1830               0.0514

Table 6.2: Added data counts and p-values for the RQE task

Table 6.4 shows the added data counts and p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest count is only 1287, whereas in the RQE task the added data exceeds 8000. There are two conditions under which we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. Here we use the smallest per-class count in the test results as the base for selecting added data (see the sketch below). For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of each class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class in NLI has fewer data.
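A minimal sketch of this downsampling, under the assumption that the examples and their labels are held in parallel Python lists:

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

def balance(examples: List[Tuple[str, str]], labels: List[int],
            seed: int = 0) -> List[Tuple[Tuple[str, str], int]]:
    """Downsample every class to the size of the smallest one, e.g.
    class counts of (10, 5, 15) for labels 0/1/2 all become 5."""
    by_label: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        for example in rng.sample(items, smallest):
            balanced.append((example, label))
    return balanced
```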

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods.

Combining it with Table 6.4, we find that the more data are added, the larger the value becomes. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287), so adding the extra data to the original data effectively introduces novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task

           Added data count   p-value
TF-IDF     1287               0.3340
BM25       1149               0.4191
Word2Vec   459                0.6661

Table 6.4: Added data counts and p-values for the NLI task


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: Standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited amount of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; it contains 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers among the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered ones. Intuitively, to find whether two questions are similar, we could pair each unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we evaluate with the idea of pseudo-relevance feedback. More specifically, we insert the constructed data into the official data and train the model again; if the accuracy increases, it reflects that the constructed data have high accuracy, and otherwise they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are several places that could be improved:

• For this project we only have 1,950 questions without answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit with a small amount of data. So, as a next step, we can expand our data; it would then be more persuasive to validate our idea.

• In the previous section we argued that the added data for the NLI task differ from the original NLI data. As a next step, we can find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we can optimize the way Word2Vec is used.

• Fine-tuning and label generation are done in Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. As a next step, we can do hyperparameter optimization; we can also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: folders which store the source data and the generated data, .py files, and .ipynb files.

For folders:
data includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy.1
raw_nli_data and raw_rqe_data store the source data.
result stores the labels generated by the RQE and NLI test models.

For .py files:
bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.2
rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.
convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model; this code is from run_classifier.py in huggingface.3
load_file.py converts the raw source data to a dataframe. pre_processed_data pre-processes the sentences.

For .ipynb files:
train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

Figure 1: The tree structure of the health24 folder

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py

.py files run in PyCharm; .ipynb files run in Google Colab.4

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.

Contents

Acknowledgments iii

Abstract v

1 Introduction 111 Motivation 112 Main Contribution 213 Outline 3

2 Background 521 Background on NLP 5

211 Neural Network Language Model 5212 RNN and LSTM 7213 Transformer 8

22 Word Embedding 1023 Static Word Embedding 10

231 Word2Vec 10232 GloVe 11

24 Dynamic Word Embedding 13241 Pre-training 13242 ELMO 13243 BERT 15

25 Tasks 16251 Recognizing question entailment (RQE) 16252 Natural Language Inference (NLI) 17

3 Related Work 1931 RQE 1932 NLI 2033 Augment Data Size 20

4 Construction of Health24 Data set 2341 Motivation 2342 Collecting Data set 2443 Analysis for Data set 24

vii

viii Contents

5 Methodology 2751 Pre-processing 2752 Fine-Tuning 2753 Method 2854 Similarity Scores Methods 29

541 Term Frequency-Inverse Document Frequency (TF-IDF) 29542 Best Matching 25 (BM25) 30543 Word2Vec 30

55 Evaluation Metrics 31

6 Discussion on Results 33

7 Conclusion and Future Work 3771 Conclusion 3772 Future Word 38

Bibliography 39

Appendices 43

Appendix 1 Project Description and Contract 45

Appendix 2 Artefact Description 49

Appendix 3 Readme File 51

List of Figures

21 NNLM consists of four layers from bottom to to they are input layerprojection layer hidden layer and output layer [Bengio et al 2003] 6

22 A single memory cell [Graves et al 2013] 823 A single memory cell consists of encoder (left part) and decoder (right

part) [Vaswani et al 2017] 924 Two different models that could be used as the part of Word2Vec The

left hand side is CBOW and the right hand side is Skip-gram [Mikolovet al 2013] 11

25 The measurement of co-occurrence probabilities from the selected con-text words and the target words ice and steam [Pennington et al 2014] 12

26 EMLo is formed by a two-layer bidirectional language model Theforward and the backward passes form a multilayer LSTM For eachLSTM it passes the result to the intermediate word vector The wordrepresentation can be get by summing all the intermediate word vec-tors with some particular weights [Josht] 14

51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 useCHQs data to fine-tune BERT model Step 2 use CHQs data to evaluatethe model by using k-fold cross-validation and record each accuracyStep 3 use fine-tuned BERT model to generate label for unansweredquestion and potential similar question Step 4 pair question pairswith generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracywith the previous one As for NLI it has the same process except theinput of testing is question-answer pairs 32

1 The tree structure of health24 file 50

ix

x LIST OF FIGURES

List of Tables

41 The length of questions and answers 25

51 Table for the confusion matrix 31

61 Shows the accuracy in different methods for RQE task 3462 Shows the added data number and p-values for RQE task 3463 Shows the accuracy in different methods for NLI task 3564 Shows the added data number and p-values for NLI task 3565 Shows the standard derivation of loss for four methods 36

xi

xii LIST OF TABLES

Chapter 1

Introduction

In this section motivation is be introduced in terms of different existing data setsand the overfitting problem based on small medical data set In Section 12 the maincontributions is briefly discussed following by the outline of the thesis

11 Motivation

Question answering (QA) is a downstream task in information retrieval that thesystem can automatically provide a set of correct answers to the question askedby the users in natural language [Lende and Raghuwanshi 2016] This topic iswidely researched in the open-domain Medical domain questions are the secondmost searched thematic area in Google and the topics take 5 of all the others in2016 [Cocco et al 2018] Detecting the relations between patientsrsquo questions and ex-pertsrsquo answers has practical meaning since Pew Research Center1 showed that morethan 35 American adults had gone online to figure out what medical conditionsthey might have

However patient question answering is less studied This domain has many chal-lenges such as the quality and the size of the data set limited annotated data andif the content close to reality There are some existing data sets which are related tothis topic but they all contain some limitations Stanford Natural Language Inference(SNLI) data set [Gururangan et al 2018] is a large data set with high quality It servesa benchmark to evaluate natural language inference systems [Romanov and Shivade2018] whereas the data is consisted of short and simple sentences that cannot mimicthe human behaviour in reality Stanford Question Answering Dataset (SQuAD) [Ra-jpurkar et al 2016] consistes of more than 100000+ questions by crowdsourcing aset of Wikipedia articles where the answers are the segments of texts from the para-graphs Although this data set has high quality and leads to outstanding results itfocuses on the open-domain instead of a specific domain I2b2 is the domain-specificlarge-scale synthetic data set in the domain of electronic medical records [Pampariet al 2018] However the synthetic data set performs weaker than the handcrafteddata sets [Nguyen 2019]

1Pew Research Center httpswwwpewresearchorginternet20130115health-online-2013

1

2 Introduction

On the other hand the size of medical set is another concern since the smalldata set could cause overfitting The medical data sets tend to be locked due tothe reason of ethical and obligatory agreement Also cost constraints and lack ofthe domain experts for annotation are another reasons which cause the small dataset [Nguyen et al 2019] In natural language processing (NLP) field the modelis prone to overfitting by using the small data sets and lead to Out-Of-Vocabulary(OOV) problem [Luong et al 2014] OOV problem is that an unseen token en-countered in the fixed vocabulary language model but the model cannot handle itappropriately [Nguyen et al 2019] In NLP tasks there are high accuracy of unseenwords If the data set is small then the model cannot generalize very well due to theunseen words Then overfitting will occur

To address the above issues we use recognize question entailment (RQE) andnatural language inference (NLI) to generate and expand high-quality data set inorder to help the researcher to investigate patient question answering The dataset is based on Health 24 a medical specific forums so it can mimic real-worldbiomedical question answers The constructed data set can be used in the data-drivenapproaches [Nguyen 2019]

The current top-performing systems for RQENLI use transfer learning Basi-cally they do pre-training on the generic tasks by using supervised or unsupervisedmethods and then use fine-tune to the specific downstream tasks [Chien and Kalita2020] This approach is different to the older models which apply task-specific ar-chitecture on the task-specific labeled data Also transfer learning outperforms theolder models since the pre-training can support better generalization [Erhan et al2010]

In this thesis we used state-of-the-art language model BERT [Devlin et al 2018]to study question entailment and natural language inference in order to crowdsourcemore data Generally speaking providing the existing answers automatically if theoriginal questions are similar to the answered consumer medical questions

12 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in thedomain of patient question answering In general we use fine-tuned RQE model tocheck if the answered questions are similar to the unanswered questions in Health24Then we use fine-tuned NLI model to validate whether these similar answered ques-tionsrsquo answers can be inferred from their unanswered questions For example inHealth 24 through fine-tuned RQE model we know the unanswered question Ais similar to the answered question B Fine-tuned NLI model is to determine if theanswer of question B can be inferred from question A Since we do not have groundtruth we use the idea of pseudo-relevance and significance test to evaluate thesesteps

Our contributions can be summarized as follows

bull A collection of 25721 medical question-answer pairs (includes unanswered

sect13 Outline 3

questions) collected from the trusted source such as Health24 websites

bull A study of BERT model apply to RQE and NLI Using Consumer health ques-tions (CHQs) and MedNLI to fine-tune RQE and NLI model respectively

bull In order to reduce the running time when checking if the answered questionsare similar to unanswered questions we apply BM25 TF-IDF and Word2Vec tofilter some dissimilar questions pairs before applying RQE model

bull Set thresholds in the output of fine-tuned RQE and NLI models when usingthe models to locate the similar questions and validate the answers If themaximum probability is lower than this threshold the data will be rejectedOtherwise we will accept them This can help us to deal with some ambiguouscases and reduce the misclassification rates

bull An evaluation of RQE and NLI model is demonstrated by using Pseudo-relevancefeedback and significance tests The results show our approach is feasible inRQE task However this method is more limited on the NLI task

13 Outline

The thesis is organized as followsIn Chapter 2 we provide some background about different models about word

embedding and introduce BERT modelIn Chapter 3 related work about QRE NLI and augmenting data set are demon-

stratedIn Chapter 4 the motivation of the data set and the analysis is explainedIn Chapter 5 the methods and the implementation of fine-tuning are explained

Besides we introduce three different scoring methods to filter the data Evaluationis illustrated as well

In Chapter 6 we demonstrate and discuss the result of fine-tuned BERT modelIn Chapter 7 this is the conclusion chapter that we summarize the achievements

as well as the limitations and show the future work

4 Introduction

Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic modelsIn terms of the project different types of word embedding architectures are explainedin order to help the readers to understand this thesis Besides the meaning and theimportance of RQE and NLI are introduced as well so that the readers can have asimple idea about the project

21 Background on NLP

The core knowledge about this thesis is natural language processing (NLP) NLP isa sub-field of linguistics and artificial intelligence Its aim is to program the com-puters in order to analyze and process the large amounts of natural language dataIn other words NLP is how computer can understand human language or modellanguage NLP is a broad area and it includes many branches here we list the mostcommonly researched tasks syntax semantics discourse speech and dialogue Forthis project we mainly focus on semantics Meanwhile language models are thecrucial part in NLP because they describe the relationships between the words Weintroduce three different language model architectures to help the readers have abetter understanding of word embedding in the next section

211 Neural Network Language Model

Neural Network Language Model (NNLM) is the central model in NLP and manyother models are based on this Basically the goal of NNLM is to predict the prob-ability of next word by given the previous words in a text [Bengio et al 2003] Themodel is composed of four layers They are input projection hidden and outputlayers respectively At the input layer N previous words are encoded by using one-hot encoding Each word has V dimension where V is the size of the vocabularyThe input layer is projected into the projection layer by using a shared projectionmatrix where the size is Vtimes m and m is the feature vectors associated with eachword After projection each word is a row vector with the dimensionality of m Thiscan be showed as C(wt-n+1)C(wt-2) andC(wt-1) in Figure 21

5

6 Background

Figure 21 NNLM consists of four layers from bottom to to they are input layerprojection layer hidden layer and output layer [Bengio et al 2003]

As for the hidden layer it is the concatenation of word features from the previouslayer

x = (C(wt-1) C(wt-2) C(wt-n+1))

and the vector goes through the hyperbolic tangent tanh Following by that softmaxin the output layer is used to calculate the probability of the next words given by theprevious words

P(wt|wtminus1 wtminusn+1) =eywt

sumi eyi

Function yi is the unnormalized log-probabilities for each output word i

y = b + Wx + Utanh(d + Hx)

where b and d are the biases and H is the hidden layer weight [Bengio et al 2003]

sect21 Background on NLP 7

212 RNN and LSTM

Instead of the traditional feed-forward neural network where the information ispassed forward and the cycle is forbidden the recurrent neural network (RNN) al-lows cycles and the neurons get information at time-step t depends on the output attime-step t-1 In other words at each step the output of the current time-step willdepend on the previous one RNN has sequential inputs and ideally it can learnnot only the current state of the node but also the previous state However it hasthe tendency that the gradient will vanish or explode over many steps The reasonis that RNN is trained by backpropagation through the timea When the gradientspass back the gradients are calculated by continuous multiplications of derivativesIf the values of derivatives are too small or too large the continuous multiplicationsmay cayse the gradient to explore or vanish Besides there are some evidences showthat it is difficult for the traditional RNN to store the information for a long time inpractice

Long short term memory (LSTM) is a special kind of RNN architecture It hasconsiderable performance in the NLP fields such as sentiment classification textsynthesis Compared with the standard RNN it introduces some special hiddenunits to store the information for a long time They are memory cells They connectwith each other at the next time-step with particular weights [Graves et al 2013]Figure 22 shows a memory cell Here are the composite functions

it = σ(Wxixt + Whihtminus1 + Wcictminus1 + bi)

ft = σ(Wx f xt + Wh f htminus1 + Wc f ctminus1 + b f )

ct = ftctminus1 + ittanh(Wxcxt + Whchtminus1 + bc)

ot = σ(Wxoxt + Whohtminus1 + Wcoct + bo)

ht = ottanh(ct)

where σ is the sigmoid function A single memory cell is consisted of input gateforget gate output gate and cell activation vectors and they can be represented by if o and c respectively The W weight matrices from the cell to the gate vectors arediagonal so the input from element m can only be received by element m in eachgate vector [Graves et al 2013]

Within a memory cell LSTM can be divided into four steps

bull Based on the current input and htminus1 ft is to determine what information shouldforget If the output of sigmoid function is 0 that means the information willbe forgotten otherwise the information will be remembered

bull The next step is to determine what new information we store in the cell stateThere are two parts First it determines what information we will update Anew ct is created and added it to the state

bull Update ct by using the value in ft old state the value in it and the current

8 Background

Figure 22 A single memory cell [Graves et al 2013]

state Namely updating ftctminus1 + it ct

bull As for output gate use sigmoid function to determine what part of the cell statewill be outputted Then applyinh tanh function to output the decided part

213 Transformer

RNN and LSTM are the traditional approaches in sequence modeling and transduc-tion problems such as language translation Transformer is another sequence trans-duction model that allows parallelization It has high performance with significantlyless training time [Vaswani et al 2017]

The model architecture is shown in Figure 23 From the figure we can see themodel is composed of two parts one is encoder and the other one is decoder As forencoder it consists of a stack of N=6 identical layers Each layer has two sub-layersone is multi-head attention and the other one is position-wise fully connected feed-forward network So the output of each sub-layer is LayerNorm(x + Sublayer(x))where Sublayer(x) is output of the function implemented in the sub-layer itself Asfor decoder the structure is the same as encoder where it has a stack of N=6 identical

sect21 Background on NLP 9

layers Instead of two sub-layers in encoder decoder has three The extra one ismasked multi-head attention and this can prevent the model to see the content afterthe predicting word [Vaswani et al 2017]

Figure 23 A single memory cell consists of encoder (left part) and decoder (rightpart) [Vaswani et al 2017]

As for self-attention its aim is to calculate the relatedness between two differenttokens in one sequence and find out which one should pay more attention Foreach token the three parameters (query key and value) can be generated duringthe training by multiplying the weights with the input embedding The attentionfunction is

Attention(Q K V) = so f tmax(QKTradic

dk)

where dk is the dimension of query and key Multi-head attention is similar to self-attention and the only difference is that it linearly projects the queries keys andvalues with learned linear projections to dk sk and dv dimensions respectively

MultiHead(Q K V) = Concat(head1 headh)WO

where headi = Attention(QWQi KWK

i VWVi )

where WQi isin Rdmodeltimesdk WK

i isin Rdmodeltimesdk WVi isin Rdmodeltimesdv and WO isin Rhdvtimesdmodel The

benefit is that it allows the model for subspace representation learning and helps themodel focus on the different positions The final embedding is not just the word

10 Background

anymore Positional Encoding is the way to inject some information about a relativeor absolute position about the token because the model does not use recurrence orconvolution

22 Word Embedding

When dealing with natural language it is always problematic since the text is com-posed of words and characters However computers and machine learning modelscannot read and understand these symbols in the human senses and the only thingthat they can accept is the numerical values

Ideally when we convert these texts into numerical representations we wantthem to be semantically meaningful In other words the numerical values shouldcapture the linguistic meaning of the words as much as possible since an informativeand well-chosen numerical representation method can impact on the model perfor-mance Word embedding is the dominating approach to solve this problem It is sopervasive that most of NLP projects use this no matter in the sentiment analysis orin the text classification In general when starting an NLP project people usuallystart by downloading a pre-trained embedding or using their own ways to calculatethe word embedding for their own data sets Generally the word embedding can beseparated into two parts one is static word embedding and the other one is dynamicword embedding For the remaining part we will introduce several different wordembedding approaches based on their complexity

23 Static Word Embedding

For the static word embedding it has the feature that it generates the same wordembedding vector for the same word no matter what the context is

231 Word2Vec

Mikolov et al [2013] introduce two novel architectures to represent the words in thecontinuous vector They learn high-quality word vectors from the huge data setswith billions of words and with millions of words in the vocabulary Meanwhilethe proposed models achieved not only the vectors of similar words are close to eachother but also the words can have multiple degrees of similarity

The two architectures are introduced which can be used as the part of Word2Vecto learn the word embedding they are continuous bag-of-words model (CBOW) andcontinuous skip-gram model (Skip-gram) The former is to learn the embeddingby predicting the current word based on its context and the latter is to learn bypredicting the surrounding words These two models are both similar to the feed-forward neural net language model (NNLM) which is introduced by Bengio et al[2003]

sect23 Static Word Embedding 11

However one of the biggest problems of NNLM is that it has high computationalcomplexity especially between the projection and hidden layers since they are bothfully connected [Mikolov et al 2013] To address this issue Word2Vec makes someimprovements As for CBOW its target is to predict the current word based on thecontext Mikolov et al [2013] remove the hidden layer and the projection layer isshared for all the words Thus all the words are projected in the same positionAlso for the input layer CBOW adds the words from the future rather than onlyadding the words before the target This could help the model to predict basedon the context As for Skip-gram its mechanism is opposite of CBOW which is topredict the surrounding words by given the current word So the model inverts theinput and target The figure is in Figure 24

Figure 24 Two different models that could be used as the part of Word2Vec Theleft hand side is CBOW and the right hand side is Skip-gram [Mikolov et al 2013]

232 GloVe

Another word embedding method is Global Vectors (GloVe) which is introduced byPennington et al [2014] The researchers think although Word2Vec does the well jobon analogy tasks it only trains the word embedding on the separate local contextwindows instead of on the global co-occurrence counts This way may fail to use ahuge amount of repetition in the data As for GloVe the global corpus statistics canbe captured directly by using its novel way

The main intuition of GloVe is through the simple observation that for encoding

12 Background

some form of the meaning the ratios of word-word co-occurrence probabilities havethe potential Before analyzing Figure 25 some notations are established X repre-sents the matrix of word-word co-occurrence counts where the entity Xij means thenumber of times word j occurs in the context of word i within the specific contextwindows Xi means the number of times any words occur in the context of wordi equally sumk Xik Besides Pij = P(j|i) = XijXi it is the probability that word jappears in the context of word i In Figure 25 suppose i = ice and j = steam we can

Figure 25 The measurement of co-occurrence probabilities from the selected contextwords and the target words ice and steam [Pennington et al 2014]

study the ratio of their co-occurrence probabilities with different context words kto exam the relationship of these words We can see solid co-occurs more frequentlywith ice than it does with steam Similarly gas co-occurs more frequently with steamthan it does with ice Based on the observations GloVe proposes a new weightedleast squares regression model and the cost function is

J =V

sumij=1

f (Xij)(wTi wj + bj + bj minus log Xij)

2

where V is the size of the vocabulary f (Xij) is a weighting function wi isin R is wordvector wj isin R is context word vector and bi and bj are biases The weighting functionhas the following properties First f (0) = 0 if one word doesnrsquot co-occur with otherwords Namely Xij = 0 Second this function should be non-decreasing in orderto guarantee the weight of frequent co-occurrences is higher than the weight of rareco-occurrences Third it should not overweight frequent co-occurrences To satisfythese properties and work well the weighting function is a piecewise function

f (x) =

(xxmax)α if x lt xmax

0 otherwise

where xmax is 100 and α is 34 The more detailed information can be found inPennington et al [2014]

sect24 Dynamic Word Embedding 13

24 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area they failto capture the polysemy (such as the word rsquobankrsquo does it mean a financial institutionor the land alongside the river) since they generate the same word vectors regardlessof the contexts Besides these traditional word embedding methods usually usethe shallow models and only leverage off the output from unsupervised models fordownstream tasks whereas the models are usually discarded after training Thiscauses the situation that the word vectors cannot be adjusted based on the contextTo address these issues there is a new trend to go from static word embeddingto dynamic word embedding As for dynamic word embedding it considers theword semantics in different contexts by using deep neural language models [Wangand Kuo 2020] For this section we will involve two start-of-the-art dynamic wordembedding methods one is ELMO and the other one is BERT Before introducingthese two methods pre-training will be briefly mentioned since these two methodsboth use pre-training to generalize the parameters

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures such as neural networks with many hidden layers; complicated functions representing high-level abstractions can be learned by applying these deep architectures [Erhan et al., 2010]. Erhan et al. [2010] also suggest that standard training schemes (random initialization) tend to leave the parameters of deep architectures with poor generalization, whereas the improvement from a pre-training strategy is impressive. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus of labeled or unlabeled data; the pre-trained model is then fine-tuned on a downstream task with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: the model needs to capture the complex characteristics of word use, and words have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. Their experiments show that ELMo works extremely well in practice, with up to 20% relative error reductions compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word; in other words, each token is assigned a representation that is a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is literally an embedding from language models. The two layers are stacked, and each of them has two passes: forward and backward. In the forward pass, the internal state for a given word contains information about the word itself and the context before it. The backward pass is similar, except the state contains information about the word itself and the context after it. The pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector, and the word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layer are fed into the higher layer. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token t_k, there are 2L + 1 representations in an L-layer biLM:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \}

As for the final word representation used for downstream NLP tasks, ELMo collapses all layers in R_k into a single vector with particular weights:

\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, \gamma^{task} is a scalar parameter, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}], with h_{k,0}^{LM} = x_k^{LM} for the token layer.
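To make the weighted combination concrete, here is a minimal Python sketch (the layer states and raw task weights below are toy values, not the output of a real biLM):

import numpy as np

def elmo_combine(layer_states, raw_weights, gamma_task=1.0):
    # Softmax-normalize the task weights, then take the weighted sum of layer states.
    s = np.exp(raw_weights) / np.sum(np.exp(raw_weights))
    return gamma_task * sum(w * h for w, h in zip(s, layer_states))

# L = 2 biLM layers plus the token layer: 3 states of dimension 4.
states = [np.random.randn(4) for _ in range(3)]
print(elmo_combine(states, np.array([0.1, 0.5, -0.2])))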

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: one is the feature-based approach and the other is the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it simply concatenates the left-to-right and right-to-left information, so the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the transformer in BERT is deeply bidirectional, since it computes attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the transformer was introduced in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends a [CLS] token to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding, and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For the masked LM, BERT masks some percentage (15%) of the input tokens at random and then predicts those masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside is that a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the


fine-tuning step. To mitigate this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the remaining 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this relationship is hard to capture from language modeling directly without knowing the meaning of the two sentences. With NSP, the pre-trained model becomes very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
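As an illustration, here is a small Python sketch of the 80%/10%/10% masking rule described above (the token list and vocabulary are toy assumptions, and real BERT works on subword ids rather than raw words):

import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    # Returns the corrupted input and, per position, the token to predict (or None).
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            targets.append(tok)
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                   # 10%: keep the original token
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets

tokens = "my face is swollen and my urine is burning".split()
print(mask_tokens(tokens, vocab=tokens))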

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. In plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B (A entails B)

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 type. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the answered questions that are similar to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multilabel classification (three classes) problem: one class is entailment, one is neutral, and the other is contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy, and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and get muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task validates whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce work related to our purpose and methods. We also explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that applies RQE and NLI to medical data sets in order to find relations between sentences. In Section 3.3 we present work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques for inference and entailment in the medical domain; such applications could help improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE, and question answering (QA), all targeting the medical domain. In this section and the next, we discuss work on RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning, applying support vector machines (SVM), logistic regression, naïve Bayes, and J48 respectively to obtain results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight was that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family [Bhaskar et al., 2019; Zhu et al., 2019] and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] found the results did not perform very well compared with pre-trained language models. They argue that one reason is that traditional deep learning models typically use fixed pre-trained word embeddings to map words into vectors, and these embeddings do not consider context. Zhu et al. [2019] also found that, among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the training set lacks strong negative examples. To address this issue, synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain came when Romanov and Shivade [2018] introduced MedNLI, a new public data set intended to reduce the gap for specialized and knowledge-intensive domains. They used a feature-based system, a Bag-of-Words (BOW) model, an InferSent model, and an ESIM model to establish baselines; among them, InferSent performed best. In the competition, as with the RQE task, the common approach was to train models from the MT-DNN and BERT families and then use an ensemble as the final system to generate results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model consists of three encoders (a syntax encoder, a text encoder, and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations; for the syntax encoder they use a Tree-LSTM to obtain a better linguistic understanding; for the feature encoder they concatenate the outputs of a domain feature extractor and a generic feature extractor. This model achieved significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general, one expert is not enough, since single-expert results can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced from the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to produce data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of the Health24 Data Set

In this chapter we explain the motivation for choosing the data set and describe how the data were collected. We also analyze the collected data from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering, and the existing medical query data sets have limitations: for example, the data are too simple to mimic real human behavior, or a lexical gap exists between patients and experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes the work easier to trust. A reliable data set is also crucial for ideas and models, because it determines whether they succeed.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses, or phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. It is South Africa's leading consumer health website and provides

1 Health24: https://www.health24.com



world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients that helps them get their questions answered. These questions and answers are both public, and they are continuously updated. Hence we use these questions and answers to construct our data set.

4.2 Collecting the Data Set

To collect the data from the Health24 website, we use Scrapy 2, a fast and powerful collaborative framework for extracting data from websites. There are 22 top-level categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.
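As an illustration, here is a minimal sketch of what such a spider can look like with Scrapy; the start URL and CSS selectors are placeholders, not Health24's real page structure:

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/Experts"]  # illustrative entry page

    def parse(self, response):
        # Yield one item per question-answer block on the page.
        for qa in response.css("div.question"):        # hypothetical selector
            yield {
                "question": qa.css("p.q::text").get(),
                "answer": qa.css("p.a::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)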

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/...)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see that the average lengths of the three different types are close together: all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide step-by-step answers corresponding to the patients' questions, telling them what they should and should not do. Thus we keep these informative sentences in the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information does not solve the patients' queries.

2 Scrapy: https://scrapy.org


                         Average Length    Maximum Length
Answered Questions       95.40             5417
Unanswered Questions     85.02             1890
Answers                  92.58             6441

Table 4.1: The average and maximum lengths (in words) of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. Detailed information about fine-tuning and about the three scoring methods for filtering similar questions is given as well. We also present the evaluation metrics.

5.1 Pre-processing

Many medical concepts appear in the sentences in different forms, either abbreviations or multi-word expressions, that all stand for the same meaning. To expand abbreviations and give the sentences as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25, and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase; then we remove punctuation. Following that, we tokenize the words in order to split the sentences into individual units. We also remove stop words. Besides, we apply stemming in order to transform words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
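A plausible implementation of this pipeline with NLTK (the readme in Appendix 3 lists nltk as a dependency; the exact tokenizer and stemmer choices here are assumptions):

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def preprocess(question):
    # Lowercase, strip punctuation, tokenize, drop stop words, then stem.
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(tok) for tok in word_tokenize(text) if tok not in stops]

print(preprocess("Is it safe to take Allergex while pregnant?"))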

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so when doing classification we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the word representations together. Besides, RQE and NLI are both



classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. It includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with RQE, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a BERT model pre-trained on scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we follow the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 for the RQE and NLI tasks respectively. The fine-tuning code is modified from run_classifier.py, which is provided by Huggingface 1.
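The sketch below shows this setup with the Huggingface transformers library; the Hub model name and the train_pairs iterable are illustrative assumptions rather than the project's exact code:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "allenai/scibert_scivocab_uncased"  # SciBERT with its scientific vocabulary
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)  # 3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 4
for step, (text_a, text_b, label) in enumerate(train_pairs):  # assumed data iterable
    batch = tokenizer(text_a, text_b, truncation=True, return_tensors="pt")
    batch["labels"] = torch.tensor([label])
    loss = model(**batch).loss / accumulation_steps  # the [CLS] representation is classified
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()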

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to unanswered questions in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from the corresponding unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain: under the assumption that the top-ranked documents contain many useful terms which help discriminate relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance

1 Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQ data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with labels to the original CHQ data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, most of these pairs are not similar. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions and only take the top five pairs per question. This step reduces the number of pairs in the test set by filtering out obviously dissimilar pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced, otherwise the model will be biased toward the majority class and the result will not be accurate. The NLI task follows similar steps, with one assumption: if an unanswered question and an answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
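A minimal sketch of this rejection rule:

import numpy as np

def accept_label(probs, threshold):
    # Keep the arg-max label only if its probability clears the threshold.
    label = int(np.argmax(probs))
    return label if probs[label] >= threshold else None  # None = rejected

print(accept_label(np.array([0.55, 0.45]), 0.6))        # None: ambiguous RQE pair
print(accept_label(np.array([0.20, 0.35, 0.45]), 0.4))  # 2: accepted NLI pair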

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is the term frequency, which measures the number of times the word occurs in a particular document. The other


is the inverse document frequency, which measures how much information the word provides: if the result is close to 0, the word is common, and vice versa. These two concepts form TF-IDF; here are the formulas:

\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)

\mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d))

\mathrm{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where t is the word of interest, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences, we use cosine similarity:

\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}

where A and B are both vectors; in this case, each element of A and B is the TF-IDF score of one word.
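For illustration, the same computation with scikit-learn (note that its TfidfVectorizer uses a slightly smoothed variant of the IDF formula above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = ["is it safe to take allergex while pregnant",
             "what are the dangers of allergex during pregnancy",
             "is honey better than sugar for diabetics"]

vectors = TfidfVectorizer().fit_transform(questions)  # one TF-IDF vector per question
print(cosine_similarity(vectors[0], vectors[1]))      # higher: related questions
print(cosine_similarity(vectors[0], vectors[2]))      # lower: unrelated questions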

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find documents similar to a particular document or query. Here is the formula:

\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{d_{avg}} \right)}

where:

• D is the document and Q is the query; q_i is the i-th term in the query Q;

• f(q_i, D) is the number of times q_i occurs in the document D;

• |D| is the length of the document;

• d_avg is the average document length;

• k_1 and b are hyper-parameters; in general, k_1 = 2 and b = 0.75.
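A minimal Python sketch of this scoring function; since the formula above leaves IDF(q_i) unspecified, the sketch uses one common smoothed variant:

import math

def bm25_score(query_terms, doc_terms, all_docs, k1=2.0, b=0.75):
    # Score one tokenized document against a tokenized query with Okapi BM25.
    d_avg = sum(len(d) for d in all_docs) / len(all_docs)
    n = len(all_docs)
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in all_docs if q in d)           # document frequency of q
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF (one common variant)
        f = doc_terms.count(q)                            # term frequency in this document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / d_avg))
    return score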

5.4.3 Word2Vec

We introduced Word2Vec, a static word embedding method, in Section 2.3.1. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this module, 2.2 million 200-dimensional word vectors were trained on PubMed (2.89 billion tokens) [Tilbury,


2018]. For each sentence, we sum the embedding vectors of its words and then use cosine similarity to calculate the similarity between sentences.
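A sketch of this sentence-level similarity with gensim (which the readme lists as a dependency); the embedding file name is an illustrative placeholder:

import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary tokens; out-of-vocabulary words are skipped,
    # which is exactly the information loss discussed in Chapter 6.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.sum(known, axis=0) if known else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print(cosine(sentence_vector(["bladder", "infection"]),
             sentence_vector(["urine", "burning"])))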

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means instances correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP (the true class is 1 and the predicted class is 0); TN means instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). For multi-class classification, the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix
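A small sketch computing both metrics from the counts in Table 5.1:

def binary_metrics(y_true, y_pred):
    # Accuracy = (TP + TN) / total; recall = TP / (TP + FN).
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall

print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # (0.6, 0.666...)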


Figure 5.1: The process of the RQE and NLI tasks; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQ data to fine-tune the BERT model. Step 2, use the CHQ data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar answered question. Step 4, combine the question pairs with the generated labels. Step 5, add the labeled question pairs to the CHQ data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input consists of question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that, after adding the extra data, the average accuracy does not change dramatically compared with the result on the original data. In order to validate our approach, we calculate p-values, which help us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before obtaining a p-value, a t-test needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula for the unpaired t-test:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

where

s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
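For illustration, such a p-value can be computed with SciPy's pooled-variance two-sample t-test, here on the accuracies from Table 6.1 (depending on the one-sided versus two-sided convention, the result may differ from the values reported in Table 6.2):

from scipy import stats

baseline = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]  # CHQs column of Table 6.1
tfidf    = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF column

t, p_two_sided = stats.ttest_ind(tfidf, baseline)  # pooled-variance (unpaired) t-test
print(t, p_two_sided)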

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed, whose sources comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all



formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label the RQE model generates is 0; in other words, most of the question pairs are labeled dissimilar. In order to keep the data balanced, only a small part of them can be selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data may not help fine-tune the model as much as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data selected by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task

           Added pairs   p-value
TF-IDF     8084          0.0380
BM25       7390          0.0499
Word2Vec   1830          0.0514

Table 6.2: The number of added pairs and the p-values for the RQE task

Table 6.4 shows the numbers of added pairs and the p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas more than 8,000 pairs were added in the RQE task. Two conditions apply when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the positive question pairs from the RQE task have some probability of being inferable from their


corresponding unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task. Moreover, NLI is a three-label classification problem, so compared with a binary classification problem each class has less data.

These are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with Table 6.5, we find that the more data are added, the larger this value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287), so adding the extra data to the original data effectively adds novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts for k-fold cross-validation, so we obtain five accuracies for each method. Such a small sample size cannot support a very precise p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task

           Added pairs   p-value
TF-IDF     1287          0.3340
BM25       1149          0.4191
Word2Vec   459           0.6661

Table 6.4: The number of added pairs and the p-values for the NLI task


                      Standard deviation of loss
MedNLI                0.01392
MedNLI + TF-IDF       0.02166
MedNLI + BM25         0.01726
MedNLI + Word2Vec     0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, brings many challenges, including the limited amount of annotated data, the quality of existing data, and the content of the data. To address some of these issues and construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; it contains 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions similar to the unanswered ones. Intuitively, to find whether two questions are similar, we could pair each unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we have no ground truth for evaluating the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method constructs high-quality question pairs with BM25 and



TF-IDF in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are several places to improve:

• For this project we only have 1,950 questions without answers, and after the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, so it easily overfits on a small amount of data. The next step is to expand the data; then it will be more persuasive to validate our idea.

• In the previous chapter we noted that the added data for the NLI task differ in format from the original NLI data. The next step is to find more suitable data with a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. The next step is to optimize the way Word2Vec is used.

• Fine-tuning and generating labels were done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. The next step is hyperparameter optimization; we can also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association.

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379.

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375.

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137-1155.

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250.

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE.


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174.

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054.

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625-660.

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018).

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019).

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE.

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014).


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63.

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171.

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019).

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109.

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018).


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018).

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018).

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020).

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426.

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388.

Appendices


Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and the processors used when fine-tuning and testing the BERT model; this code is from run_classifier.py in huggingface 3.

load_file.py converts the raw source data to a dataframe; pre_processed_data pre-processes the sentences.

For .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified from run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab 4.

4 Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating, and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the generated labels for the pairs; they can also be found in the result folder.


Contents

Acknowledgments

Abstract

1 Introduction
    1.1 Motivation
    1.2 Main Contribution
    1.3 Outline

2 Background
    2.1 Background on NLP
        2.1.1 Neural Network Language Model
        2.1.2 RNN and LSTM
        2.1.3 Transformer
    2.2 Word Embedding
    2.3 Static Word Embedding
        2.3.1 Word2Vec
        2.3.2 GloVe
    2.4 Dynamic Word Embedding
        2.4.1 Pre-training
        2.4.2 ELMo
        2.4.3 BERT
    2.5 Tasks
        2.5.1 Recognizing question entailment (RQE)
        2.5.2 Natural Language Inference (NLI)

3 Related Work
    3.1 RQE
    3.2 NLI
    3.3 Augment Data Size

4 Construction of Health24 Data set
    4.1 Motivation
    4.2 Collecting Data set
    4.3 Analysis for Data set

5 Methodology
    5.1 Pre-processing
    5.2 Fine-Tuning
    5.3 Method
    5.4 Similarity Scores Methods
        5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)
        5.4.2 Best Matching 25 (BM25)
        5.4.3 Word2Vec
    5.5 Evaluation Metrics

6 Discussion on Results

7 Conclusion and Future Work
    7.1 Conclusion
    7.2 Future Work

Bibliography

Appendices

Appendix 1: Project Description and Contract

Appendix 2: Artefact Description

Appendix 3: Readme File

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003]

2.2 A single memory cell [Graves et al., 2013]

2.3 The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017]

2.4 Two different models that can be used as part of Word2Vec; the left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013]

2.5 The measurement of co-occurrence probabilities for the selected context words and the target words ice and steam [Pennington et al., 2014]

2.6 ELMo is formed by a two-layer bidirectional language model; the forward and backward passes form a multilayer LSTM, each LSTM passes its result to an intermediate word vector, and the word representation is obtained by summing all the intermediate word vectors with some particular weights [Josht]

5.1 The diagram shows the process of how the RQE and NLI tasks work

1 The tree structure of the health24 file


List of Tables

4.1 The length of questions and answers

5.1 Table for the confusion matrix

6.1 Shows the accuracy of the different methods for the RQE task
6.2 Shows the added data numbers and p-values for the RQE task
6.3 Shows the accuracy of the different methods for the NLI task
6.4 Shows the added data numbers and p-values for the NLI task
6.5 Shows the standard deviation of the loss for the four methods


Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem caused by small medical data sets. In Section 1.2, the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which a system automatically provides a set of correct answers to questions asked by users in natural language [Lende and Raghuwanshi, 2016]. This topic is widely researched in the open domain. Medical questions are the second most searched thematic area in Google, and such topics accounted for 5% of all searches in 2016 [Cocco et al., 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center1 showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data sets, limited annotated data, and whether the content is close to reality. Some existing data sets are related to this topic, but they all have limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al., 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade, 2018], but the data consists of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] consists of more than 100,000 questions created by crowdsourcing over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific, large-scale synthetic data set in the domain of electronic medical records [Pampari et al., 2018]. However, synthetic data sets perform weaker than handcrafted data sets [Nguyen, 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013/


On the other hand, the size of the medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away due to ethical and obligatory agreements. Also, cost constraints and the lack of domain experts for annotation are other reasons for small data sets [Nguyen et al., 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting and to the Out-Of-Vocabulary (OOV) problem [Luong et al., 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al., 2019]. In NLP tasks, unseen words occur frequently; if the data set is small, the model cannot generalize well to these unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set, in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen, 2019].

The current top-performing systems for RQE and NLI use transfer learning. Basically, they pre-train on generic tasks using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita, 2020]. This approach differs from older models, which apply a task-specific architecture to task-specific labeled data. Transfer learning also outperforms the older models, since pre-training supports better generalization [Erhan et al., 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al., 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide existing answers automatically if the original questions are similar to already answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check if the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, suppose that in Health24, through the fine-tuned RQE model, we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines whether the answer to question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking if the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter out some dissimilar question pairs before applying the RQE model.

• Thresholds are set on the output of the fine-tuned RQE and NLI models when using the models to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the data will be rejected; otherwise, we accept them. This helps us deal with ambiguous cases and reduce the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide some background on different word embedding models and introduce the BERT model.

In Chapter 3, related work on RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. The evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7, we conclude by summarizing the achievements as well as the limitations, and we show the future work.


Chapter 2

Background

In this chapter, we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a simple idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP, because they describe the relationships between words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: the input, projection, hidden and output layers respectively. At the input layer, N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer using a shared projection matrix of size $V \times m$, where m is the number of feature dimensions associated with each word. After projection, each word is a row vector of dimensionality m. This is shown as $C(w_{t-n+1}), \dots, C(w_{t-2})$ and $C(w_{t-1})$ in Figure 2.1.


Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003]

As for the hidden layer, it is the concatenation of the word features from the previous layer:

$x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1}))$

and this vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \dfrac{e^{y_{w_t}}}{\sum_i e^{y_i}}$

The function $y_i$ gives the unnormalized log-probability for each output word i:

$y = b + Wx + U\tanh(d + Hx)$

where b and d are the biases and H is the hidden layer weight matrix [Bengio et al., 2003].
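To make the four layers concrete, the following is a minimal sketch of the NNLM forward pass in PyTorch; the class name, toy dimensions and layer names are our own illustrative choices, not taken from the original paper.

    import torch
    import torch.nn as nn

    class NNLM(nn.Module):
        # Sketch of Bengio et al.'s model: projection -> tanh hidden layer -> softmax output.
        def __init__(self, vocab_size, m=100, hidden=60, n_prev=4):
            super().__init__()
            self.C = nn.Embedding(vocab_size, m)                # shared projection matrix (V x m)
            self.H = nn.Linear(n_prev * m, hidden)              # hidden weights H with bias d
            self.U = nn.Linear(hidden, vocab_size, bias=False)  # hidden-to-output weights U
            self.W = nn.Linear(n_prev * m, vocab_size)          # direct connections W with bias b

        def forward(self, prev_words):                          # (batch, n_prev) word ids
            x = self.C(prev_words).flatten(1)                   # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
            y = self.W(x) + self.U(torch.tanh(self.H(x)))       # y = b + Wx + U tanh(d + Hx)
            return torch.log_softmax(y, dim=-1)                 # log P(w_t | previous words)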


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons' output at time-step t depends on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, these continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces some special hidden units to store information for a long time: the memory cells. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$

$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$

$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$

$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$

$h_t = o_t \tanh(c_t)$

where $\sigma$ is the sigmoid function. A single memory cell consists of the input gate, forget gate, output gate and cell activation vectors, represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so the input from element m can only be received by element m in each gate vector [Graves et al., 2013].

Within a memory cell, the LSTM computation can be divided into four steps:

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, $i_t$ determines what information we will update; then a new candidate state $\tilde{c}_t$ is created to be added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$ and the current candidate state, namely updating $c_t = f_t c_{t-1} + i_t \tilde{c}_t$.

• As for the output gate, a sigmoid function is used to determine what part of the cell state will be outputted. Then a tanh function is applied to output the decided part.

Figure 2.2: A single memory cell [Graves et al., 2013]
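These composite functions translate directly into code. Below is a small NumPy sketch of one LSTM step with peephole connections, assuming the weight and bias shapes are consistent; the dictionary layout is our own illustrative choice.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # One memory-cell update following the composite functions above.
        # Peephole weights W['ci'], W['cf'], W['co'] are diagonal, hence elementwise products.
        i = sigmoid(W['xi'] @ x_t + W['hi'] @ h_prev + W['ci'] * c_prev + b['i'])  # input gate
        f = sigmoid(W['xf'] @ x_t + W['hf'] @ h_prev + W['cf'] * c_prev + b['f'])  # forget gate
        c = f * c_prev + i * np.tanh(W['xc'] @ x_t + W['hc'] @ h_prev + b['c'])    # new cell state
        o = sigmoid(W['xo'] @ x_t + W['ho'] @ h_prev + W['co'] * c + b['o'])       # output gate
        h = o * np.tanh(c)                                                         # hidden state
        return h, c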

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented by the sub-layer itself. The decoder has the same structure as the encoder, with a stack of N = 6 identical layers. Instead of the two sub-layers in the encoder, each decoder layer has three: the extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017]

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying weight matrices with the input embedding. The attention function is

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively:

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O$

$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. The benefit is that it allows the model to perform subspace representation learning and helps the model focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model does not use recurrence or convolution.
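As a concrete illustration, here is a minimal PyTorch sketch of the scaled dot-product attention defined above (a single head, with toy sizes chosen by us):

    import torch

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise relatedness between tokens
        weights = torch.softmax(scores, dim=-1)        # attention distribution over tokens
        return weights @ V

    # toy check: a sequence of 5 tokens with d_k = d_v = 8
    Q = K = V = torch.randn(5, 8)
    out = scaled_dot_product_attention(Q, K, V)        # shape: (5, 8)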

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters; however, computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by calculating word embeddings for their own data sets. Broadly, word embeddings can be separated into two kinds: static word embeddings and dynamic word embeddings. In the remaining part, we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

A static word embedding generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that words can have multiple degrees of similarity.

The two introduced architectures, which can be used as part of Word2Vec to learn the word embedding, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, whose target is to predict the current word based on the context, Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both models are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec; the left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013]
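For illustration, a CBOW or Skip-gram model can be trained with the gensim library (already among this project's dependencies); the corpus and parameter values below are toy choices of ours:

    from gensim.models import Word2Vec

    # Two tokenized toy sentences; a real corpus would contain millions of sentences.
    sentences = [["patient", "has", "an", "allergy", "to", "pollen"],
                 ["antihistamines", "can", "manage", "an", "allergy"]]

    # sg=0 selects CBOW (predict the current word from its context);
    # sg=1 would select Skip-gram (predict the context from the current word).
    model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=0)

    vector = model.wv["allergy"]                # a 200-dimensional word vector
    similar = model.wv.most_similar("allergy")  # nearest words by cosine similarity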

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, and this may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures global corpus statistics directly.

The main intuition behind GloVe is the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word j occurs in the context of word i within the specific context windows. $X_i$ is the number of times any word occurs in the context of word i, i.e. $\sum_k X_{ik}$. Besides, $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam.

Figure 2.5: The measurement of co-occurrence probabilities for the selected context words and the target words ice and steam [Pennington et al., 2014]

We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see that solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, whose cost function is

$J = \displaystyle\sum_{i,j=1}^{V} f(X_{ij})\,(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$

where V is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i \in \mathbb{R}^d$ is a word vector, $\tilde{w}_j \in \mathbb{R}^d$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, $f(0) = 0$, for when one word does not co-occur with another word, i.e. $X_{ij} = 0$. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$

where $x_{max}$ is 100 and $\alpha$ is 3/4. More detailed information can be found in Pennington et al. [2014].
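The weighting function is easy to sanity-check in code; this small Python sketch uses the stated values $x_{max} = 100$ and $\alpha = 3/4$:

    def glove_weight(x, x_max=100.0, alpha=0.75):
        # f(x): zero weight for pairs that never co-occur (f(0) = 0),
        # smooth down-weighting below x_max, and a cap of 1 above it
        # so frequent co-occurrences are not overweighted.
        return (x / x_max) ** alpha if x < x_max else 1.0

    print(glove_weight(0))    # 0.0
    print(glove_weight(10))   # ~0.178
    print(glove_weight(500))  # 1.0 (capped)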


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing these two methods, pre-training is briefly mentioned, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer to use deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using the standard training scheme (random initialization) to train deep architectures tends to make the parameters generalize poorly, while the improvement from using a pre-training strategy is impressive for deep models. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: not only does the model need to consider the complex characteristics of word use, but words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show that ELMo works extremely well in practice, achieving up to 20% error reductions compared with other state-of-the-art models.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it. The backward pass is similar, except that its state contains information about the word itself and the context after it. Each pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, there are 2L + 1 representations in an L-layer biLM:

$R_k = \{\,x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L\,\}$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R_k$ into a single vector based on some particular weights:

$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \displaystyle\sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning-based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer was introduced in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT adds a [CLS] token to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens at random and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, there is a downside: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains for a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another; this relation is hard to capture from language modeling directly, without knowing the meaning of the two sentences. With NSP, the pre-trained model is very beneficial for these tasks. The introduced pre-trained BERT model will be applied to the RQE and NLI tasks in this project; how we fine-tune it is introduced in the following chapters.

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) addresses a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. Or, in plain English, it is to determine whether question A and question B are similar questions. This task is binary classification: it includes positive (the two questions are similar) and negative (the two questions are not similar) labels. Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A entails Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 type. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; its use here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multilabel classification problem with three classes: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it is obtained, it is advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3, we demonstrate some work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, such applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), and the target is the medical domain. In this section and the next, we discuss some work on RQE and NLI from the competition teams.

The initial work on RQE used lexical and semantic features as the input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to obtain the results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN-based models perform better than BERT-based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data: for example, the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish baselines; among them, InferSent performs better than the other three methods. In the competition, as with the RQE task, the common approach is to train models from the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM, to obtain a better linguistic understanding. For the feature encoder, they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can otherwise be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision methods, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses or phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health241 is a good choice for construction.

1 Health24: https://www.health24.com


Health24 is South Africa's leading consumer health website. It provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information, such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My  month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see that the average length of the three different types is close together: they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                          Average Length   Maximum Length
Answered Questions            95.40             5417
Unanswered Questions          85.02             1890
Answers                       92.58             6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also mention the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g. expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
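A minimal sketch of this pipeline with NLTK (which the project installs) might look as follows; the function name is our own:

    import string
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # one-off downloads: nltk.download('punkt'); nltk.download('stopwords')

    def preprocess(question):
        # Lowercase, strip punctuation, tokenize, drop stop words, then stem.
        text = question.lower()
        text = text.translate(str.maketrans('', '', string.punctuation))
        tokens = word_tokenize(text)
        stops = set(stopwords.words('english'))
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens if t not in stops]

    print(preprocess("Is it safe to take Allergex while pregnant?"))
    # ['safe', 'take', 'allergex', 'pregnant']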

5.2 Fine-Tuning

For downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so when doing a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since the model improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface1.
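In outline, one fine-tuning step with the huggingface transformers library looks roughly like the sketch below; this is an illustrative reconstruction, assuming a recent transformers version and a local scibert_scivocab_uncased folder, not the project's exact script.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('scibert_scivocab_uncased')
    model = BertForSequenceClassification.from_pretrained(
        'scibert_scivocab_uncased', num_labels=2)  # num_labels=3 for the NLI task

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # One training step on a toy question pair (label 1 = entailment).
    enc = tokenizer("Is Allergex safe in pregnancy?",
                    "Can I take Allergex while pregnant?",
                    return_tensors='pt', truncation=True, padding=True)
    loss = model(**enc, labels=torch.tensor([1])).loss
    (loss / 4).backward()   # divide by the 4 accumulation steps; step after every 4th batch
    optimizer.step()
    optimizer.zero_grad()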

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

1 Huggingface GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potential similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the testing set is to pair each unanswered question with all the answered questions; however, if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs, and not every pair is similar. To reduce the running time and the complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing to note is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if an unanswered question and an answered question are similar, then the answer to the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the absence of experts, the rejected data are removed from the data set.
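The rejection rule can be sketched in a few lines; the function below is our own illustration of the idea:

    import torch

    def predict_with_threshold(logits, threshold):
        # Accept a pair only when the classifier is confident enough;
        # otherwise reject it so it does not add noise to the training data.
        probs = torch.softmax(logits, dim=-1)
        confidence, label = probs.max(dim=-1)
        return label, confidence >= threshold

    # RQE (binary): threshold = 0.6; NLI (three labels): threshold = 0.4.
    logits = torch.tensor([[0.3, 0.9], [0.48, 0.52]])
    labels, accepted = predict_with_threshold(logits, threshold=0.6)
    # The second pair's maximum probability is about 0.51 < 0.6, so it is rejected.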

54 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other is inverse document frequency, which measures how much information the word provides; if the result is close to 0, the word is a common one, and vice versa. These two concepts together form TF-IDF. Here is the formula:

\[ \mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \]

\[ \mathrm{TF}(t, d) = \log(1 + \mathit{freq}(t, d)) \]

\[ \mathrm{IDF}(t, D) = \frac{N}{\mathrm{count}(d \in D : t \in d)} \]

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

\[ \mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|} \]

where A and B are both vectors. In this case, each element in the vectors A and B represents one word's TF-IDF value.

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

\[ \mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot |D| / d_{avg})} \]

where

• D is the document, Q is the query, and q_i is a term in the query Q;

• f(q_i, D) is the number of times q_i occurs in the document D;

• |D| is the length of the document;

• d_{avg} is the average length of a document;

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.
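Since BM25 is a short formula over term counts, it is straightforward to implement directly. The sketch below is our own illustration; it uses the common smoothed IDF from the Okapi family, a detail the thesis does not spell out, so treat that choice as an assumption.

import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=2.0, b=0.75):
    """Score every tokenized document in docs_tokens against the query."""
    N = len(docs_tokens)
    d_avg = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequency
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for q in query_tokens:
            if tf[q] == 0:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            score += idf * tf[q] * (k1 + 1) / (
                tf[q] + k1 * (1 - b + b * len(d) / d_avg))
        scores.append(score)
    return scores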

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between sentences.
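Summing per-word vectors and comparing with cosine similarity takes only a few lines with gensim. In the sketch below, the vector file name is a placeholder for wherever the Chiu et al. [2016] PubMed vectors are stored locally; out-of-vocabulary tokens are simply skipped, which is precisely the weakness discussed in Chapter 6.

import numpy as np
from gensim.models import KeyedVectors

# Placeholder path for the pre-trained 200-dimensional PubMed vectors.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    """Sum the embeddings of in-vocabulary tokens; OOV tokens are skipped."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

sim = cosine(sentence_vector("chest pain after eating".split()),
             sentence_vector("stomach ache and heartburn".split()))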

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP counts the instances correctly assigned to the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 but the predicted class is 0. TN counts the instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN)/total. Recall measures the percentage of all relevant results that are correctly classified; its formula is TP/(TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
  Predicted Positive (1)  TP                    FP
  Predicted Negative (0)  FN                    TN

Table 5.1: Table for the confusion matrix.
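Both metrics are one-line computations once the four counts are known. The counts below are made-up numbers purely to illustrate the formulas.

def accuracy(tp, fp, fn, tn):
    """How often the classifier is correct: (TP + TN) / total."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Share of actual positives that were found: TP / (TP + FN)."""
    return tp / (tp + fn)

print(accuracy(tp=90, fp=5, fn=3, tn=102))  # 0.96
print(recall(tp=90, fn=3))                  # ~0.9677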


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper part is RQE and the lower part is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model; step 2, use CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each pair of unanswered question and potential similar question; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate p-values, which help us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \]

where

\[ s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2} \]
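SciPy's pooled-variance t-test implements exactly this statistic, so the reported p-values can be recomputed from the per-iteration accuracies. The thesis does not state whether the test is one- or two-sided, so the call below (using the CHQs and CHQs + TF-IDF columns of Table 6.1) is illustrative rather than an exact reproduction.

from scipy import stats

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]  # Table 6.1, CHQs
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # Table 6.1, CHQs + TF-IDF

# equal_var=True selects the pooled-variance (unpaired) t-test above.
t_stat, p_value = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
print(t_stat, p_value)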

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives.

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are labeled not similar. In order to keep the data balanced, only a small part of them can be selected, which is why the added data for Word2Vec are the smallest. Furthermore, such a small amount of added data might not fine-tune the model as well as a larger amount would.

On the other hand, recall measures the percentage of all relevant results that are correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: Shows the accuracy in different methods for the RQE task.

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: Shows the added data number and p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values for the NLI task. We find that all the values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number more than 8,000. There are two conditions on how we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. Here we use the smallest class size in the test results as the base when selecting the added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; the two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data to the original data, it effectively adds novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: Shows the accuracy in different methods for the NLI task.

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: Shows the added data number and p-values for the NLI task.


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: Shows the standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues, and to construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health 24; it contains 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers among the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we evaluate with the idea of pseudo-relevance feedback. More specifically, we insert the constructed data into the official data and train the model again; if the accuracy increases, it reflects that the constructed data have high accuracy, otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that could be improved:

• For this project we only have 1,950 questions without answers, and after the two conditions applied in the NLI task, the number of added data becomes smaller. Besides, BERT is a complex model, and it is easy to cause overfitting with a small data set. So the next step is to expand our data; then validating our idea will be more persuasive.

• In the previous chapter we argued that the added data for the NLI task differ in format from the original NLI data. The next step is to find more suitable data of a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. The next step is to optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. The next step is hyperparameter optimization; we can also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for these projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2612 54509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: folders which store the source data and the generated data, .py files, and .ipynb files.

For the folders:

data: includes the original data files and the data files generated using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health 24 website; result.csv stores the result. In this folder the framework is provided by Scrapy.1

raw_nli_data and raw_rqe_data store the source data.

result stores the labels generated by the RQE and NLI test models.

For the .py files:

bm25.py, tfidf.py and word_embedding.py are three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.2

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. This code comes from run_classifier.py in huggingface.3

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.

For the .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and runs k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab.4

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.


Contents

Acknowledgments iii

Abstract v

1 Introduction 1
1.1 Motivation 1
1.2 Main Contribution 2
1.3 Outline 3

2 Background 5
2.1 Background on NLP 5
2.1.1 Neural Network Language Model 5
2.1.2 RNN and LSTM 7
2.1.3 Transformer 8
2.2 Word Embedding 10
2.3 Static Word Embedding 10
2.3.1 Word2Vec 10
2.3.2 GloVe 11
2.4 Dynamic Word Embedding 13
2.4.1 Pre-training 13
2.4.2 ELMo 13
2.4.3 BERT 15
2.5 Tasks 16
2.5.1 Recognizing question entailment (RQE) 16
2.5.2 Natural Language Inference (NLI) 17

3 Related Work 19
3.1 RQE 19
3.2 NLI 20
3.3 Augment Data Size 20

4 Construction of Health24 Data set 23
4.1 Motivation 23
4.2 Collecting Data set 24
4.3 Analysis for Data set 24

5 Methodology 27
5.1 Pre-processing 27
5.2 Fine-Tuning 27
5.3 Method 28
5.4 Similarity Scores Methods 29
5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF) 29
5.4.2 Best Matching 25 (BM25) 30
5.4.3 Word2Vec 30
5.5 Evaluation Metrics 31

6 Discussion on Results 33

7 Conclusion and Future Work 37
7.1 Conclusion 37
7.2 Future Work 38

Bibliography 39

Appendices 43

Appendix 1 Project Description and Contract 45

Appendix 2 Artefact Description 49

Appendix 3 Readme File 51

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003] 6

2.2 A single memory cell [Graves et al., 2013] 8

2.3 The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017] 9

2.4 Two different models that could be used as part of Word2Vec. The left hand side is CBOW and the right hand side is Skip-gram [Mikolov et al., 2013] 11

2.5 The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al., 2014] 12

2.6 ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector, and the word representation is obtained by summing all the intermediate word vectors with particular weights [Josht] 14

5.1 The diagram shows the process of how the RQE and NLI tasks work. The upper part is RQE and the lower part is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model; step 2, use CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each pair of unanswered question and potential similar question; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs 32

1 The tree structure of the health24 folder 50

List of Tables

4.1 The length of questions and answers 25

5.1 Table for the confusion matrix 31

6.1 Shows the accuracy in different methods for the RQE task 34
6.2 Shows the added data number and p-values for the RQE task 34
6.3 Shows the accuracy in different methods for the NLI task 35
6.4 Shows the added data number and p-values for the NLI task 35
6.5 Shows the standard deviation of the loss for the four methods 36

Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem on small medical data sets. In Section 1.2 the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which a system automatically provides a set of correct answers to a question asked by a user in natural language [Lende and Raghuwanshi, 2016]. This topic is widely researched in the open domain. Medical questions are the second most searched thematic area in Google, and the topic made up 5% of all searches in 2016 [Cocco et al., 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center1 showed that more than 35% of American adults have gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and size of the data sets, limited annotated data, and whether the content is close to reality. There are some existing data sets related to this topic, but they all have some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al., 2018] is a large data set of high quality; it serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade, 2018], whereas the data consist of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] consists of more than 100,000 questions crowdsourced over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific, large-scale synthetic data set in the domain of electronic medical records [Pampari et al., 2018]; however, synthetic data sets perform worse than handcrafted data sets [Nguyen, 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013


On the other hand, the size of the medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away for ethical reasons and because of obligatory agreements; cost constraints and a lack of domain experts for annotation are other reasons for small data sets [Nguyen et al., 2019]. In natural language processing (NLP), a model is prone to overfitting on small data sets, which also leads to the Out-Of-Vocabulary (OOV) problem [Luong et al., 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al., 2019]. In NLP tasks, unseen words occur frequently; if the data set is small, the model cannot generalize well to them, and overfitting occurs.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set, in order to help researchers investigate patient question answering. The data set is based on Health 24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen, 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks using supervised or unsupervised methods, and then fine-tune on the specific downstream tasks [Chien and Kalita, 2020]. This approach differs from older models, which apply task-specific architectures to task-specific labeled data. Transfer learning also outperforms these older models, since pre-training supports better generalization [Erhan et al., 2010].

In this thesis we use the state-of-the-art language model BERT [Devlin et al., 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide existing answers automatically if the original questions are similar to already answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs that can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check whether the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from the unanswered questions. For example, in Health 24, the fine-tuned RQE model tells us that unanswered question A is similar to answered question B; the fine-tuned NLI model then determines whether the answer to question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions), collected from a trusted source, the Health24 website.

• A study of applying the BERT model to RQE and NLI, using Consumer Health Questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter out some dissimilar question pairs before applying the RQE model.

• Thresholds on the outputs of the fine-tuned RQE and NLI models when using them to locate similar questions and validate answers: if the maximum probability is lower than the threshold, the data are rejected; otherwise we accept them. This helps us deal with ambiguous cases and reduces the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2 we provide background on different word embedding models and introduce the BERT model.

In Chapter 3, related work on RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. Evaluation is illustrated as well.

In Chapter 6 we demonstrate and discuss the results of the fine-tuned BERT model.

Chapter 7 is the conclusion chapter, where we summarize the achievements as well as the limitations and show the future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. Different types of word embedding architectures relevant to the project are explained to help readers understand this thesis. Besides, the meaning and importance of RQE and NLI are introduced as well, so that readers can get a basic idea of the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence; its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand or model human language. NLP is a broad area with many branches; the most commonly researched tasks are syntax, semantics, discourse, speech and dialogue. In this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help readers better understand word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: input, projection, hidden and output. At the input layer, the N previous words are encoded using one-hot encoding; each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer using a shared projection matrix of size V × m, where m is the number of features associated with each word. After projection, each word is a row vector of dimensionality m; this is shown as C(w_{t-n+1}), ..., C(w_{t-2}) and C(w_{t-1}) in Figure 2.1.


Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003].

As for the hidden layer, it is the concatenation of the word features from the previous layer:

\[ x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})) \]

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

\[ P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \]

The function y_i gives the unnormalized log-probability for each output word i:

\[ y = b + Wx + U \tanh(d + Hx) \]

where b and d are the biases and H is the hidden layer weight [Bengio et al., 2003].
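Read as code, the whole forward pass is a handful of matrix operations. The numpy sketch below is our own illustration of the equations above, with parameter shapes stated in the comments; none of this is from the thesis's codebase.

import numpy as np

def nnlm_forward(context_ids, C, H, d, W, U, b):
    """One NNLM forward pass for a context of n-1 previous word ids.
    C: |V| x m embedding table; x has length (n-1)*m; the result is a
    probability distribution over the vocabulary."""
    x = np.concatenate([C[i] for i in context_ids])  # projection layer
    hidden = np.tanh(d + H @ x)                      # hidden layer
    y = b + W @ x + U @ hidden                       # unnormalized log-probs
    e = np.exp(y - y.max())
    return e / e.sum()                               # softmax over vocabulary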


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons' output at time-step t depends on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. An RNN takes sequential inputs and, ideally, can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradients vanish or explode over many steps. The reason is that an RNN is trained by backpropagation through time: when the gradients pass back, they are computed by continuous multiplication of derivatives, and if the values of those derivatives are too small or too large, the repeated multiplication can cause the gradients to vanish or explode. Besides, there is some evidence showing that it is difficult for a traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It performs considerably well in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units, memory cells, to store information over long periods; they connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

\[ i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \]
\[ f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \]
\[ c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \]
\[ o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \]
\[ h_t = o_t \tanh(c_t) \]

where σ is the sigmoid function. A single memory cell consists of an input gate, forget gate, output gate and cell activation vector, represented by i, f, o and c respectively. The weight matrices W from the cell to the gate vectors are diagonal, so element m of each gate vector only receives input from element m of the cell vector [Graves et al., 2013].

Figure 2.2: A single memory cell [Graves et al., 2013].

Within a memory cell, an LSTM step can be divided into four parts (a code sketch follows the list):

• Based on the current input and h_{t-1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information is forgotten; otherwise it is remembered.

• The next step is to determine what new information to store in the cell state. There are two parts: first decide what information to update, then create a new candidate cell value and add it to the state.

• Update c_t by using the value of f_t, the old state, the value of i_t and the candidate state; namely, c_t = f_t c_{t-1} + i_t c̃_t.

• As for the output gate, use the sigmoid function to determine which part of the cell state to output, then apply the tanh function to produce the decided part.
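The gate equations translate line for line into code. Below is a minimal numpy sketch of one LSTM step, our own illustration; the cell-to-gate weights W_ci, W_cf and W_co are applied elementwise because, as noted above, those matrices are diagonal.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts holding the gate parameters."""
    i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
    f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
    c = f * c_prev + i * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
    o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c + b["o"])
    h = o * np.tanh(c)
    return h, c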

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization; it achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. The decoder has the same structure, with a stack of N = 6 identical layers, but each layer has three sub-layers instead of two; the extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying the input embedding with learned weight matrices. The attention function is

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V \]

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to d_k, d_k and d_v dimensions respectively:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O \]
\[ \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \]

where W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v} and W^O ∈ R^{h d_v × d_model}. The benefit is that it allows the model to learn subspace representations and helps the model focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
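Because the attention function is a single matrix expression, it fits in a few lines of numpy. This is our own single-head sketch of the formula above, not code from the thesis:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token relatedness
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V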

2.2 Word Embedding

When dealing with natural language, a recurring problem is that text is composed of words and characters; computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful: the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation can affect model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. Generally, when starting an NLP project, people begin by downloading a pre-trained embedding or computing their own word embeddings for their data sets. Word embeddings can be separated into two kinds: static word embeddings and dynamic word embeddings. In the remainder of this chapter we introduce several word embedding approaches in order of complexity.

2.3 Static Word Embedding

A static word embedding generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that words can have multiple degrees of similarity.

The two architectures that can be used as part of Word2Vec to learn word embeddings are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns embeddings by predicting the current word based on its context; the latter learns by predicting the surrounding words. Both models are similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. For CBOW, whose target is to predict the current word based on the context, Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared by all words, so all words are projected into the same position. Also, at the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model predict from both sides of the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two architectures are shown in Figure 2.4.

Figure 2.4: Two different models that could be used as part of Word2Vec. The left hand side is CBOW and the right hand side is Skip-gram [Mikolov et al., 2013].
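Both architectures are available through gensim's Word2Vec class, which is a convenient way to see the CBOW/Skip-gram distinction in practice. A minimal sketch follows; the parameter names follow gensim 4, and the toy corpus is our own:

from gensim.models import Word2Vec

sentences = [
    ["patient", "reports", "chest", "pain"],
    ["chest", "pain", "after", "exercise"],
    ["persistent", "cough", "and", "chest", "pain"],
]
# sg=0 selects CBOW (predict the current word from its context);
# sg=1 selects Skip-gram (predict the context from the current word).
model = Word2Vec(sentences, vector_size=200, window=5, sg=1, min_count=1)
print(model.wv.most_similar("pain", topn=3))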

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains word embeddings on separate local context windows instead of on global co-occurrence counts, and this may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures global corpus statistics directly.

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al., 2014].

The main intuition behind GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_{ij} is the number of times word j occurs in the context of word i within a specific context window. X_i is the number of times any word occurs in the context of word i, i.e. Σ_k X_{ik}. Besides, P_{ij} = P(j|i) = X_{ij}/X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can study the ratio of their co-occurrence probabilities with different context words k to examine the relationships of these words. We can see that solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model with the cost function

\[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]

where V is the size of the vocabulary, f(X_{ij}) is a weighting function, w_i ∈ R^d is a word vector, w̃_j ∈ R^d is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another, i.e. X_{ij} = 0. Second, the function should be non-decreasing, in order to guarantee that frequent co-occurrences are weighted more highly than rare co-occurrences. Third, it should not overweight frequent co-occurrences. A weighting function that satisfies these properties and works well is the piecewise function

\[ f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \]

where x_{max} is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
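The weighting function is small enough to state in code. The sketch below is our own, with x_max and α fixed to the values above, and the printed values check the three properties:

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): zero for never-co-occurring pairs, capped
    at 1 for very frequent pairs."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(0))    # 0.0    -> no co-occurrence, no weight
print(glove_weight(10))   # ~0.178 -> non-decreasing in between
print(glove_weight(500))  # 1.0    -> frequent pairs capped, not overweighted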


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in NLP, they fail to capture polysemy (for the word "bank", is it a financial institution or the land alongside a river?), since they generate the same word vector regardless of context. Besides, these traditional word embedding methods usually use shallow models and only take the output of an unsupervised model for downstream tasks, while the model itself is discarded after training; as a result, the word vectors cannot be adjusted according to context. To address these issues, there is a new trend of moving from static to dynamic word embeddings. Dynamic word embeddings consider word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we cover two state-of-the-art dynamic word embedding methods, ELMo and BERT. Before introducing them, pre-training is briefly discussed, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers, because complicated functions can represent high-level abstractions [Erhan et al., 2010]. Erhan et al. [2010] also suggest that training deep architectures with standard schemes (random initialization) tends to give the parameters poor generalization, whereas the improvement from a pre-training strategy is impressive. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus of labeled or unlabeled data; the pre-trained model is then fine-tuned on a downstream task with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since not only does the model need to consider the complex characteristics of word use, but words also have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. Their experiments show ELMo works extremely well in practice, with up to 20% error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding for each word. In other words, each token is assigned a representation that is a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it. The backward pass is similar to the forward pass, except that its state contains information about the word itself and the context after it. The pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. For each LSTM, it passes the result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token t_k, an L-layer biLM computes 2L + 1 representations:

R_k = \left\{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \right\}

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in R_k into a single vector based on particular weights:

\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, \gamma^{task} is a scalar parameter, and h_{k,j}^{LM} = \left[ \overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM} \right].
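To make the layer combination concrete, a minimal sketch of the weighted sum above follows; this is our own illustrative implementation, not the authors' code, and elmo_task_representation is a name of our choosing:

import numpy as np

def elmo_task_representation(layer_states, s_task, gamma_task):
    """Collapse the per-layer representations h_0..h_L of one token into a
    single task-specific vector: gamma * sum_j softmax(s)_j * h_j."""
    s = np.asarray(s_task, dtype=float)
    weights = np.exp(s) / np.exp(s).sum()  # softmax-normalized s^task
    return gamma_task * sum(w * h for w, h in zip(weights, layer_states))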

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: one is the feature-based approach and the other is the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is introduced in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, segment embedding, and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to get a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the


fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains for a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another, and this relationship is hard to capture from language modeling directly without knowing the meaning of the two sentences. By using NSP, the pre-trained model becomes very beneficial for these tasks. This pre-trained BERT model is applied to the RQE and NLI tasks in this project. How we fine-tune it is introduced in the following chapters.
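A minimal sketch of this corruption procedure is given below, assuming for simplicity that positions are selected independently with probability 15% (the real implementation works over WordPiece tokens inside BERT's pre-training code, which we do not reproduce here):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of BERT's masked-LM corruption: select ~15% of positions;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets.append((i, tok))  # position and gold token to predict
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token
    return corrupted, targets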

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. Or, in plain English, it is to determine whether question A and question B are similar questions. This task is binary classification: it includes positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm  weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A entails Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us to find the similar questions related to the unanswered questions. This step can also provide useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification (three-class) problem: one class is entailment, one is neutral, and the other is contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRT's. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce work that uses RQE and NLI on medical data sets in order to find relations between sentences. In Section 3.3, we present work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, such applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE, and question answering (QA), and the target is the medical domain. In this section and the next, we discuss some of the RQE and NLI work from the competition teams.

The initial work for RQE uses lexical and semantic features as input to a supervised machine learning approach, then applies support vector machines (SVM), logistic regression, naïve Bayes, and J48, respectively, to obtain the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider context. Meanwhile, Zhu et al. [2019] also find that, among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism built on BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap, and the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, synthetic data similar to the original data are created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new and public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, an InferSent model, and an ESIM model to establish baselines. Among them, InferSent performs better than the other three methods. In the competition, as in the RQE task, the common approach for training is to use the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder, and feature encoder) with one softmax classifier. As the text encoder they use MT-DNN, which allows more powerful and general word representations. As the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder, they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes not only money but also time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data sets using weak supervision methods. The results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations. For example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality, and they can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects accuracy.

• The data set should not have any sensitive information, such as address, date of birth, email address, or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24 [1] is a good choice for construction. It is South Africa's leading consumer health website and provides

[1] Health24: https://www.health24.com


world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients that helps answer their questions. These questions and answers are both public, and they are continuously updated. Hence we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy [2], a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and other users' answers.
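As an illustration, a minimal Scrapy spider for this kind of crawl might look like the sketch below; the start URL and CSS selectors are assumptions for illustration, not Health24's real page structure (the actual spider used in this project is health.py in the health24 folder, described in Appendix 2):

import scrapy

class HealthSpider(scrapy.Spider):
    """Sketch of a Health24-style Q&A spider; the entry page and the
    selectors below are illustrative assumptions only."""
    name = "health"
    start_urls = ["https://www.health24.com/experts"]  # assumed entry page

    def parse(self, response):
        for qa in response.css("div.question-answer"):  # assumed selector
            yield {
                "question": qa.css("div.question::text").get(),
                "answer": qa.css("div.answer::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()  # assumed
        if next_page:
            yield response.follow(next_page, callback=self.parse)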

4.3 Analysis for Data set

A total of 25,705 question-answer pairs are crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My  month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together; they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

[2] Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained as well. We also describe the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be represented either by abbreviations or by multi-word expressions, but they stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25, and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a more suitable form for the following tasks.
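A minimal sketch of this pipeline with NLTK (assuming the punkt and stopwords resources have been downloaded via nltk.download) might be:

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question: str) -> list:
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return [stemmer.stem(tok) for tok in tokens if tok not in stop_words]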

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both


classification problems. The only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets when fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by Hugging Face [1].
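For illustration, a minimal sketch of this setup using the current Hugging Face transformers API is shown below; the project itself used a modified run_classifier.py, so treat the model identifier and training loop here as assumptions rather than the exact project code:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # num_labels=3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_epoch(batches, accumulation_steps=4):
    """batches yields (questions_a, questions_b, labels), batch size 16."""
    model.train()
    optimizer.zero_grad()
    for step, (q1, q2, labels) in enumerate(batches, start=1):
        enc = tokenizer(q1, q2, padding=True, truncation=True,
                        return_tensors="pt")
        loss = model(**enc, labels=torch.tensor(labels)).loss
        (loss / accumulation_steps).backward()  # gradient accumulation
        if step % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()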

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance

[1] Hugging Face GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the testing set is to pair one unanswered question with all answered questions. If there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we only take the top five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs. The detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: if an unanswered question and an answered question are similar, then the answer to the answered question has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of assigning labels by the maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us to handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
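A minimal sketch of this rejection rule (the function name is ours):

def accept_with_threshold(probs, threshold):
    """Reject ambiguous predictions: return the argmax label only when the
    top class probability clears the threshold (0.6 for RQE, 0.4 for NLI)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None  # None = rejected

# e.g. accept_with_threshold([0.55, 0.45], 0.6)  -> None (rejected)
#      accept_with_threshold([0.2, 0.3, 0.5], 0.4) -> 2 (accepted)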

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a corpus [Rajaraman and Ullman 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other


is inverse document frequency, which measures how much information the word provides; if the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF; here are the formulas:

\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)

\mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d))

\mathrm{IDF}(t, D) = \log \frac{N}{\mathrm{count}(d \in D : t \in d)}

where t is the word of interest, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF measures only one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF score.
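As an illustration, the filtering step could be sketched with scikit-learn as follows; note that scikit-learn's TfidfVectorizer uses a slightly different (smoothed) TF and IDF than the formulas above, so this is a hedged sketch rather than the project's exact scoring code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top5(unanswered, answered):
    """Score one unanswered question against all answered questions and
    return the indices of the five most similar ones."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([unanswered] + answered)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return scores.argsort()[::-1][:5]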

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a given document or query. Here is the formula:

\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{d_{avg}} \right)}

where

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document in the collection.

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.
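A minimal BM25 sketch follows; since the thesis does not specify the IDF variant, the common smoothed form with 0.5 correction is assumed here:

import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=2.0, b=0.75):
    """Score one document against a query. doc_freq maps a term to the
    number of documents containing it; n_docs is the collection size."""
    score = 0.0
    for q in query_terms:
        if q not in doc_freq:
            continue  # term never seen in the collection
        idf = math.log((n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5) + 1)
        tf = doc_terms.count(q)
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score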

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89B tokens) [Tilbury


2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between sentences.
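A minimal sketch of this sentence-level scoring with gensim is given below; the vector file name pubmed_w2v.bin is an assumption for illustration:

import numpy as np
from gensim.models import KeyedVectors

# Assumes the Chiu et al. PubMed vectors are available in word2vec binary format.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    """Sum the embeddings of in-vocabulary tokens; out-of-vocabulary words
    are skipped, which is exactly the limitation discussed in Chapter 6."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def sentence_similarity(tokens_a, tokens_b):
    a, b = sentence_vector(tokens_a), sentence_vector(tokens_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0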

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances belong to the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances belong to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). As for multi-class classification, the confusion matrix is more complex, but it has the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)   TP                    FP
Predicted Negative (0)   FN                    TN

Table 5.1: The binary confusion matrix
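These definitions translate directly into code; a minimal sketch:

def binary_metrics(tp, fp, fn, tn):
    """Accuracy and recall, computed from the confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)
    return accuracy, recall

# e.g. binary_metrics(tp=90, fp=5, fn=5, tn=100) -> (0.95, 0.947...)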


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model; step 2, use CHQs data to evaluate the model by k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate labels for unanswered questions and potential similar questions; step 4, pair the question pairs with the generated labels; step 5, add the question pairs with labels to CHQs and evaluate the model again, then compare the accuracy with the previous one. NLI follows the same process, except that the input for testing is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate p-values. This method helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since the value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

where

s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
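For illustration, the same test can be run with SciPy as sketched below, using the Table 6.1 accuracies. The thesis does not state whether a one-sided or two-sided p-value was used, so the printed value may not reproduce Table 6.2 exactly:

from scipy import stats

# Accuracies copied from Table 6.1 (5-fold cross-validation, RQE task).
chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# Unpaired two-sample t-test with pooled variance, as in the formula above.
t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)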

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the Word2Vec p-value is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model which we use for Word2Vec is trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all


formal, with fewer typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected. That is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9730          0.9720        0.9712

Table 6.1: The accuracy of different methods for the RQE task

          Added data number   p-value
TF-IDF    8,084               0.0380
BM25      7,390               0.0499
Word2Vec  1,830               0.0514

Table 6.2: The number of added data and p-values for the RQE task

Table 6.4 shows the number of added data points and the p-values for the NLI task. We find all p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of answered questions in the true question pairs from the RQE task have some probability of being inferable from their


relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the testing results as the base for selecting added data. For example, if the results show there are 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number for each class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with the binary classification problem, each NLI class does not have much data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods.

Combining Table 6.4 with Table 6.5, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the number of extra data is not large (the largest is only about 1,287). So when we add the extra data into the original data, it effectively adds novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, this small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of different methods for the NLI task

          Added data number   p-value
TF-IDF    1,287               0.3340
BM25      1,149               0.4191
Word2Vec  459                 0.6661

Table 6.4: The number of added data and p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending these open-domain techniques into the biomedical domain, such as patient question answering, raises many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and


TF-IDF methods in the RQE task. For NLI, the method is more limited because of the small size of the added data. However, it still plays a role in the NLI task.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data points becomes smaller. Besides, BERT is a complex model; it easily overfits when trained on a small amount of data. So for the next step, we can expand our data. Then it will be more persuasive to validate our idea.

• In the previous chapter we argued that the added data for the NLI task are different from the original NLI data. For the next step, we can find a more suitable data set with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association.

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379.

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375.

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155.

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250.

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE.

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174.

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054.

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660.

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018).

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text.

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019).

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE.

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014).

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63.

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171.

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019).

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109.

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018).

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018).

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018).

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020).

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426.

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388.

Appendices


Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchangxing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yulin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchangxing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2612 54509
Email: yulin@anu.edu.au


Appendix 2: Artefact Description

The project consists of three parts: one is the folders which store the source data and the generated data, one is the .py files, and the other is the .ipynb files.

For folders:

• data: includes the original data files and the data files generated by using RQE and NLI.

• health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy [1].

• raw_nli_data and raw_rqe_data store the source data.

• result stores the labels generated by testing the RQE and NLI models.

For .py files:

• bm25.py, tfidf.py, and word_embedding.py are three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

• rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

• convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface [3].

• load_file.py converts the raw source data to dataframes.

• pre_processed_data pre-processes the sentences.

For .ipynb files:

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder

• train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab [4].

[4] Colab: https://colab.research.google.com


Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell. If you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. label.csv files are the results of the generated labels for pairs. They can also be found in the result folder.


Contents

Acknowledgments

Abstract

1 Introduction
  1.1 Motivation
  1.2 Main Contribution
  1.3 Outline

2 Background
  2.1 Background on NLP
    2.1.1 Neural Network Language Model
    2.1.2 RNN and LSTM
    2.1.3 Transformer
  2.2 Word Embedding
  2.3 Static Word Embedding
    2.3.1 Word2Vec
    2.3.2 GloVe
  2.4 Dynamic Word Embedding
    2.4.1 Pre-training
    2.4.2 ELMo
    2.4.3 BERT
  2.5 Tasks
    2.5.1 Recognizing question entailment (RQE)
    2.5.2 Natural Language Inference (NLI)

3 Related Work
  3.1 RQE
  3.2 NLI
  3.3 Augment Data Size

4 Construction of Health24 Data set
  4.1 Motivation
  4.2 Collecting Data set
  4.3 Analysis for Data set

5 Methodology
  5.1 Pre-processing
  5.2 Fine-Tuning
  5.3 Method
  5.4 Similarity Scores Methods
    5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)
    5.4.2 Best Matching 25 (BM25)
    5.4.3 Word2Vec
  5.5 Evaluation Metrics

6 Discussion on Results

7 Conclusion and Future Work
  7.1 Conclusion
  7.2 Future Work

Bibliography

Appendices

Appendix 1: Project Description and Contract

Appendix 2: Artefact Description

Appendix 3: Readme File

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al. 2003]

2.2 A single memory cell [Graves et al. 2013]

2.3 The Transformer architecture, consisting of the encoder (left part) and the decoder (right part) [Vaswani et al. 2017]

2.4 Two different models that could be used as part of Word2Vec. The left hand side is CBOW and the right hand side is Skip-gram [Mikolov et al. 2013]

2.5 The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014]

2.6 ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. For each LSTM, it passes the result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

5.1 The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model by using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potential similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the question pairs with labels to CHQs and evaluate the model again, then compare the accuracy with the previous one. As for NLI, it has the same process except that the input for testing is question-answer pairs

1 The tree structure of the health24 file

List of Tables

4.1 The length of questions and answers

5.1 Table for the confusion matrix

6.1 Shows the accuracy in different methods for RQE task
6.2 Shows the added data number and p-values for RQE task
6.3 Shows the accuracy in different methods for NLI task
6.4 Shows the added data number and p-values for NLI task
6.5 Shows the standard deviation of loss for four methods

Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem on small medical data sets. In Section 1.2 the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which the system automatically provides a set of correct answers to the question asked by the users in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions are the second most searched thematic area in Google, and the topic took 5% of all searches in 2016 [Cocco et al. 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since Pew Research Center1 showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data set, limited annotated data, and whether the content is close to reality. There are some existing data sets which are related to this topic, but they all have some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al. 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], whereas it consists of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al. 2016] consists of more than 100,000 questions created by crowdsourcing a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific large-scale synthetic data set in the domain of electronic medical records [Pampari et al. 2018]. However, the synthetic data set performs weaker than the handcrafted data sets [Nguyen 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013/


On the other hand, the size of the medical data set is another concern, since a small data set could cause overfitting. Medical data sets tend to be locked for reasons of ethical and obligatory agreements. Also, cost constraints and the lack of domain experts for annotation are other reasons which cause small data sets [Nguyen et al. 2019]. In the natural language processing (NLP) field, a model is prone to overfitting when using small data sets, which leads to the Out-Of-Vocabulary (OOV) problem [Luong et al. 2014]. The OOV problem is that an unseen token is encountered in the fixed-vocabulary language model but the model cannot handle it appropriately [Nguyen et al. 2019]. In NLP tasks there is a high proportion of unseen words. If the data set is small, then the model cannot generalize very well due to the unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks by using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach differs from the older models, which apply task-specific architectures to task-specific labeled data. Also, transfer learning outperforms the older models, since pre-training can support better generalization [Erhan et al. 2010].

In this thesis we use the state-of-the-art language model BERT [Devlin et al. 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide the existing answers automatically if the original questions are similar to the answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check whether the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines whether the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the output of the fine-tuned RQE and NLI models when using the models to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the data will be rejected; otherwise we will accept them. This can help us to deal with some ambiguous cases and reduce the misclassification rates.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task. However, this method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows:

In Chapter 2 we provide some background about different word embedding models and introduce the BERT model.

In Chapter 3 related work about RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4 the motivation for the data set and its analysis are explained.

In Chapter 5 the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. Evaluation is illustrated as well.

In Chapter 6 we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7 we summarize the achievements as well as the limitations and show the future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a basic idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures in the next sections to help the readers have a better understanding of word embedding.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al. 2003]. The model is composed of four layers: input, projection, hidden and output layers respectively. At the input layer, N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer by using a shared projection matrix of size $V \times m$, where m is the number of feature dimensions associated with each word. After projection, each word is a row vector with dimensionality m. This is shown as $C(w_{t-n+1}), \ldots, C(w_{t-2})$ and $C(w_{t-1})$ in Figure 2.1.


Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al. 2003]

As for the hidden layer, it is the concatenation of word features from the previous layer:

$$x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))$$

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

The function $y_i$ gives the unnormalized log-probability for each output word i:

$$y = b + Wx + U \tanh(d + Hx)$$

where b and d are the biases, and H is the hidden layer weight matrix [Bengio et al. 2003].
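To make the computation concrete, the following is a minimal NumPy sketch of the NNLM output step; all sizes, weights and word indices are illustrative placeholders, not values from the thesis.

import numpy as np

rng = np.random.default_rng(0)
n_ctx, m, h, V = 3, 10, 16, 100    # context words, feature size, hidden size, vocabulary size
C = rng.normal(size=(V, m))         # shared projection matrix: one m-dimensional row per word
H = rng.normal(size=(h, n_ctx * m)); d = np.zeros(h)
U = rng.normal(size=(V, h)); W = rng.normal(size=(V, n_ctx * m)); b = np.zeros(V)

context = [5, 42, 7]                          # indices of the previous words
x = np.concatenate([C[w] for w in context])   # concatenated word features
y = b + W @ x + U @ np.tanh(d + H @ x)        # unnormalized log-probabilities
p = np.exp(y) / np.exp(y).sum()               # softmax: P(w_t | previous words)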


2.1.2 RNN and LSTM

Instead of the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons get information at time-step t depending on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of the node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces some special hidden units to store information for a long time: memory cells. They connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$$i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$

where σ is the sigmoid function. A single memory cell consists of an input gate, forget gate, output gate and cell activation vector, represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so element m in each gate vector only receives input from element m of the cell vector [Graves et al. 2013].

Within a memory cell, the LSTM computation can be divided into four steps (a code sketch follows the list):

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, the input gate determines what information we will update; then a new candidate $\tilde{c}_t$ is created and added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$ and the candidate state; namely, updating $c_t = f_t c_{t-1} + i_t \tilde{c}_t$.

• As for the output gate, use the sigmoid function to determine what part of the cell state will be outputted, then apply the tanh function to output the decided part.

Figure 2.2: A single memory cell [Graves et al. 2013]
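As a sketch of these four steps, the following NumPy code implements one memory-cell step following the composite functions above; the weights are random placeholders, and the diagonal cell-to-gate matrices are implemented as element-wise products.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 4  # illustrative hidden/input size
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d, d)) * 0.1 for k in
     ("xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho")}
w_ci, w_cf, w_co = (rng.normal(size=d) * 0.1 for _ in range(3))  # diagonal cell-to-gate weights
b_i = b_f = b_c = b_o = np.zeros(d)

def lstm_step(x_t, h_prev, c_prev):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + w_ci * c_prev + b_i)   # input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + w_cf * c_prev + b_f)   # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b_c)  # cell update
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + w_co * c_t + b_o)      # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

h, c = lstm_step(np.ones(d), np.zeros(d), np.zeros(d))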

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: one is the encoder and the other is the decoder. The encoder consists of a stack of N=6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. So the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder with a stack of N=6 identical layers, except that each layer has three sub-layers instead of two. The extra one is masked multi-head attention, which can prevent the model from seeing the content after the word being predicted [Vaswani et al. 2017].

Figure 2.3: The Transformer architecture, consisting of the encoder (left part) and the decoder (right part) [Vaswani et al. 2017]

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and find out which one should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying the weights with the input embedding. The attention function is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. The benefit is that this allows the model to learn subspace representations and helps the model focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model does not use recurrence or convolution.
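The following is a minimal NumPy sketch of the scaled dot-product attention formula above; Q, K and V are illustrative random matrices standing in for the projected queries, keys and values.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # relatedness between every pair of tokens
    return softmax(scores) @ V        # attention-weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8)); K = rng.normal(size=(5, 8)); V = rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)  # (5, 8): one output vector per token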

2.2 Word Embedding

When dealing with natural language, it is always problematic that text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually start by downloading a pre-trained embedding or using their own way to calculate the word embeddings for their own data sets. Word embedding can be separated into two kinds: one is static word embedding and the other is dynamic word embedding. In the remaining sections we will introduce several different word embedding approaches based on their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same word embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

Two architectures are introduced which can be used as part of Word2Vec to learn the word embedding: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target. This helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two architectures are shown in Figure 2.4.

Figure 2.4: Two different models that could be used as part of Word2Vec. The left hand side is CBOW and the right hand side is Skip-gram [Mikolov et al. 2013]

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers think that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, which may fail to use the huge amount of repetition in the data. With GloVe, the global corpus statistics can be captured directly.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established: X represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word j occurs in the context of word i within the specific context window. $X_i$ is the number of times any word occurs in the context of word i, equal to $\sum_k X_{ik}$. Besides, $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word j appears in the context of word i.

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014]

In Figure 2.5, suppose i = ice and j = steam. We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model with the cost function:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where V is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i$ is a word vector, $\tilde{w}_j$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, f(0) = 0 if one word doesn't co-occur with other words, namely $X_{ij} = 0$. Second, this function should be non-decreasing in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max}$ is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
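A minimal sketch of this weighting function with the stated hyper-parameters:

def glove_weight(x, x_max=100.0, alpha=0.75):
    # small weight for rare co-occurrences, capped at 1 for frequent ones
    return (x / x_max) ** alpha if x < x_max else 1.0

print(glove_weight(10))   # rare co-occurrence gets a small weight
print(glove_weight(500))  # frequent co-occurrence is capped at 1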


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example the word 'bank': does it mean a financial institution or the land alongside the river?), since they generate the same word vectors regardless of the contexts. Besides, these traditional word embedding methods usually use shallow models and only leverage the output from unsupervised models for downstream tasks, whereas the models are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding. Dynamic word embedding considers the word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we will introduce two state-of-the-art dynamic word embedding methods: one is ELMo and the other is BERT. Before introducing these two methods, pre-training will be briefly mentioned, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer to use deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions could represent high-level abstractions by applying these deep architectures [Erhan et al. 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, whereas the improvement is impressive for deep models using a pre-training strategy. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In the case of NLP, the basic idea of pre-training is that language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model not only needs to consider the complex characteristics of word use, but words also have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address the mentioned challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, with up to 20% error reductions compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding for each word. In other words, each token is assigned a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before this word. The backward pass is similar to the forward one, except it contains information about itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. For each LSTM, it passes the result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, those forward passes and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layer. The basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, there are 2L+1 representations in an L-layer biLM:

$$R_k = \left\{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \right\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R_k$ into a single vector based on some particular weights:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: one is the feature-based approach and the other is the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the current fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. The detailed information about the Transformer was introduced in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, segment embedding and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to get a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains for a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this is hard to capture from language modeling directly without knowing the relationship between the two sentences. By using NSP, the pre-trained model is very beneficial for these tasks. The introduced pre-trained BERT model will be applied to the RQE and NLI tasks in this project. As for how to fine-tune, we will introduce that in the following chapters.

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. Or, in plain English, it is to determine whether question A and question B are similar questions. This task is binary classification: it includes positive (two questions are similar) and negative (two questions are not similar) classes. Here are two examples from our data set:

• Example 1

• (Not answered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and or during pregnancy

• Question A → Question B

• Example 2

• (Not answered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey

• (Answered) Question B: I am diabetes 2 Type. Which is better: honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here can help us to find the similar questions related to the unanswered questions. Also, this step can provide some useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sentences, not just between questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification (three classes) problem: one class is entailment, one is neutral and the other is contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRT's. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and get muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3 we demonstrate some work about how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, the applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks: NLI, RQE and question answering (QA), and the target is the medical domain. In this section and the next, we discuss some work applied to RQE and NLI among the competition teams.

The initial work for RQE used lexical and semantic features as the input to supervised machine learning approaches, then applied support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to find the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map the words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism based on BERT models. So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap and the non-entailing examples are completely unrelated, so these cannot serve as strong negative examples in the training set. To address this issue, some synthetic data which are similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurred when Romanov and Shivade [2018] introduced MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish the baselines. Among them, InferSent performs better compared with the other methods. In the competition, as with the RQE task, the common approach for training is to use the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to have a better linguistic understanding. For the feature encoder they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it not only costs money but also takes time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results are sometimes too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data sets by using weak supervision methods. The results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and the probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way of data collection. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise, it will violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. This is South Africa's leading consumer health website.

1 Health24: https://www.health24.com


It can provide world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public and they are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy 2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and users' answers.
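As a sketch of what such a spider could look like (the project runs it with "scrapy crawl health"), the entry URL and CSS selectors below are hypothetical placeholders, not the ones used in the actual health24 project:

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/experts"]  # assumed entry page

    def parse(self, response):
        # the CSS selectors below are illustrative placeholders
        for qa in response.css("div.question-answer"):
            yield {
                "question": qa.css("div.question::text").get(),
                "answer": qa.css("div.answer::text").get(),
            }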

4.3 Analysis for Data set

A total of 25705 question-answer pairs are crawled from the Health24 website under 22 categories. Among them, 23755 questions have answers and 1950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together: they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                         Average Length   Maximum Length
Answered Questions       95.40            5417
Unanswered Questions     85.02            1890
Answers                  92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also mention the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and make the sentences contain as much information as possible (e.g. expand BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their bases or roots. Through these steps, the raw questions are transformed into a more suitable form for the following tasks. A sketch of this pipeline is shown below.
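The following is a minimal sketch of the described pipeline, assuming NLTK; the actual project scripts (tfidf.py, bm25.py, word_embedding.py) may implement the steps differently.

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(question):
    text = question.lower()                                           # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # 2. remove punctuation
    tokens = word_tokenize(text)                                      # 3. tokenize
    return [STEMMER.stem(t) for t in tokens                           # 5. stem
            if t not in STOP_WORDS]                                   # 4. remove stop words

print(preprocess("Is it safe to take allergex during pregnancy?"))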

5.2 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both


classification problems. The only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model with better generalization, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11232, 1395 and 1422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward the majority class.

Here we use SciBERT, a pre-trained model for scientific text, since the model improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface 1.
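The following is a minimal sketch of this fine-tuning setup with the stated hyper-parameters, using the huggingface transformers library; train_loader is an assumed data loader over tokenized question pairs, and the real run_classifier.py-based code differs in detail.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("scibert_scivocab_uncased")
model = BertForSequenceClassification.from_pretrained(
    "scibert_scivocab_uncased", num_labels=2)  # num_labels=3 for NLI

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch_size, grad_accum_steps, num_epochs = 16, 4, 1  # epochs=2 for NLI

model.train()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_loader):  # assumed loader of tokenized pairs
        outputs = model(input_ids=batch["input_ids"],
                        attention_mask=batch["attention_mask"],
                        token_type_ids=batch["token_type_ids"],
                        labels=batch["labels"])
        loss = outputs[0] / grad_accum_steps  # scale loss for gradient accumulation
        loss.backward()
        if (step + 1) % grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()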

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether these similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones: it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project we mimic this approach to test whether the generated data have high accuracy.

1 Huggingface GitHub: https://github.com/huggingface/transformers


can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 51 For RQE partWe use CHQs data to fine-tune the pre-trained BERT model Then we use the samedata to evaluate the model by using k-fold cross-validation and record the accu-racy We use the fine-tuned model to generate labels for the test set by insertingthe unanswered questions and the potential similar answered questions After thatwe add the generated question pairs with labels into the original CHQs data set tore-evaluate the model and see if the accuracy changes or not One intuitive way tobuild the testing set is to pair one unanswered question with all answered questionsIf there are 10 unanswered questions and 100 answered questions the question pairswill be 1000 However not each of them is similar To reduce the running time aswell as time complexity and improve efficiency we use three similarity scoring meth-ods (TF-IDF BM25 and Word2Vec) to calculate the similarity between questions Weonly take the top-five questions pairs This step can reduce the number of pairs in thetest set by filtering some obviously dissimilar questions pairs The detailed informa-tion is introduced in Section 54 Besides one thing that needs to be noted is that theadded data must be balanced Otherwise the model will bias toward the majorityclass and the result will not be accuracy As for NLI task the steps are similar toRQE task Here is an assumption We assume if the unanswered question and theanswered question are similar then the answer of the answered question has someprobability that can be inferred from the unanswered question Otherwise it cannotbe inferred from the unanswered question Thus the test set in NLI task is formedby the unanswered questions and their similar answered questionsrsquo answers

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method can help us to resolve some ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, this pair is rejected; otherwise we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
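A minimal sketch of this rejection rule, assuming probs holds the softmax output of the fine-tuned model for one pair:

import numpy as np

def accept_label(probs, threshold):
    # return the predicted label, or None if the pair should be rejected
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return None  # ambiguous case: reject the pair
    return int(probs.argmax())

print(accept_label([0.55, 0.45], threshold=0.6))     # None: rejected (RQE)
print(accept_label([0.2, 0.3, 0.5], threshold=0.4))  # 2: accepted (NLI)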

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman 2011]. It consists of two parts: one is the term frequency, which measures the number of times the word occurs in a particular document; the other is the inverse document frequency, which calculates how much information the word provides. If the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here are the formulas:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$
$$\mathrm{TF}(t, d) = \log(1 + freq(t, d))$$
$$\mathrm{IDF}(t, D) = \frac{N}{count(d \in D : t \in d)}$$

where t is the word we want to calculate, d is its document, D is the set of documents and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}$$

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF value.
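The following is a minimal sketch of ranking answered questions against an unanswered one with TF-IDF vectors and cosine similarity, using scikit-learn; the example sentences and variable names are illustrative, not taken from the project scripts.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["is allergex safe during pregnancy",
            "which is better honey or sugar for diabetes"]
unanswered = ["can i take allergex while pregnant"]

vectorizer = TfidfVectorizer()
answered_vecs = vectorizer.fit_transform(answered)
unanswered_vec = vectorizer.transform(unanswered)

# one similarity score per answered question; take the top-k as candidate pairs
scores = cosine_similarity(unanswered_vec, answered_vecs)[0]
print(scores.argsort()[::-1][:5])  # indices of the top-five candidates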

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular document or query. Here is the formula:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where (a code sketch follows the list):

• D is the document and Q is the query; $q_i$ is a term in the query Q

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document D

• |D| is the length of the document

• $d_{avg}$ is the average length of a document

• $k_1$ and b are both hyper-parameters; in general $k_1 = 2$, $b = 0.75$
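A minimal sketch of the BM25 score for one document, following the formula above; the IDF table is assumed to be precomputed from the corpus, and the values shown are illustrative.

import math

def bm25_score(query_terms, doc_terms, idf, d_avg, k1=2.0, b=0.75):
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)  # f(q_i, D): term frequency in the document
        denom = f + k1 * (1 - b + b * len(doc_terms) / d_avg)
        score += idf.get(q, 0.0) * f * (k1 + 1) / denom
    return score

doc = "is allergex safe during pregnancy".split()
idf = {"allergex": 2.3, "pregnancy": 1.7}  # illustrative precomputed values
print(bm25_score(["allergex", "pregnancy"], doc, idf, d_avg=8.0))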

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a way of static word embedding. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.8 billion tokens) [Tilbury 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
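A minimal sketch of this sentence similarity computation, assuming the embeddings are loaded with gensim; the embedding file name is a hypothetical placeholder.

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)  # hypothetical file

def sentence_vector(tokens):
    # sum the word vectors; words missing from the vocabulary are skipped
    vecs = [kv[w] for w in tokens if w in kv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def similarity(tokens_a, tokens_b):
    a, b = sentence_vector(tokens_a), sentence_vector(tokens_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0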

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align to the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). As for multi-category classification, the confusion matrix is more complex, but it has the same idea.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: Table for the confusion matrix
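A minimal sketch of computing these metrics with scikit-learn; the labels are illustrative.

from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))  # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)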


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model by using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potential similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the question pairs with labels to CHQs and evaluate the model again, then compare the accuracy with the previous one. As for NLI, it has the same process except that the input for testing is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods to compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result of the original data. In order to validate our approach, we calculate the p-value. This method can help us to determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis will be rejected and the alternative hypothesis will be accepted, since the value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-test needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

Unpaired t-test = (x̄_1 − x̄_2) / √( s² (1/n_1 + 1/n_2) )

where

s² = ( Σ_{i=1}^{n_1} (x_i − x̄_1)² + Σ_{j=1}^{n_2} (x_j − x̄_2)² ) / (n_1 + n_2 − 2)

From Table 6.2, we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis, where the generated data can increase the accuracy. Since we mimic the way of pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method's perspective: the pre-trained embedding model which we use in Word2Vec was created by training on PubMed. The sources in PubMed comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal with fewer typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors will lack important information, due to the reason that the words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority of the labels generated by the RQE model is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected. That is the reason why the added data for Word2Vec is the smallest. Furthermore, the small amount of added data might not help the model to fine-tune very well, compared with added data of a larger size.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not have much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9720          0.9746        0.9712

Table 6.1: Shows the accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: Shows the added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values of the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task the added data is more than 8000. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the testing results as the base when selecting added data (a minimal sketch of this selection is given below). For example, if the results show there are 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number for each class when generating added data in the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.
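As referenced above, a minimal sketch of this balancing rule, with an invented list of labelled question-answer pairs:

    from collections import Counter

    # Illustrative (question, answer, label) triples; labels are 0, 1, 2
    labelled_pairs = [("q1", "a1", 0), ("q2", "a2", 1), ("q3", "a3", 2),
                      ("q4", "a4", 0), ("q5", "a5", 0)]

    counts = Counter(label for _, _, label in labelled_pairs)
    base = min(counts.values())   # smallest class count, the base number

    balanced, taken = [], Counter()
    for pair in labelled_pairs:
        if taken[pair[2]] < base:     # keep at most `base` pairs per class
            balanced.append(pair)
            taken[pair[2]] += 1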

Those are the reasons why the added data for the NLI task are small.

Table 6.5 demonstrates the standard deviations of the loss over the four methods. When combining Table 6.4 with this, we find that the more added data there are, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So, as we add the extra data into the original data, it is equivalent to adding novel data or noise for the model. The estimator's ability would be affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, the small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: Shows the accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: Shows the added data numbers and p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: Shows the standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending these open domains into the biomedical domain, such as patient question answering, has many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find if two questions are similar, we can pair one unanswered question with all the answered questions. However, this can generate a massive amount of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us to roughly remove some dissimilar question pairs. Then we use NLI to validate if the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Due to the reason that we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback to evaluate. More specifically, we insert the constructed data into the official data to train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through implementing the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time for the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes smaller. Besides, BERT is a complex model, and it is easy to cause overfitting by using a small data set. So, for the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task is different from the original NLI data. For the next step, we can find more suitable data with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done by using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr. Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for these projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: one is the folders which store the source data and the generated data, one is the .py files, and the other is the .ipynb files.

For folders:

data: it includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy^1.

raw_nli_data and raw_rqe_data store the source data.

result stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf^2.

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface^3.

load_file.py converts the raw source data to a dataframe.

pre_processed_data.py pre-processes the sentences.

For .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab^4.

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: one is .py files and the other is .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; the .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second last cell. If you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. label.csv files are the results of generated labels for pairs. They can also be found in the result folder.


Contents

Acknowledgments iii

Abstract v

1 Introduction 1
  1.1 Motivation 1
  1.2 Main Contribution 2
  1.3 Outline 3

2 Background 5
  2.1 Background on NLP 5
    2.1.1 Neural Network Language Model 5
    2.1.2 RNN and LSTM 7
    2.1.3 Transformer 8
  2.2 Word Embedding 10
  2.3 Static Word Embedding 10
    2.3.1 Word2Vec 10
    2.3.2 GloVe 11
  2.4 Dynamic Word Embedding 13
    2.4.1 Pre-training 13
    2.4.2 ELMO 13
    2.4.3 BERT 15
  2.5 Tasks 16
    2.5.1 Recognizing question entailment (RQE) 16
    2.5.2 Natural Language Inference (NLI) 17

3 Related Work 19
  3.1 RQE 19
  3.2 NLI 20
  3.3 Augment Data Size 20

4 Construction of Health24 Data set 23
  4.1 Motivation 23
  4.2 Collecting Data set 24
  4.3 Analysis for Data set 24

5 Methodology 27
  5.1 Pre-processing 27
  5.2 Fine-Tuning 27
  5.3 Method 28
  5.4 Similarity Scores Methods 29
    5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF) 29
    5.4.2 Best Matching 25 (BM25) 30
    5.4.3 Word2Vec 30
  5.5 Evaluation Metrics 31

6 Discussion on Results 33

7 Conclusion and Future Work 37
  7.1 Conclusion 37
  7.2 Future Work 38

Bibliography 39

Appendices 43

Appendix 1 Project Description and Contract 45

Appendix 2 Artefact Description 49

Appendix 3 Readme File 51

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are input layer, projection layer, hidden layer and output layer [Bengio et al., 2003] 6

2.2 A single memory cell [Graves et al., 2013] 8

2.3 The Transformer architecture consists of the encoder (left part) and the decoder (right part) [Vaswani et al., 2017] 9

2.4 Two different models that could be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013] 11

2.5 The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al., 2014] 12

2.6 ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. For each LSTM, it passes the result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht] 14

5.1 The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI 32

1 The tree structure of the health24 folder 50

List of Tables

4.1 The length of questions and answers 25

5.1 Table for the confusion matrix 31

6.1 Shows the accuracy of the different methods for the RQE task 34
6.2 Shows the added data numbers and p-values for the RQE task 34
6.3 Shows the accuracy of the different methods for the NLI task 35
6.4 Shows the added data numbers and p-values for the NLI task 35
6.5 Shows the standard deviation of the loss for the four methods 36

Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem on small medical data sets. In Section 1.2, the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval, where the system can automatically provide a set of correct answers to the question asked by the users in natural language [Lende and Raghuwanshi, 2016]. This topic is widely researched in the open domain. Medical domain questions are the second most searched thematic area in Google, and the topic took 5% of all searches in 2016 [Cocco et al., 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center^1 showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data set, limited annotated data, and whether the content is close to reality. There are some existing data sets which are related to this topic, but they all contain some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al., 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade, 2018], whereas the data consist of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al., 2016] consists of 100,000+ questions created by crowdsourcing a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific large-scale synthetic data set in the domain of electronic medical records [Pampari et al., 2018]. However, synthetic data sets perform weaker than handcrafted data sets [Nguyen, 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013



On the other hand, the size of medical data sets is another concern, since a small data set could cause overfitting. Medical data sets tend to be locked due to ethical and obligatory agreements. Also, cost constraints and the lack of domain experts for annotation are other reasons that cause small data sets [Nguyen et al., 2019]. In the natural language processing (NLP) field, a model is prone to overfitting when using small data sets, which leads to the Out-Of-Vocabulary (OOV) problem [Luong et al., 2014]. The OOV problem is that an unseen token is encountered by the fixed-vocabulary language model, but the model cannot handle it appropriately [Nguyen et al., 2019]. In NLP tasks, unseen words occur frequently; if the data set is small, then the model cannot generalize very well due to the unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set, in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen, 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they do pre-training on generic tasks by using supervised or unsupervised methods, and then fine-tune on the specific downstream tasks [Chien and Kalita, 2020]. This approach is different from the older models, which apply a task-specific architecture to the task-specific labeled data. Also, transfer learning outperforms the older models, since pre-training can support better generalization [Erhan et al., 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al., 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide the existing answers automatically if the original questions are similar to answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check if the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model is to determine if the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking if the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the output of the fine-tuned RQE and NLI models when using the models to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the data will be rejected; otherwise, we will accept them. This can help us to deal with some ambiguous cases and reduce the misclassification rates.

• An evaluation of the RQE and NLI models is demonstrated by using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, this method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide some background about different word embedding models and introduce the BERT model.

In Chapter 3, related work about RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. Evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7, we summarize the achievements as well as the limitations, and show the future work.


Chapter 2

Background

In this chapter, we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a simple idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language, or model language. NLP is a broad area and includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech and dialogue. For this project, we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: input, projection, hidden and output layers respectively. At the input layer, N previous words are encoded by using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected into the projection layer by using a shared projection matrix of size V × m, where m is the size of the feature vector associated with each word. After projection, each word is a row vector with dimensionality m. This is shown as C(w_{t−n+1}), ..., C(w_{t−2}) and C(w_{t−1}) in Figure 2.1.



Figure 2.1: NNLM consists of four layers; from bottom to top they are input layer, projection layer, hidden layer and output layer [Bengio et al., 2003]

As for the hidden layer, it is the concatenation of the word features from the previous layer:

x = (C(w_{t−1}), C(w_{t−2}), ..., C(w_{t−n+1}))

and the vector goes through the hyperbolic tangent tanh. Following that, softmax in the output layer is used to calculate the probability of the next word given the previous words:

P(w_t | w_{t−1}, ..., w_{t−n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}

The function y_i gives the unnormalized log-probabilities for each output word i:

y = b + Wx + U tanh(d + Hx)

where b and d are the biases and H is the hidden layer weight matrix [Bengio et al., 2003].
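A minimal sketch of this forward computation in NumPy, with invented dimensions:

    import numpy as np

    V, m, h, n_prev = 1000, 50, 64, 3       # vocab size, features, hidden units, previous words
    x = np.random.randn(n_prev * m)         # concatenated C(w) feature vectors
    b, d = np.zeros(V), np.zeros(h)         # biases
    W = np.random.randn(V, n_prev * m)      # direct input-to-output connections
    U = np.random.randn(V, h)               # hidden-to-output weights
    H = np.random.randn(h, n_prev * m)      # hidden layer weights

    y = b + W @ x + U @ np.tanh(d + H @ x)  # unnormalized log-probabilities
    e = np.exp(y - y.max())
    p = e / e.sum()                         # softmax: P(w_t | previous words)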


2.1.2 RNN and LSTM

Instead of the traditional feed-forward neural network, where the information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons' information at time-step t depends on the output at time-step t−1. In other words, at each step, the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of the node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time. When the gradients pass back, they are calculated by continuous multiplications of derivatives; if the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces some special hidden units to store information for a long time: the memory cells. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} c_{t−1} + b_i)

f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} c_{t−1} + b_f)

c_t = f_t c_{t−1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c)

o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + W_{co} c_t + b_o)

h_t = o_t tanh(c_t)

where σ is the sigmoid function. A single memory cell consists of the input gate, forget gate, output gate and cell activation vectors, which can be represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so the input from element m can only be received by element m in each gate vector [Graves et al., 2013].

Within a memory cell, LSTM can be divided into four steps:

• Based on the current input and h_{t−1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise, the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, it determines what information we will update; then a new candidate c̃_t is created and added to the state.

• Update c_t by using the value in f_t, the old state, the value in i_t and the current state; namely, updating to f_t c_{t−1} + i_t c̃_t.

• As for the output gate, use the sigmoid function to determine what part of the cell state will be outputted, then apply the tanh function to output the decided part.

Figure 2.2: A single memory cell [Graves et al., 2013]
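A minimal sketch of one time-step implementing these composite functions in NumPy; the parameter shapes are illustrative, and the cell-to-gate weights are stored as vectors because, as noted above, those matrices are diagonal:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        # Gate equations as written above; p holds the weights and biases
        i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
        f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
        c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
        o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
        h = o * np.tanh(c)
        return h, c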

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems, such as language translation. The Transformer is another sequence transduction model, one which allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure, we can see the model is composed of two parts: one is the encoder and the other is the decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. So the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder: it has a stack of N = 6 identical layers. Instead of the two sub-layers in the encoder, the decoder has three; the extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer architecture consists of the encoder (left part) and the decoder (right part) [Vaswani et al., 2017]

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and find out which ones should receive more attention. For each token, three parameters (query, key and value) are generated during training by multiplying weights with the input embedding. The attention function is:

Attention(Q, K, V) = softmax( QK^T / √d_k ) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to d_k, d_k and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v} and W^O ∈ R^{h·d_v × d_model}. The benefit is that it allows the model to perform subspace representation learning and helps the model focus on different positions, so the final embedding is not just for the word anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
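A minimal sketch of the attention function above in NumPy, with random illustrative matrices:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # QK^T / sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V

    Q = np.random.randn(4, 8)   # 4 tokens, d_k = 8
    K = np.random.randn(4, 8)
    V = np.random.randn(4, 8)
    print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)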

2.2 Word Embedding

When dealing with natural language, it is always problematic that the text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing that they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominating approach to solve this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own methods to calculate the word embedding for their own data sets. Generally, word embedding can be separated into two kinds: one is static word embedding and the other is dynamic word embedding. In the remaining part, we will introduce several different word embedding approaches based on their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two introduced architectures, which can be used as part of Word2Vec to learn the word embedding, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is that it has high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected to the same position. Also, for the input layer, CBOW adds words from the future rather than only adding the words before the target. This helps the model to predict based on the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both models are shown in Figure 2.4.

Figure 2.4: Two different models that could be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013]
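As a concrete illustration, both variants can be trained with gensim (assuming the gensim 4 API); the toy corpus is invented, and the sg flag switches between CBOW (sg=0) and Skip-gram (sg=1), matching Figure 2.4:

    from gensim.models import Word2Vec

    corpus = [["patient", "asks", "about", "allergy"],
              ["doctor", "answers", "allergy", "question"]]

    cbow = Word2Vec(corpus, vector_size=100, window=2, sg=0, min_count=1)
    skipgram = Word2Vec(corpus, vector_size=100, window=2, sg=1, min_count=1)

    print(cbow.wv["allergy"][:5])                      # first 5 dimensions
    print(skipgram.wv.most_similar("allergy", topn=2)) # nearest neighbours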

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers think that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts. This may fail to use the huge amount of repetition in the data. As for GloVe, the global corpus statistics can be captured directly by its novel method.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_{ij} is the number of times word j occurs in the context of word i within the specific context windows. X_i is the number of times any word occurs in the context of word i, equal to Σ_k X_{ik}. Besides, P_{ij} = P(j|i) = X_{ij} / X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al., 2014]

study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is:

J = Σ_{i,j=1}^{V} f(X_{ij}) (w_i^T w̃_j + b_i + b̃_j − log X_{ij})²

where V is the size of the vocabulary, f(X_{ij}) is a weighting function, w_i is a word vector, w̃_j is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0 if one word does not co-occur with other words, namely X_{ij} = 0. Second, this function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

f(x) = (x / x_max)^α   if x < x_max
f(x) = 1               otherwise

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
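A minimal sketch of the weighting function and one term of the cost function above, with invented vectors and counts:

    import numpy as np

    def weight(x, x_max=100.0, alpha=0.75):
        # Piecewise weighting function from the text
        return (x / x_max) ** alpha if x < x_max else 1.0

    w_i = np.random.randn(50)   # word vector
    w_j = np.random.randn(50)   # context word vector
    b_i, b_j = 0.0, 0.0         # biases
    X_ij = 42.0                 # co-occurrence count of words i and j

    cost_term = weight(X_ij) * (w_i @ w_j + b_i + b_j - np.log(X_ij)) ** 2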


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside the river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output from unsupervised models for downstream tasks, whereas the models are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding. Dynamic word embedding considers the word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section, we cover two state-of-the-art dynamic word embedding methods: one is ELMo and the other is BERT. Before introducing these two methods, pre-training will be briefly mentioned, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer to use deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions through these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, while the improvement is impressive for deep models using a pre-training strategy. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that the language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small number of labeled examples.

2.4.2 ELMO

Learning high-quality word representations is challenging, since not only does the model need to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method, called Embeddings from Language Models (ELMo), to address the mentioned challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, with up to 20% error reductions compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding to each word. In other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together, and each of them has two passes: forward and backward. As for the forward pass, the internal state of a certain word contains the information about itself and the context before this word. The backward pass is similar to the forward one, except that it contains the information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. For each LSTM, it passes the result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, those forward passes and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. The basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token t_k, there are 2L+1 representations in an L-layer biLM:

R_k = { x_k^{LM}, h→_{k,j}^{LM}, h←_{k,j}^{LM} | j = 1, ..., L }

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in R into a single vector based on some particular weights:

ELMo_k^{task} = E(R_k; Θ^{task}) = γ^{task} Σ_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are the softmax-normalized weights, γ^{task} is a scalar parameter, and h_{k,j}^{LM} = [h→_{k,j}^{LM}; h←_{k,j}^{LM}].
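A minimal sketch of this weighted combination in NumPy, with random placeholder layer outputs:

    import numpy as np

    layers = [np.random.randn(1024) for _ in range(3)]  # h_{k,j} for j = 0..L
    s = np.random.randn(3)                              # raw task weights
    s = np.exp(s) / np.exp(s).sum()                     # softmax-normalize
    gamma = 1.0                                         # task scalar

    elmo_k = gamma * sum(w * h for w, h in zip(s, layers))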

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: one is the feature-based approach and the other is the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the current fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT adds [CLS] for every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, segment embedding and position embedding.
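As an illustration of this input packing, a minimal sketch with the huggingface transformers tokenizer (the model name and questions are illustrative, not the checkpoint used in this project):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    encoded = tokenizer("is allergex safe in pregnancy",
                        "can i take allergex while pregnant")

    # [CLS] question A [SEP] question B [SEP]
    print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
    print(encoded["token_type_ids"])  # segment embedding ids: 0s then 1s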

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to get a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains for a binarized NSP task in order to help the model to understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another, and this is hard to capture from language modeling directly without knowing the relation between the two sentences. By using NSP, the pre-trained model is very beneficial for these tasks. The introduced pre-trained BERT model will be applied to the RQE and NLI tasks in this project. As for how to fine-tune, we will introduce this in the following chapters.

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) captures a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. In plain English, it determines whether question A and question B are similar questions. The task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A entails Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better: Honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multi-label classification problem with three classes: entailment, neutral, and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task validates whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter, we introduce work related to our purpose and methods. We also explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2, we introduce work that uses RQE and NLI on medical data sets in order to find relations between sentences. In Section 3.3, we demonstrate work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; meanwhile, such applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE, and question answering (QA), all targeted at the medical domain. In this section and the next, we discuss some of the work on RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes, and J48 respectively to obtain the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight was that deep networks achieve better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] found that the results do not compare well with pre-trained language models. They think one reason is that traditional deep learning models typically use fixed pre-trained word embeddings to map words into vectors, and these embeddings do not consider context. Meanwhile, Zhu et al. [2019] also found that, among pre-trained language models, MT-DNN-based models perform better than BERT-based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data: for example, the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI came when Romanov and Shivade [2018] introduced MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They used a feature-based system, a Bag-of-Words (BOW) model, an InferSent model, and an ESIM model to establish baselines; among these, InferSent performed best. In the competition, as with the RQE task, the common approach was to train models from the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model consists of three encoders (a syntax encoder, a text encoder, and a feature encoder) with one softmax classifier. For the text encoder, they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder, they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder, they concatenate the results of a domain feature extractor and a generic feature extractor. This model achieved significant performance on the test data set.

3.3 Augmenting Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques for generating and labeling the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label a data set using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018]. They use an unsupervised key-phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate how the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes it easier to convince others. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality, and they can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model is prone to overfitting, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address, or phone number. Otherwise, it would violate patients' privacy.

Based on the features described above, we find that Health24¹ is a good choice for construction. It is South Africa's leading consumer health website and provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients which helps them get their questions answered. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

¹Health24: https://www.health24.com

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substituted words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see that the average lengths of the three different types are close together: they are all around 85 to 95 words per sentence. However, the maximum lengths differ greatly. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to the patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

²Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions          95.40            5417
Unanswered Questions        85.02            1890
Answers                     92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also present the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can appear either as abbreviations or as multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
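A minimal sketch of loading this model is shown below; the sample sentence is invented for illustration.

# A minimal sketch of loading the ScispaCy model named above
# (installed via its tar.gz release); the sample sentence is invented.
import spacy

nlp = spacy.load("en_ner_bionlp13cg_md")
doc = nlp("The patient reports high BP and a history of breast cancer.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # biomedical entities found by the NER model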

When we use TF-IDF, BM25, and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase. Then we remove any punctuation. Following that, we apply word tokenization to split the sentences into individual units. We also remove stop words. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
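These steps can be sketched as follows, using NLTK (which the project installs); the exact code is illustrative rather than the project's actual implementation.

# A sketch of the pre-processing pipeline described above, assuming
# nltk.download("punkt") and nltk.download("stopwords") have been run.
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question: str) -> list:
    text = question.lower()                                            # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # strip punctuation
    tokens = word_tokenize(text)                                       # tokenize
    tokens = [t for t in tokens if t not in stop_words]                # remove stop words
    return [stemmer.stem(t) for t in tokens]                           # stem to roots

print(preprocess("Is it safe to take Allergex while pregnant?"))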

5.2 Fine-Tuning

For downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so for classification problems we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

For fine-tuning the model on the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1395, and 1422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we follow the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified from run_classifier.py, which is provided by Hugging Face.¹

¹Hugging Face GitHub: https://github.com/huggingface/transformers
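A hedged sketch of this fine-tuning setup is given below; the SciBERT checkpoint name and the train_loader data pipeline are assumptions for illustration, not the project's actual code (which is modified from run_classifier.py).

# A sketch of the setup above: SciBERT, batch size 16, gradient
# accumulation 4, learning rate 2e-5. `train_loader` (batches of 16
# tokenized question pairs) is an assumed helper, not shown here.
from transformers import AdamW, BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 2 labels for RQE, 3 for NLI
optimizer = AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 4
model.train()
for step, batch in enumerate(train_loader):
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    token_type_ids=batch["token_type_ids"],
                    labels=batch["labels"])
    loss = outputs[0] / accumulation_steps   # scale the loss for accumulation
    loss.backward()
    if (step + 1) % accumulation_steps == 0:  # effective batch = 16 x 4
        optimizer.step()
        optimizer.zero_grad()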

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which help to discriminate relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set, feeding in the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with their labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all the answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1000 question pairs. However, not all of them are plausible. To reduce the running time as well as the time complexity, and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we only take the top five question pairs per unanswered question. This step reduces the number of pairs in the test set by filtering out obviously dissimilar question pairs; detailed information is given in Section 5.4, and a sketch of the step follows below. Besides, one thing to note is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar to the RQE task, under the following assumption: if an unanswered question and an answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.
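The filtering step can be summarized by the following sketch, where similarity stands for any of the three scoring methods and the function names are illustrative.

# A schematic sketch of the filtering step: pair each unanswered
# question only with its top-five most similar answered questions
# instead of with every answered question.
def candidate_pairs(unanswered, answered, similarity, k=5):
    pairs = []
    for uq in unanswered:
        ranked = sorted(answered, key=lambda aq: similarity(uq, aq), reverse=True)
        pairs.extend((uq, aq) for aq in ranked[:k])  # keep only the top-k candidates
    return pairs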

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set, we set thresholds on the output [Bishop 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications, respectively. This helps us handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
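A minimal sketch of this rejection rule, assuming the fine-tuned model outputs logits over the classes:

# Reject a prediction when its maximum class probability falls below
# the threshold: 0.6 for the binary RQE task, 0.4 for three-label NLI.
import torch

def predict_with_rejection(logits, threshold):
    probs = torch.softmax(logits, dim=-1)
    max_prob, label = probs.max(dim=-1)
    if max_prob.item() < threshold:
        return None          # reject: the pair is removed from the data set
    return label.item()      # accept the predicted label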

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection of documents (a corpus) [Rajaraman and Ullman 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other is inverse document frequency, which measures how much information the word provides; if the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF. Here is the formula:

$$\mathrm{TF\mbox{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

$$\mathrm{TF}(t, d) = \log\left(1 + \mathrm{freq}(t, d)\right)$$

$$\mathrm{IDF}(t, D) = \log \frac{N}{\mathrm{count}(d \in D : t \in d)}$$

where $t$ is the word we want to score, $d$ is its document, $D$ is the set of documents, and $N$ is the total number of documents. TF-IDF only scores a single word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where $A$ and $B$ are both vectors; in this case, each element of $A$ and $B$ represents one word's TF-IDF weight.
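As a concrete illustration, the sketch below scores question similarity with TF-IDF vectors and cosine similarity using scikit-learn; the project's own tfidf.py is based on gensim, so this is an equivalent sketch rather than the actual implementation.

# TF-IDF question similarity with scikit-learn (illustrative);
# the question strings are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["is allergex safe during pregnancy",
            "which is better honey or sugar for diabetes"]
unanswered = ["can i take allergex while pregnant"]

vectorizer = TfidfVectorizer()
answered_vecs = vectorizer.fit_transform(answered)       # one TF-IDF vector per question
query_vec = vectorizer.transform(unanswered)
scores = cosine_similarity(query_vec, answered_vecs)[0]  # similarity to each answered question
print(scores.argsort()[::-1])                            # indices ranked by similarity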

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular document or query. Here is the formula (a short code sketch follows the parameter list below):

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where

• $D$ is the document, $Q$ is the query, and $q_i$ is a term in the query $Q$

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document $D$

• $|D|$ is the length of the document

• $d_{avg}$ is the average length of a document

• $k_1$ and $b$ are both hyper-parameters; in general, $k_1 = 2$ and $b = 0.75$

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between sentences.
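A hedged sketch of this sentence-similarity step with gensim is shown below; the embedding file name is an assumption, not the actual release file.

# Summing pre-trained biomedical word vectors into sentence vectors,
# then comparing them with cosine similarity; the file name is assumed.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary words; OOV words are skipped,
    # which is the weakness discussed in Chapter 6.
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))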

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances belong to the positive class and are predicted as positive. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 but the predicted class is 0. TN means the instances belong to the negative class and are predicted as negative. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). For multi-class classification, the confusion matrix is more complex, but the idea is the same (a short computation sketch follows Table 5.1).

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)           TP                    FP
Predicted Negative (0)           FN                    TN

Table 5.1: The confusion matrix
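For illustration, these metrics can be computed as in the sketch below; scikit-learn is an assumed choice, and note that its confusion matrix is transposed relative to Table 5.1 (rows are actual classes).

# Computing accuracy, recall, and the confusion matrix on toy labels.
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted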


Figure 5.1: The diagram shows the process of the RQE and NLI tasks; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to the CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input consists of question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate p-values. This helps us determine the significance of the results in relation to the null hypothesis: if the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, a t-test needs to be calculated. Here we calculate an unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula for the unpaired t-test, followed by a short SciPy sketch:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
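As an illustrative check of this formula, the sketch below applies SciPy's pooled-variance (equal-variance) unpaired t-test to the five cross-validation accuracies from Table 6.1; whether it reproduces Table 6.2 exactly depends on details not stated in this report.

# Pooled-variance unpaired t-test over the RQE accuracies (Table 6.1).
from scipy import stats

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

t, p_two_sided = stats.ttest_ind(chqs, chqs_tfidf)  # equal_var=True by default
print(t, p_two_sided)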

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec is trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is thus a mismatch between the words in the pre-trained model and the words in the real questions, and some sentence vectors lack important information because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. In order to keep the data balanced, only a small part of them are selected. That is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help to fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682       0.9709         0.9693         0.9710
Iteration 2  0.9655       0.9731         0.9708         0.9699
Iteration 3  0.9676       0.9742         0.9747         0.9710
Iteration 4  0.9682       0.9692         0.9709         0.9726
Iteration 5  0.9731       0.9775         0.9743         0.9715
Average      0.9685       0.9720         0.9746         0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF           8084          0.0380
BM25             7390          0.0499
Word2Vec         1830          0.0514

Table 6.2: The added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs were added. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise, they will bias the model. In this case, we use the smallest per-class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has fewer data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining Table 6.4 with it, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So when we add the extra data to the original data, it effectively adds novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527        0.7601           0.7562           0.7453
Iteration 2  0.7522        0.7426           0.7426           0.7515
Iteration 3  0.7504        0.7469           0.7554           0.7540
Iteration 4  0.7402        0.7593           0.7536           0.7430
Iteration 5  0.7640        0.7657           0.7586           0.7569
Average      0.7523        0.7550           0.7533           0.7501

Table 6.3: The accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF           1287          0.3340
BM25             1149          0.4191
Word2Vec          459          0.6661

Table 6.4: The added data numbers and p-values for the NLI task


                     Standard deviation of loss
MedNLI                       0.01392
MedNLI + TF-IDF              0.02166
MedNLI + BM25                0.01726
MedNLI + Word2Vec            0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; within it, there are 23,771 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate answered questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: more specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1950 questions that do not have answers. Based on the two conditions applied in the NLI task, the amount of added data becomes smaller. Besides, BERT is a complex model, and it is easy to overfit with a small amount of data. So, for the next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous chapter, we argued that the added data for the NLI task differ from the original NLI data. For the next step, we can find more suitable data with a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for these projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: folders which store the source data and the generated data, .py files, and .ipynb files.

For folders:

data: includes the original data files and the generated data files produced using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy.¹

raw_nli_data and raw_rqe_data store the source data. result stores the labels generated by the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py, and word_embedding.py are the three files that filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified from ChineseSimilarity-gensim-tfidf.²

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface.³

load_file.py converts the raw source data to a dataframe; pre_processed_data pre-processes the sentences.

For .ipynb files:

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and performs k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified from run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab.⁴

⁴Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating, and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

Contents

Acknowledgments
Abstract
1 Introduction
  1.1 Motivation
  1.2 Main Contribution
  1.3 Outline
2 Background
  2.1 Background on NLP
    2.1.1 Neural Network Language Model
    2.1.2 RNN and LSTM
    2.1.3 Transformer
  2.2 Word Embedding
  2.3 Static Word Embedding
    2.3.1 Word2Vec
    2.3.2 GloVe
  2.4 Dynamic Word Embedding
    2.4.1 Pre-training
    2.4.2 ELMO
    2.4.3 BERT
  2.5 Tasks
    2.5.1 Recognizing question entailment (RQE)
    2.5.2 Natural Language Inference (NLI)
3 Related Work
  3.1 RQE
  3.2 NLI
  3.3 Augmenting Data Size
4 Construction of Health24 Data set
  4.1 Motivation
  4.2 Collecting Data set
  4.3 Analysis for Data set
5 Methodology 27
  5.1 Pre-processing 27
  5.2 Fine-Tuning 27
  5.3 Method 28
  5.4 Similarity Scores Methods 29
    5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF) 29
    5.4.2 Best Matching 25 (BM25) 30
    5.4.3 Word2Vec 30
  5.5 Evaluation Metrics 31
6 Discussion on Results 33
7 Conclusion and Future Work 37
  7.1 Conclusion 37
  7.2 Future Work 38
Bibliography 39
Appendices 43
Appendix 1 Project Description and Contract 45
Appendix 2 Artefact Description 49
Appendix 3 Readme File 51

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer, and output layer [Bengio et al. 2003] 6
2.2 A single memory cell [Graves et al. 2013] 8
2.3 A single memory cell consists of encoder (left part) and decoder (right part) [Vaswani et al. 2017] 9
2.4 Two different models that could be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013] 11
2.5 The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014] 12
2.6 ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector; the word representation is obtained by summing all the intermediate word vectors with particular weights [Josht] 14
5.1 The diagram shows the process of how the RQE and NLI tasks work 32
1 The tree structure of the health24 folder 50



List of Tables

4.1 The length of questions and answers 25
5.1 The confusion matrix 31
6.1 The accuracy of the different methods for the RQE task 34
6.2 The added data numbers and p-values for the RQE task 34
6.3 The accuracy of the different methods for the NLI task 35
6.4 The added data numbers and p-values for the NLI task 35
6.5 The standard deviation of the loss for the four methods 36



Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem caused by small medical data sets. In Section 1.2, the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which the system automatically provides a set of correct answers to a question asked by users in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions were the second most searched thematic area on Google, and the topic took 5% of all searches in 2016 [Cocco et al. 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since Pew Research Center¹ showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data sets, the limited annotated data, and whether the content is close to reality. There are some existing data sets related to this topic, but they all have some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al. 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], but the data consist of short and simple sentences that cannot mimic human behavior in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al. 2016] consists of more than 100,000 questions crowdsourced over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific, large-scale synthetic data set in the domain of electronic medical records [Pampari et al. 2018]. However, synthetic data sets perform weaker than handcrafted data sets [Nguyen 2019].

¹Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013



On the other hand, the size of a medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away for reasons of ethics and obligatory agreements. Cost constraints and the lack of domain experts for annotation are other reasons for small data sets [Nguyen et al. 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting and leads to the Out-Of-Vocabulary (OOV) problem [Luong et al. 2014]. The OOV problem occurs when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al. 2019]. In NLP tasks, unseen words occur frequently; if the data set is small, the model cannot generalize well over them, and overfitting occurs.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set, in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach differs from older models, which apply a task-specific architecture to task-specific labeled data. Also, transfer learning outperforms the older models since pre-training supports better generalization [Erhan et al. 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al. 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, the aim is to automatically provide existing answers when the original questions are similar to already-answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check if the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines if the answer to question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using Consumer Health Questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking if the answered questions are similar to unanswered questions, we apply BM25, TF-IDF, and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Thresholds set on the outputs of the fine-tuned RQE and NLI models when using them to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the data will be rejected; otherwise we accept them. This helps us deal with ambiguous cases and reduce the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide background on different word embedding models and introduce the BERT model.

In Chapter 3, related work on RQE, NLI, and data set augmentation is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. Evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7, we conclude by summarizing the achievements as well as the limitations, and we show the future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the reader understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the reader can have a simple idea of the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech, and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures in the next section to help the reader have a better understanding of word embedding.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of the NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al. 2003]. The model is composed of four layers: input, projection, hidden, and output. At the input layer, the N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected into the projection layer by using a shared projection matrix of size V × m, where m is the number of features associated with each word. After projection, each word is a row vector with dimensionality m. This is shown as C(w_{t−n+1}), ..., C(w_{t−2}), and C(w_{t−1}) in Figure 2.1.


Figure 2.1: The NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer, and output layer [Bengio et al. 2003]

As for the hidden layer, it is the concatenation of the word features from the previous layer:

x = (C(w_{t−1}), C(w_{t−2}), ..., C(w_{t−n+1}))

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

P(w_t | w_{t−1}, ..., w_{t−n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}

where y_i is the unnormalized log-probability for each output word i:

y = b + Wx + U tanh(d + Hx)

where b and d are the biases and H is the hidden layer weight matrix [Bengio et al. 2003].
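As a minimal sketch, the model above can be written in PyTorch roughly as follows; the layer sizes are illustrative, and this is not the original implementation of Bengio et al.:

    import torch
    import torch.nn as nn

    class NNLM(nn.Module):
        """Sketch of the NNLM equation y = b + Wx + U tanh(d + Hx)."""
        def __init__(self, vocab_size, m=100, n_context=3, hidden=128):
            super().__init__()
            self.C = nn.Embedding(vocab_size, m)           # shared projection matrix (V x m)
            self.H = nn.Linear(n_context * m, hidden)      # hidden layer; its bias plays the role of d
            self.U = nn.Linear(hidden, vocab_size, bias=False)
            self.W = nn.Linear(n_context * m, vocab_size)  # direct connection; its bias plays the role of b

        def forward(self, context_ids):                    # context_ids: (batch, n_context)
            x = self.C(context_ids).flatten(1)             # concatenate the n feature vectors
            y = self.W(x) + self.U(torch.tanh(self.H(x)))  # unnormalized log-probabilities
            return torch.log_softmax(y, dim=-1)            # log P(w_t | previous words)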


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons' output at time-step t depends on the output at time-step t−1. In other words, at each step the output of the current time-step depends on the previous one. An RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradients vanish or explode over many steps. The reason is that an RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, these multiplications may cause the gradients to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces some special hidden units to store information for a long time: memory cells. They connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} c_{t−1} + b_i)

f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} c_{t−1} + b_f)

c_t = f_t c_{t−1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c)

o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + W_{co} c_t + b_o)

h_t = o_t tanh(c_t)

where σ is the sigmoid function. A single memory cell consists of the input gate, forget gate, output gate, and cell activation vectors, represented by i, f, o, and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so element m of each gate vector only receives input from element m of the cell vector [Graves et al. 2013].

Within a memory cell, the LSTM computation can be divided into four steps:

• Based on the current input and h_{t−1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, the input gate determines what information we will update; then a new candidate c̃_t is created to be added to the state.

• Update c_t by using the value in f_t, the old state, the value in i_t, and the candidate state; namely, updating c_t = f_t c_{t−1} + i_t c̃_t.

• As for the output gate, a sigmoid function determines what part of the cell state will be outputted. Then a tanh function is applied to output the decided part. A minimal sketch of these computations is given below.

Figure 2.2: A single memory cell [Graves et al. 2013]

2.1.3 Transformer

RNNs and LSTMs are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: the encoder and the decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented by the sub-layer itself. The decoder has the same structure as the encoder, with a stack of N = 6 identical


layers. Instead of the two sub-layers in the encoder, the decoder has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al. 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017]

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and find out which one should receive more attention. For each token, three vectors (query, key, and value) are generated during training by multiplying the input embedding with learned weight matrices. The attention function is

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys, and values with learned linear projections to d_k, d_k, and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and W^O ∈ R^{h d_v × d_model}. The benefit is that it allows the model to learn subspace representations and helps the model focus on different positions. The final embedding is not just the word


anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
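A minimal sketch of scaled dot-product attention, assuming PyTorch tensors of shape (batch, heads, sequence, dimension):

    import math
    import torch

    def scaled_dot_product_attention(Q, K, V, mask=None):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.size(-1)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)      # relatedness of every token pair
        if mask is not None:                                   # e.g. the decoder's causal mask
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)                # how much attention to pay
        return weights @ V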

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters. However, computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert text into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own way to calculate word embeddings for their own data sets. Word embeddings can be separated into two categories: static word embeddings and dynamic word embeddings. In the remaining part we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

A static word embedding generates the same embedding vector for the same word no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

Two architectures are introduced which can be used as part of Word2Vec to learn the word embeddings: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared by all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target. This helps the model to predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two models are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013]
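As an illustration, both architectures are exposed through a single switch in the widely used gensim library; the toy corpus below is ours, not from the thesis:

    from gensim.models import Word2Vec  # assumes gensim >= 4.0

    sentences = [["patient", "asks", "about", "allergies"],
                 ["doctor", "answers", "the", "question"]]

    # sg=0 trains CBOW (predict the current word from its context);
    # sg=1 trains Skip-gram (predict the context from the current word).
    model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)

    vector = model.wv["patient"]                # 200-dimensional word vector
    similar = model.wv.most_similar("patient")  # nearest neighbours in the vector space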

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does well on analogy tasks, it only trains the word embeddings on separate local context windows instead of on global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures global corpus statistics directly.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_ij is the number of times word j occurs in the context of word i within the specific context windows. X_i is the number of times any word occurs in the context of word i, equally Σ_k X_ik. Besides, P_ij = P(j|i) = X_ij / X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words.

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014]

We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is

J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)²

where V is the size of the vocabulary, f(X_ij) is a weighting function, w_i is a word vector, w̃_j is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0, covering the case that one word does not co-occur with another, namely X_ij = 0. Second, the function should be non-decreasing in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties, the weighting function is a piecewise function:

f(x) = (x / x_max)^α if x < x_max, and 1 otherwise

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend from static word embedding to dynamic word embedding, which considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing these two methods, pre-training is briefly discussed, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al. 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, while the improvement is impressive for deep models using a pre-training strategy. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In the case of NLP, the basic idea of pre-training is that language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since not only does the model need to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, achieving up to 20% error reductions compared with other state-of-the-art models.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together, and each of them has two passes, forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word. The backward pass is similar, except the state contains information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht]

In addition, those forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token t_k, there are 2L + 1 representations in an L-layer biLM:

R_k = { x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} | j = 1, ..., L }

As for the final word representation used for downstream NLP tasks, ELMo collapses all layers in R_k into a single vector using particular weights:

ELMo_k^{task} = E(R_k; Θ^{task}) = γ^{task} Σ_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, γ^{task} is a scalar parameter, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}].

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer was introduced in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT adds [CLS] at the start of every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, segment embedding, and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For the masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the


fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the remaining 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another, and this is hard to capture from language modeling directly without modeling the relationship between the two sentences. With NSP, the pre-trained model becomes very beneficial for these tasks. This pre-trained BERT model will be applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
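The 80/10/10 masking rule can be sketched as follows; the token ids and the [MASK] id are placeholders, and this is an illustration rather than BERT's actual implementation:

    import random

    def mask_tokens(token_ids, vocab_size, mask_id, select_prob=0.15):
        """Corrupt a token sequence for masked-LM pre-training."""
        corrupted, targets = list(token_ids), []
        for pos, tok in enumerate(token_ids):
            if random.random() < select_prob:          # select ~15% of the positions
                targets.append((pos, tok))             # the model must predict `tok`
                r = random.random()
                if r < 0.8:
                    corrupted[pos] = mask_id           # 80%: replace with [MASK]
                elif r < 0.9:
                    corrupted[pos] = random.randrange(vocab_size)  # 10%: random token
                # else: 10% of the time keep the original token
        return corrupted, targets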

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) concerns a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. In plain English, it is to determine if question A and question B are similar questions. This task is binary classification: positive (two questions are similar) and negative (two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take Allergex? I'm weeks pregnant and allergies are killing me; my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to Allergex tablets. I have allergies and have managed to control them using Allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using Allergex preconceptionally and/or during pregnancy.

• Question A → Question B (Question A entails Question B)

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish, and honey.

• (Answered) Question B: I have Type 2 diabetes. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification problem with three classes: entailment, neutral, and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy, and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are healthy diets you can follow; DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes for my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it is obtained it is advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3 we demonstrate work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; these applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE, and question answering (QA), targeted at the medical domain. In this section and the next, we discuss some work on RQE and NLI from the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes, and J48 respectively to obtain the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight was that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets; for example, the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain came when Romanov and Shivade [2018] introduced MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model, and an ESIM model to establish baselines; among them, InferSent performs best. In the competition, as with the RQE task, the common approach is to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder, and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they concatenate the results of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the testing data set.

3.3 Augmenting the Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data sets using weak supervision methods, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight and combine the labels, producing probabilistic labels. On the other hand, a novel way to generate high-quality data in a specific domain


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the method of data collection. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it sets the standard in this domain and makes the work easier to trust. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses, or phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. It is South Africa's leading consumer health website and provides

1 Health24: https://www.health24.com


world-class information for a healthy lifestyle. Besides, there is an interactive tool through which patients can ask their questions and experts answer them. These questions and answers are both public and are continuously updated. Hence we use these questions and answers to construct our data set.

4.2 Collecting the Data Set

For collecting the data from the Health24 website, we use Scrapy2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatched or misleading content may exist between questions and other users' answers.

2 Scrapy: https://scrapy.org
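A minimal sketch of such a spider is shown below. The start URL and the CSS selectors are hypothetical placeholders, since the real Health24 page structure is not reproduced in this thesis:

    import scrapy

    class Health24Spider(scrapy.Spider):
        """Crawl expert Q&A pages; URL and selectors are hypothetical."""
        name = "health24_experts"
        start_urls = ["https://www.health24.com/experts/allergies"]  # hypothetical

        def parse(self, response):
            for qa in response.css("div.question-answer"):           # hypothetical selector
                yield {
                    "question": qa.css("p.question::text").get(),
                    "answer": qa.css("p.expert-answer::text").get(),
                }
            next_page = response.css("a.next::attr(href)").get()     # follow pagination
            if next_page:
                yield response.follow(next_page, callback=self.parse)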

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of substituting other words for this sensitive information, we leave the area blank, since substituted words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and/or desentisization (http://www.healthscout.com/...)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average length of the three different types is close together, around 85 to 95 words per item. However, the maximum length differs hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.
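As a hedged sketch of the hyperlink scrubbing described above (the patterns are illustrative, not the exact ones used):

    import re

    def scrub(text):
        """Drop hyperlinks (with any surrounding parentheses) and tidy whitespace."""
        text = re.sub(r"\(?\bhttps?://\S+\)?", "", text)   # remove http(s) links
        text = re.sub(r"\(?\bwww\.\S+\)?", "", text)       # remove bare www links
        return re.sub(r"\s{2,}", " ", text).strip()        # collapse leftover gaps

    print(scrub("manage it with antihistamines and/or desentisization "
                "(http://www.healthscout.com/...)"))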



                        Average Length   Maximum Length
Answered Questions           95.40            5417
Unanswered Questions         85.02            1890
Answers                      92.58            6441

Table 4.1: The length of questions and answers (in words)


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and about the three scoring methods for filtering the similar questions is explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g. expanding BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25, and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase, then remove any punctuation. Following that, we tokenize the words in order to split the sentences into individual units, and we remove the stop words. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
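A minimal sketch of this pipeline, assuming NLTK for tokenization, stop words, and stemming (the thesis does not name the exact toolkit):

    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    nltk.download("punkt"); nltk.download("stopwords")   # one-off resource downloads

    STOP = set(stopwords.words("english"))
    STEM = PorterStemmer()

    def preprocess(question):
        """Lowercase, strip punctuation, tokenize, drop stop words, stem."""
        text = question.lower().translate(str.maketrans("", "", string.punctuation))
        return [STEM.stem(tok) for tok in word_tokenize(text) if tok not in STOP]

    print(preprocess("Is it safe to take Allergex during pregnancy?"))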

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token, standing for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both


classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface1.

1 Huggingface GitHub: https://github.com/huggingface/transformers
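A condensed sketch of this setup with the huggingface transformers library is shown below; the checkpoint name and the toy question pair are assumptions, and num_labels would be 2 for RQE and 3 for NLI:

    import torch
    from torch.optim import AdamW
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    name = "allenai/scibert_scivocab_uncased"          # assumed SciBERT checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    optimizer = AdamW(model.parameters(), lr=2e-5)

    # Toy stand-in for a CHQs question pair (label 1 = entailment).
    enc = tokenizer(["Is Allergex safe in pregnancy?"],
                    ["Can I take Allergex while pregnant?"],
                    padding=True, truncation=True, max_length=128, return_tensors="pt")
    batches = [(enc, torch.tensor([1]))] * 8

    model.train()
    for epoch in range(1):                             # 1 epoch for RQE, 2 for NLI
        for step, (batch, y) in enumerate(batches):
            loss = model(**batch, labels=y).loss / 4   # gradient accumulation steps = 4
            loss.backward()
            if (step + 1) % 4 == 0:                    # optimizer update every 4 mini-batches
                optimizer.step()
                optimizer.zero_grad()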

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones; it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance



can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the testing set is to pair one unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, this yields 1,000 question pairs. However, not every pair is plausible. To reduce the running time and time complexity and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions and only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing to note is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: if an unanswered question and an answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of assigning the label with the maximum probability unconditionally, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, the rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
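The rejection rule can be sketched as follows (a minimal sketch; the model and batch follow the fine-tuning setup above):

    import torch

    def predict_with_reject(model, batch, threshold):
        """Return a label per pair, or None where max softmax prob < threshold.
        We use threshold = 0.6 for the binary RQE model and 0.4 for the
        three-label NLI model."""
        with torch.no_grad():
            probs = torch.softmax(model(**batch).logits, dim=-1)
        conf, labels = probs.max(dim=-1)
        return [int(l) if c >= threshold else None      # None marks a rejected pair
                for c, l in zip(conf.tolist(), labels.tolist())]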

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other is inverse document frequency, which calculates how much information the word provides; if the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count(d ∈ D : t ∈ d) )

where t is the word of interest, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences, we use cosine similarity:

similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

where A and B are both vectors; in this case, each element of A and B is one word's TF-IDF.
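A minimal sketch with scikit-learn; note that sklearn's TF-IDF variant differs slightly from the formulas above (e.g. in smoothing), but it serves the same filtering purpose:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    answered = ["is allergex safe during pregnancy",
                "which is better for diabetics honey or sugar"]
    unanswered = ["can i take allergex while pregnant"]

    vectorizer = TfidfVectorizer()
    A = vectorizer.fit_transform(answered)       # one TF-IDF row vector per question
    q = vectorizer.transform(unanswered)

    scores = cosine_similarity(q, A)[0]          # similarity to every answered question
    top5 = scores.argsort()[::-1][:5]            # keep only the top-five candidates
    print(list(zip(top5.tolist(), scores[top5])))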

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i, D) · [ f(q_i, D) · (k_1 + 1) ] / [ f(q_i, D) + k_1 · (1 − b + b · |D| / d_avg) ]

where

• D is the document and Q is the query; q_i is a term in the query Q;

• f(q_i, D) is the number of times q_i occurs in the document D;

• |D| is the length of the document;

• d_avg is the average length of a document;

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75. A small implementation sketch follows this list.
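A self-contained sketch of BM25 scoring over tokenized questions; the IDF below uses the common Robertson-style variant, which is an assumption since the thesis does not spell it out:

    import math
    from collections import Counter

    def bm25_scores(query_tokens, docs_tokens, k1=2.0, b=0.75):
        """Score each tokenized document against the query with Okapi BM25."""
        N = len(docs_tokens)
        d_avg = sum(len(d) for d in docs_tokens) / N
        df = Counter(t for d in docs_tokens for t in set(d))    # document frequencies
        scores = []
        for d in docs_tokens:
            tf = Counter(d)
            s = 0.0
            for q in query_tokens:
                if q not in tf:
                    continue
                idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
                s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / d_avg))
            scores.append(s)
        return scores

    docs = [["allergex", "pregnancy", "safe"], ["honey", "sugar", "diabetes"]]
    print(bm25_scores(["allergex", "pregnant"], docs))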

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
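A sketch of this sentence-similarity step with gensim; the vector file name is a placeholder for the Chiu et al. embeddings:

    import numpy as np
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

    def sentence_vector(tokens):
        """Sum the vectors of in-vocabulary tokens; OOV tokens are skipped."""
        vecs = [wv[t] for t in tokens if t in wv]
        return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    sim = cosine(sentence_vector(["allergex", "pregnancy"]),
                 sentence_vector(["allergex", "pregnant"]))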

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall, and the confusion matrix to measure the performance of the models. The confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the row represents the predicted class and the column represents the actual class. TP means the instances align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total, while recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). For multi-category classification, the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)            TP                    FP
Predicted Negative (0)            FN                    TN

Table 5.1: The confusion matrix
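These metrics can be computed with scikit-learn, for example:

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()   # binary case
    print("accuracy:", accuracy_score(y_true, y_pred))          # (TP + TN) / total
    print("recall:  ", recall_score(y_true, y_pred))            # TP / (TP + FN)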


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and its potential similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This method helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = (x̄_1 − x̄_2) / √( s² (1/n_1 + 1/n_2) )

where

s² = [ Σ_{i=1}^{n_1} (x_i − x̄_1)² + Σ_{j=1}^{n_2} (x_j − x̄_2)² ] / (n_1 + n_2 − 2)
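For reference, this test can be run with SciPy over the per-iteration accuracies from Table 6.1 (a sketch; alternative="greater" requires SciPy >= 1.6 and matches the one-sided alternative hypothesis, so the value may differ from a two-sided computation):

    from scipy import stats

    chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # original CHQs accuracies
    chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]   # CHQs + TF-IDF accuracies

    # Unpaired (independent, pooled-variance) t-test of "added data improves accuracy".
    t, p = stats.ttest_ind(chqs_tfidf, chqs, alternative="greater")
    print(t, p)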

From Table 6.2 we can see only the results of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives.

• From the scoring method perspective: the pre-trained embedding model we use in Word2Vec is trained on PubMed. The PubMed source comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all


formal with fewer typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors will lack important information, because the words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority label the RQE model generates is 0; in other words, most of the question pairs are judged not similar. In order to keep the added data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs    CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682      0.9709          0.9693          0.9710
Iteration 2   0.9655      0.9731          0.9708          0.9699
Iteration 3   0.9676      0.9742          0.9747          0.9710
Iteration 4   0.9682      0.9692          0.9709          0.9726
Iteration 5   0.9731      0.9775          0.9743          0.9715
Average       0.9685      0.9730          0.9720          0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF          8084            0.0380
BM25            7390            0.0499
Word2Vec        1830            0.0514

Table 6.2: The added data number and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data numbered more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number for each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with this, we find that the more added data there are, the larger the value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So as we add the extra data into the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, this small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 6.3: Accuracy of the different methods for the NLI task.

          added data number  p-value
TF-IDF    1287               0.3340
BM25      1149               0.4191
Word2Vec  459                0.6661

Table 6.4: Number of added data and p-values for the NLI task.


                  Standard deviation of loss
MedNLI            0.01392
MedNLI+TF-IDF     0.02166
MedNLI+BM25       0.01726
MedNLI+Word2Vec   0.01351

Table 6.5: Standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, raises many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and to construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; it contains 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions that are similar to the unanswered questions. Intuitively, to find whether two questions are similar, we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step roughly removes some dissimilar question pairs (a sketch of this filtering follows). Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.
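As a concrete illustration of the filtering step, here is a minimal sketch of scoring answered questions against one unanswered question with TF-IDF (via scikit-learn) and BM25 (via the rank_bm25 package, which is an assumption here; the thesis implements its own tfidf.py and bm25.py). The toy questions are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi

answered = [
    "Is it safe to take allergex during pregnancy?",
    "Which is better for a diabetic, honey or sugar?",
]
unanswered = "Can I take allergex while pregnant?"

# TF-IDF: embed all questions, rank answered ones by cosine similarity.
vec = TfidfVectorizer()
matrix = vec.fit_transform(answered + [unanswered])
tfidf_scores = cosine_similarity(matrix[-1:], matrix[:-1]).ravel()

# BM25: rank the answered questions against the unanswered query.
bm25 = BM25Okapi([q.lower().split() for q in answered])
bm25_scores = bm25.get_scores(unanswered.lower().split())

# Keep only the top-k candidates per unanswered question for the RQE model.
top_k = 1
best = sorted(range(len(answered)), key=lambda i: -tfidf_scores[i])[:top_k]
print([answered[i] for i in best], bm25_scores)
```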

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected (a minimal sketch of this rule follows). Through the experiments we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, the added data still play a role.
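The rejection rule can be sketched as follows, assuming the fine-tuned model outputs class logits (the function and variable names are illustrative):

```python
import numpy as np

def predict_with_rejection(logits: np.ndarray, threshold: float):
    """Return the predicted label, or None if the model is not confident enough."""
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    probs = exp / exp.sum()
    label = int(probs.argmax())
    return label if probs.max() >= threshold else None

print(predict_with_rejection(np.array([2.1, 0.3]), threshold=0.6))  # confident -> 0
print(predict_with_rejection(np.array([0.9, 0.8]), threshold=0.6))  # ambiguous -> None
```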

7.2 Future Work

Because of the limited time for the project, there are many places that need improvement:

• For this project we only have 1,950 questions without answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on a small data set. So for the next step we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section we noted that the added data for the NLI task differ from the original NLI data. For the next step we can find more suitable data with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step we can do hyperparameter optimization; we can also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders that store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface³.

load_file.py converts the raw source data to a dataframe; pre_processed_data.py pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

Figure 1: The tree structure of the health24 folder

The .py files run in PyCharm; the .ipynb files run in Google Colab⁴.

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the models:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-to-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.

• Acknowledgments
• Abstract
• Contents
• Introduction
  • Motivation
  • Main Contribution
  • Outline
• Background
  • Background on NLP
    • Neural Network Language Model
    • RNN and LSTM
    • Transformer
  • Word Embedding
  • Static Word Embedding
    • Word2Vec
    • GloVe
  • Dynamic Word Embedding
    • Pre-training
    • ELMO
    • BERT
  • Tasks
    • Recognizing question entailment (RQE)
    • Natural Language Inference (NLI)
• Related Work
  • RQE
  • NLI
  • Augment Data Size
• Construction of Health24 Data set
  • Motivation
  • Collecting Data set
  • Analysis for Data set
• Methodology
  • Pre-processing
  • Fine-Tuning
  • Method
  • Similarity Scores Methods
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Best Matching 25 (BM25)
    • Word2Vec
  • Evaluation Metrics
• Discussion on Results
• Conclusion and Future Work
  • Conclusion
  • Future Work
• Bibliography
• Appendices
• Appendix 1 Project Description and Contract
• Appendix 2 Artefact Description
• Appendix 3 Readme File

List of Figures

2.1 NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al. 2003] . . . . 6
2.2 A single memory cell [Graves et al. 2013] . . . . 8
2.3 The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017] . . . . 9
2.4 Two different models that can be used as part of Word2Vec; the left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013] . . . . 11
2.5 The measurement of co-occurrence probabilities for the selected context words and the target words ice and steam [Pennington et al. 2014] . . . . 12
2.6 ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht] . . . . 14
5.1 The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model; step 2, use CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate labels for unanswered questions and potential similar questions; step 4, pair question pairs with generated labels; step 5, add the question pairs with labels to CHQs, evaluate the model again and compare the accuracy with the previous one. NLI follows the same process, except that the input for testing is question-answer pairs . . . . 32
1 The tree structure of the health24 folder . . . . 50

List of Tables

4.1 The length of questions and answers . . . . 25
5.1 Table for the confusion matrix . . . . 31
6.1 Accuracy of the different methods for the RQE task . . . . 34
6.2 Number of added data and p-values for the RQE task . . . . 34
6.3 Accuracy of the different methods for the NLI task . . . . 35
6.4 Number of added data and p-values for the NLI task . . . . 35
6.5 Standard deviation of the loss for the four methods . . . . 36

Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem caused by small medical data sets. In Section 1.2 the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which the system automatically provides a set of correct answers to a question asked by a user in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions were the second most searched thematic area on Google, taking about 5% of all searches in 2016 [Cocco et al. 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center¹ showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data sets, limited annotated data, and whether the content is close to reality. Some existing data sets are related to this topic, but they all have some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al. 2018] is a large data set with high quality; it serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], whereas the data consist of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al. 2016] consists of more than 100,000 questions crowdsourced over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific, large-scale synthetic data set in the domain of electronic medical records [Pampari et al. 2018]; however, synthetic data sets perform weaker than handcrafted data sets [Nguyen 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013/



On the other hand, the size of a medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away for ethical reasons and obligatory agreements. Cost constraints and the lack of domain experts for annotation are other causes of small data sets [Nguyen et al. 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting and leads to the Out-Of-Vocabulary (OOV) problem [Luong et al. 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al. 2019]. In NLP tasks unseen words occur frequently; if the data set is small, the model cannot generalize well to them, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set in order to help researchers investigate patient question answering. The data set is based on Health24, a medicine-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE and NLI use transfer learning. Basically, they pre-train on generic tasks using supervised or unsupervised methods, and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach differs from older models, which apply a task-specific architecture to task-specific labeled data. Transfer learning also outperforms the older models, since pre-training supports better generalization [Erhan et al. 2010].

In this thesis we use the state-of-the-art language model BERT [Devlin et al. 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, the aim is to automatically provide existing answers when the original questions are similar to already-answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs that can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check whether the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, suppose that in Health24, through the fine-tuned RQE model, we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines whether the answer to question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using Consumer Health Questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter out some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the outputs of the fine-tuned RQE and NLI models when using them to locate similar questions and validate the answers. If the maximal probability is lower than the threshold, the case is rejected; otherwise it is accepted. This helps us deal with ambiguous cases and reduces the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2 we provide some background on different word embedding models and introduce the BERT model.

In Chapter 3 related work on RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4 the motivation for the data set and its analysis are explained.

In Chapter 5 the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data; the evaluation is illustrated as well.

In Chapter 6 we demonstrate and discuss the results of the fine-tuned BERT models.

Chapter 7 is the conclusion chapter, where we summarize the achievements as well as the limitations, and present future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help readers understand this thesis. Besides, the meaning and importance of RQE and NLI are introduced as well, so that readers can get a basic idea of the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data; in other words, it studies how a computer can understand or model human language. NLP is a broad area with many branches; the most commonly researched tasks are syntax, semantics, discourse, speech and dialogue. This project mainly focuses on semantics. Meanwhile, language models are a crucial part of NLP, because they describe the relationships between words. We introduce three different language model architectures to help readers better understand word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of the NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al. 2003]. The model is composed of four layers: input, projection, hidden and output layers. At the input layer, the N previous words are encoded using one-hot encoding; each word is a V-dimensional vector, where V is the size of the vocabulary. The input layer is projected onto the projection layer using a shared projection matrix of size V x m, where m is the dimension of the feature vector associated with each word. After projection, each word is a row vector of dimensionality m; this is shown as C(w_{t-n+1}), ..., C(w_{t-2}), C(w_{t-1}) in Figure 2.1.



Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al. 2003].

As for the hidden layer, it is the concatenation of the word features from the previous layer:

x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}

Here y_i is the unnormalized log-probability for each output word i:

y = b + Wx + U \tanh(d + Hx)

where b and d are the biases and H is the hidden layer weight [Bengio et al. 2003].
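To make these formulas concrete, here is a minimal numpy sketch of the NNLM scoring step; the dimensions are toy values and this is only an illustration of the equations above, not the thesis's code:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 10, 4, 3, 8          # vocab size, feature dim, context size, hidden units

C = rng.normal(size=(V, m))       # shared projection matrix (one m-dim row per word)
H = rng.normal(size=(h, n * m))   # hidden layer weights
U = rng.normal(size=(V, h))       # hidden-to-output weights
W = rng.normal(size=(V, n * m))   # direct input-to-output weights
b, d = np.zeros(V), np.zeros(h)   # output and hidden biases

context = [1, 5, 2]               # indices of the n previous words
x = np.concatenate([C[w] for w in context])   # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
y = b + W @ x + U @ np.tanh(d + H @ x)        # unnormalized log-probabilities
p = np.exp(y - y.max()); p /= p.sum()         # softmax over the vocabulary
print(p.argmax(), p.sum())        # predicted next-word index; probabilities sum to 1
```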


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons' outputs at time-step t depend on the output at time-step t-1. In other words, at each step the current output depends on the previous one. RNNs accept sequential inputs, and ideally they can learn not only from the current state of a node but also from the previous states. However, the gradients have a tendency to vanish or explode over many steps. The reason is that an RNN is trained by backpropagation through time: when the gradients are passed back, they are computed by continuous multiplication of derivatives, and if the values of these derivatives are too small or too large, the products may explode or vanish. Besides, there is evidence showing that it is difficult for a traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It achieves considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units, called memory cells, to store information over long periods; they connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions (a code sketch of these equations follows the list below):

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)
h_t = o_t \tanh(c_t)

where \sigma is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate and a cell activation vector, represented by i, f, o and c respectively. The weight matrices W from the cell to the gate vectors are diagonal, so element m of each gate vector only receives input from element m of the cell vector [Graves et al. 2013].

Within a memory cell, the LSTM computation can be divided into four steps:

• Based on the current input and h_{t-1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information is forgotten; otherwise it is remembered.

• The next step determines what new information is stored in the cell state. There are two parts: first, i_t determines which values will be updated; then a new candidate cell value \tilde{c}_t is created to be added to the state.

• Update c_t using the value of f_t, the old state, the value of i_t and the candidate state; namely, c_t = f_t c_{t-1} + i_t \tilde{c}_t.

• As for the output gate, a sigmoid function determines what part of the cell state will be output; then a tanh function is applied to produce the decided part.

Figure 2.2: A single memory cell [Graves et al. 2013]
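The following is a minimal numpy sketch of one LSTM step implementing the composite functions above; the weight shapes are toy values, and the peephole weights W_ci, W_cf, W_co are stored as vectors because the corresponding matrices are diagonal:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the composite functions above."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
dx, dh = 3, 4                      # toy input and hidden sizes
p = {k: rng.normal(size=(dh, dx)) for k in ["W_xi", "W_xf", "W_xc", "W_xo"]}
p.update({k: rng.normal(size=(dh, dh)) for k in ["W_hi", "W_hf", "W_hc", "W_ho"]})
p.update({k: rng.normal(size=dh)
          for k in ["w_ci", "w_cf", "w_co", "b_i", "b_f", "b_c", "b_o"]})
h, c = lstm_step(rng.normal(size=dx), np.zeros(dh), np.zeros(dh), p)
print(h.shape, c.shape)            # (4,) (4,)
```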

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization; it achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise, fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. The decoder has the same structure as the encoder, with a stack of N = 6 identical layers; instead of two sub-layers, however, the decoder has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al. 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and to find out which ones should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying learned weight matrices with the input embedding. The attention function is

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where d_k is the dimension of the queries and keys. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values h times, with learned linear projections, to d_k, d_k and d_v dimensions respectively:

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O
\quad \text{where } \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q \in \mathbb{R}^{d_{model} \times d_k}, W_i^K \in \mathbb{R}^{d_{model} \times d_k}, W_i^V \in \mathbb{R}^{d_{model} \times d_v} and W^O \in \mathbb{R}^{h d_v \times d_{model}}. The benefit is that multi-head attention allows the model to learn from different representation subspaces and helps it focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model uses no recurrence or convolution (a numpy sketch of the attention function follows).
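Below is a minimal numpy sketch of the single-head, scaled dot-product attention function above; multi-head attention simply runs h of these in parallel on linearly projected inputs and concatenates the results. The shapes are toy values:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # token-to-token relatedness
    return softmax(scores) @ V        # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))
print(attention(Q, K, V).shape)       # (5, 8): one attended vector per token
```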

2.2 Word Embedding

When dealing with natural language, a recurring problem is that text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation can impact model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by computing word embeddings for their own data sets. Word embeddings can be separated into two kinds: static word embeddings and dynamic word embeddings. In the remaining part we introduce several word embedding approaches, ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that words can have multiple degrees of similarity.

Two architectures are introduced that can be used as part of Word2Vec to learn word embeddings: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, whose target is to predict the current word based on the context, Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared among all words; thus all words are projected into the same position. Also, at the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two architectures are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec; the left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013].
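As a usage illustration, a Word2Vec model can be trained in a few lines with gensim, the library the project lists among its requirements; the toy corpus and the gensim 4.x parameter names below are assumptions:

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence per list entry.
sentences = [
    ["bladder", "infection", "causes", "burning", "urine"],
    ["doctor", "checks", "for", "bladder", "infection"],
    ["honey", "or", "sugar", "for", "diabetes"],
]

# sg=1 selects Skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv["bladder"].shape)          # (50,) embedding vector
print(model.wv.most_similar("bladder"))   # nearest neighbours in embedding space
```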

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers observe that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on the global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. GloVe instead captures global corpus statistics directly in a novel way.

The main intuition behind GloVe is the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_{ij} is the number of times word j occurs in the context of word i within the specific context window. X_i = \sum_k X_{ik} is the number of times any word occurs in the context of word i. Besides, P_{ij} = P(j \mid i) = X_{ij} / X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam.

Figure 2.5: The measurement of co-occurrence probabilities for the selected context words and the target words ice and steam [Pennington et al. 2014]

We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see that solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model with the cost function

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where V is the size of the vocabulary, f(X_{ij}) is a weighting function, w_i \in \mathbb{R}^d is a word vector, \tilde{w}_j \in \mathbb{R}^d is a context word vector, and b_i and \tilde{b}_j are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another (X_{ij} = 0). Second, the function should be non-decreasing, to guarantee that frequent co-occurrences are weighted more heavily than rare ones. Third, it should not overweight very frequent co-occurrences. A weighting function that satisfies these properties and works well is the piecewise function

f(x) = (x / x_{max})^\alpha \text{ if } x < x_{max}, \text{ and } 1 \text{ otherwise}

where x_{max} is 100 and \alpha is 3/4. More detailed information can be found in Pennington et al. [2014].
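To make the cost function concrete, here is a minimal numpy sketch that evaluates f(x) and the weighted least-squares objective J on a toy co-occurrence matrix; it is an illustration of the formulas above, not the GloVe training code:

```python
import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: caps the influence of frequent co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

rng = np.random.default_rng(0)
V, d = 6, 5                               # toy vocabulary size and embedding dim
X = rng.integers(0, 50, size=(V, V))      # toy co-occurrence counts X_ij
W = rng.normal(scale=0.1, size=(V, d))    # word vectors w_i
Wc = rng.normal(scale=0.1, size=(V, d))   # context word vectors w~_j
b, bc = np.zeros(V), np.zeros(V)          # biases b_i and b~_j

mask = X > 0                              # log X_ij is only defined where X_ij > 0
diff = W @ Wc.T + b[:, None] + bc[None, :] - np.log(np.where(mask, X, 1))
J = np.sum(f(X) * np.where(mask, diff, 0.0) ** 2)
print(J)                                  # scalar objective to be minimized
```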


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', is it a financial institution or the land alongside a river?), since they generate the same word vector regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, while the models themselves are discarded after training. This means the word vectors cannot be adjusted based on context. To address these issues, there is a new trend going from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly discussed, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures such as neural networks with many hidden layers, since complicated functions representing high-level abstractions can be realized by these deep architectures [Erhan et al. 2010]. Erhan et al. [2010] also suggest that training deep architectures with standard schemes (random initialization) tends to give the parameters poor generalization, while the improvement from a pre-training strategy is impressive for deep models. Nowadays, in many types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In the case of NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus with labeled or unlabeled data; the pre-trained model is then fine-tuned for a downstream task with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: the model needs to consider the complex characteristics of word use, and words have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show that ELMo works extremely well in practice, with up to 20% error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding for each word; in other words, each token's representation is a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way to embed from language models. The two layers are stacked, and each has two passes, forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it; the backward pass is similar, except that it contains information about the word itself and the context after it. Each pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers; basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token t_k, an L-layer biLM computes a set of 2L + 1 representations:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \}

As for the final word representation used for downstream NLP tasks, ELMo collapses all layers in R_k into a single vector based on particular weights:

\text{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, \gamma^{task} is a scalar parameter, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}].
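The collapsing step can be sketched in a few lines of numpy: given the 2L + 1 layer representations for one token, the task-specific ELMo vector is a gamma-scaled, softmax-weighted sum. The shapes are toy values, and in practice s and gamma would be learned during the downstream task:

```python
import numpy as np

rng = np.random.default_rng(0)
L, dim = 2, 6                        # two biLM layers, toy representation size

# h[0] is the token layer x_k; h[1..L] concatenate forward and backward states.
h = rng.normal(size=(L + 1, dim))    # h_{k,j}^{LM} for j = 0..L

s_raw = rng.normal(size=L + 1)                 # learned scalars, one per layer
s = np.exp(s_raw) / np.exp(s_raw).sum()        # softmax-normalized weights s_j
gamma = 1.0                                    # learned task scaling gamma

elmo_k = gamma * (s[:, None] * h).sum(axis=0)  # ELMo_k^{task}
print(elmo_k.shape)                  # (6,)
```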

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it simply concatenates the left-to-right and right-to-left information, so the representation cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it computes attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer was introduced in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT adds [CLS] at the start of every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows a bidirectional pre-trained model, a downside is that a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces a selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the remaining 10% of the time (a sketch of this rule follows). On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this relationship is hard to capture from language modeling directly without understanding both sentences. Through NSP, the pre-trained model becomes very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
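The 80/10/10 replacement rule for the selected tokens can be sketched as follows; the token strings and the tiny vocabulary are illustrative, not BERT's actual WordPiece vocabulary:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT's masked-LM corruption; returns corrupted tokens and targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # token not selected for prediction
        targets[i] = tok                      # the model must recover the original
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets

tokens = "my urine is burning and my bladder feels full".split()
print(mask_tokens(tokens, vocab=["doctor", "pain", "water"]))
```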

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) captures a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. In plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A entails Question B.

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 type. Which is better, honey or sugar?

• Question A does not entail Question B.

RQE is the first step in our project; it is used here to help us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps. The sketch below illustrates how a fine-tuned model scores one such question pair.
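The following hedged sketch uses the huggingface transformers library with its recent 4.x-style interface; the thesis itself used the older run_classifier.py scripts, and the checkpoint path below is a placeholder, not a real model name:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path: a BERT checkpoint fine-tuned for 2-class RQE.
checkpoint = "path/to/fine-tuned-rqe-bert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
model.eval()

q_unanswered = "Is it safe to take allergex while pregnant?"
q_answered = "What are the dangers of using allergex during pregnancy?"

# BERT receives the pair as [CLS] q1 [SEP] q2 [SEP].
inputs = tokenizer(q_unanswered, q_answered, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze().tolist()
print(f"P(not entail) = {probs[0]:.3f}, P(entail) = {probs[1]:.3f}")
```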


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification problem with three classes: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it is obtained, it is advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce work related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that applies RQE and NLI to medical data sets in order to find relations between sentences. In Section 3.3 we demonstrate work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; such applications could help improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), all targeting the medical domain. In this section and the next, we discuss some of the RQE and NLI approaches used by the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively [Abacha and Demner-Fushman 2016]. In the competition, the highlight was that deep networks reached better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find that they do not perform very well compared with pre-trained language models; they think one reason is that traditional deep learning models typically use fixed pre-trained word embeddings to map words into vectors, and these embeddings do not consider context. Meanwhile, Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models; MT-DNN therefore models pairwise text classification tasks better than the original BERT models across different tasks. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap, while the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI came when Romanov and Shivade [2018] introduced MedNLI, a new public data set that reduces the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, an InferSent model and an ESIM model to establish baselines; among them, InferSent performs best. In the competition, as in the RQE task, the common approach was to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. Its basic model consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations; for the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding; for the feature encoder they concatenate the results of a domain feature extractor and a generic feature extractor. This model achieved significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it is both costly and time-consuming. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label a data set using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is prone to overfitting, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it will violate patients' privacy.

Based on the features described above, we find that Health24¹ is a good choice for construction. It is South Africa's leading consumer health website and provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

¹Health24: https://www.health24.com

4.2 Collecting the Data Set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional, whereas some mismatch or misleading content may exist between questions and other users' answers.
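As an illustration, the sketch below shows what such a spider looks like; the start URL and CSS selectors are hypothetical placeholders, not the real page structure (the actual spider lives in health.py, see Appendix 2).

import scrapy

class HealthSpider(scrapy.Spider):
    # Sketch of a Health24 question-answer spider; entry point and
    # selectors below are assumptions about the page layout.
    name = "health"
    start_urls = ["https://www.health24.com/Experts"]  # hypothetical entry point

    def parse(self, response):
        for qa in response.css("div.question-answer"):  # hypothetical selector
            yield {
                "question": qa.css("div.question::text").get(),
                "answer": qa.css("div.expert-answer::text").get(),
            }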

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and/or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see the average lengths of the three different types are close together; they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

²Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, the details of fine-tuning and the three scoring methods for filtering similar questions are explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
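As a concrete illustration, the sketch below wires ScispaCy's AbbreviationDetector (0.2.x API, matching the version installed in Appendix 3) into this model; note the detector only resolves short forms whose long form is defined somewhere in the input, so this is an assumption about how the expansion step could look rather than the project's exact code.

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_ner_bionlp13cg_md")
nlp.add_pipe(AbbreviationDetector(nlp))  # scispacy 0.2.x / spaCy v2 style

def expand(text):
    doc = nlp(text)
    expanded = text
    for short_form in doc._.abbreviations:
        # Replace each detected short form with its long form,
        # e.g. BP -> blood pressure
        expanded = expanded.replace(str(short_form), str(short_form._.long_form))
    return expanded

print(expand("High blood pressure (BP) was reported and BP medication was given."))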

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their base or root forms. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
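A minimal sketch of this pipeline with NLTK (an assumed library choice; the tokenizer and stemmer are interchangeable):

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer model
nltk.download("stopwords", quiet=True)  # stop-word list

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(question):
    text = question.lower()                                           # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = word_tokenize(text)                                      # split into units
    tokens = [t for t in tokens if t not in stop_words]               # remove stop words
    return [stemmer.stem(t) for t in tokens]                          # reduce to roots

print(preprocess("My son was diagnosed with some allergies a few weeks ago."))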

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sentence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are 8,890, 302 and 230 medical question pairs in the training, validation and test sets, respectively, created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 question pairs for the training, validation and test sets, respectively, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward the majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we train for 1 and 2 epochs for the RQE and NLI tasks, respectively. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.
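A hedged sketch of this setup with a recent transformers API is shown below; the toy training pair and the AdamW optimizer choice are assumptions (the project's actual loop is the modified run_classifier.py), and batching via a DataLoader of size 16 is omitted for brevity.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = BertForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)   # 2 labels for RQE, 3 for NLI

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate 2e-5
accumulation_steps = 4                                      # gradient accumulation

# Toy question pair standing in for the CHQs fine-tuning set (an assumption).
train_pairs = [("Can BP pills cause a cough?",
                "Does blood pressure medication cause coughing?", 1)]

model.train()
for step, (q1, q2, label) in enumerate(train_pairs):
    enc = tokenizer(q1, q2, truncation=True, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([label])).loss
    (loss / accumulation_steps).backward()   # accumulate gradients over 4 steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()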

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

¹Huggingface GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation, and we record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not all of them are similar. To reduce the running time and time complexity and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the details are introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: we assume that if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferred from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning labels at the maximum probabilities, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications, respectively. This method can help us to handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, the rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
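A minimal sketch of this rejection rule (the example logits are made up):

import torch

def label_with_threshold(logits, threshold):
    # Softmax over the classifier output; reject the pair when even the
    # most likely class falls below the threshold (0.6 for RQE, 0.4 for NLI).
    probs = torch.softmax(logits, dim=-1)
    max_prob, label = probs.max(dim=-1)
    return label.item() if max_prob.item() >= threshold else None

print(label_with_threshold(torch.tensor([0.2, 1.9]), 0.6))       # RQE pair: accepted as 1
print(label_with_threshold(torch.tensor([0.5, 0.6, 0.4]), 0.4))  # NLI pair: rejected (None)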

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document; the other is inverse document frequency, which calculates how much information the word provides. If the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here are the formulas:

\[ \mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \]

\[ \mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d)) \]

\[ \mathrm{IDF}(t, D) = \frac{N}{\mathrm{count}(d \in D : t \in d)} \]

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

\[ \mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|} \]

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF score.
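A sketch of this filtering step with scikit-learn (an assumed library choice; the toy questions stand in for the pre-processed Health24 questions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["chest pain after exercise", "allergy medication for a child",
            "high blood pressure diet", "trouble sleeping at night",
            "persistent dry cough causes", "lower back pain relief"]
unanswered = ["my child has an allergy, which medicine is safe"]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(answered)   # one TF-IDF vector per answered question
U = vectorizer.transform(unanswered)     # same vocabulary for unanswered questions

sims = cosine_similarity(U, A)           # cosine similarity between sentence vectors
top5 = sims.argsort(axis=1)[:, -5:]      # keep only the five closest answered questions
print(top5)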

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query. Here is the formula:

\[ \mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)} \]

where:

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_{avg} is the average length of a document.

• k_1 and b are both hyper-parameters. In general, k_1 = 2, b = 0.75.
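A direct implementation of this score is sketched below; since the formula above leaves IDF(q_i, D) undefined, we assume the common smoothed Okapi IDF, and an off-the-shelf package such as rank_bm25 would work equally well.

import math

def bm25(query, doc, corpus, k1=2.0, b=0.75):
    # corpus: list of tokenized documents; doc is one of them; query is tokenized.
    d_avg = sum(len(d) for d in corpus) / len(corpus)   # average document length
    score = 0.0
    for q in query:
        n_q = sum(1 for d in corpus if q in d)          # documents containing q
        idf = math.log((len(corpus) - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed Okapi IDF
        f = doc.count(q)                                # term frequency in the document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

corpus = [["chest", "pain"], ["allergy", "child", "medicine"], ["blood", "pressure"]]
print(bm25(["child", "allergy"], corpus[1], corpus))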

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89B tokens) [Tilbury, 2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between the sentences.
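A sketch of this scoring step with gensim is below; the vector file path is hypothetical, standing in for the Chiu et al. [2016] PubMed vectors in word2vec binary format.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained 200-dimensional PubMed vectors.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the embeddings of in-vocabulary words; out-of-vocabulary words are
    # skipped, which is exactly the information loss discussed in Chapter 6.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.sum(known, axis=0) if known else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

score = cosine(sentence_vector(["child", "allergy"]),
               sentence_vector(["allergy", "medicine", "child"]))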

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). As for multi-class classification, the confusion matrix is more complex, but it has the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)   TP                    FP
Predicted Negative (0)   FN                    TN

Table 5.1: The confusion matrix
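A minimal sketch of these metrics with scikit-learn (an assumed library choice; note sklearn's confusion_matrix puts actual classes on the rows, transposed relative to Table 5.1):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # known labels
y_pred = [1, 0, 0, 1, 0, 1]   # classifier output

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted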


Figure 5.1: The diagram shows how the RQE and NLI tasks work. The upper part is RQE and the bottom part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model by k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potential similar question) pair. Step 4, combine the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This method can help us to determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \]

where

\[ s^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_j - \bar{x}_2)^2}{n_1 + n_2 - 2} \]
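As a sketch, SciPy's unpaired two-sample t-test (equal variances, matching the pooled s² above) implements this test; the inputs here are the CHQs and CHQs + TF-IDF accuracies from Table 6.1.

from scipy import stats

original = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # CHQs accuracies
augmented = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF accuracies

t_stat, p_value = stats.ttest_ind(original, augmented)  # pooled-variance t-test
print(p_value)  # compare against the 0.05 significance level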

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value of Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal with fewer typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected. That is the reason why the added data for Word2Vec are the smallest. Furthermore, small added data might not help fine-tune the model as well as large added data would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not have much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added data and the p-values for the RQE task

Table 6.4 shows the number of added data and the p-values for the NLI task. We find all the values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number more than 8,000. There are two conditions when we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base when selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number for each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class of NLI does not get many data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the number of extra data is not large (the largest is only about 1,287). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, a small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added data and the p-values for the NLI task


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude the project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, poses many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: more specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers, and based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit on a small amount of data. So for the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous chapter, we argued that the added data for the NLI task differ from the original NLI data. For the next step, we can find more suitable data whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B., Shivade, C., and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I., Lo, K., and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A., Rungta, R., Route, J., Nyberg, E., and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G., Nie, J.-Y., Gao, J., and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B., Crichton, G., Korhonen, A., and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M., Zordan, R., Taylor, D. M., Weiland, T. J., Dilley, S. J., Kant, J., Dombagolla, M., Hendarto, A., Lai, F., and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A., Mohamed, A.-r., and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B., Srinivasan, A., Chaudhary, A., Route, J., Mitamura, T., and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T., Chen, K., Corrado, G., and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M., King, D., Beltagy, I., and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V., Karimi, S., and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A., Raghavan, P., Liang, J., and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J., Socher, R., and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M., Ruder, S., and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., Gatford, M., et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S., Li, Y., Du, N., Wu, X., Xie, Y., Ge, S., Yang, T., Wang, K., Liang, X., and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z., Song, Y., Huang, S., Tian, Y., and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W., Zhou, X., Wang, K., Luo, X., Li, X., Ni, Y., and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Descriptionand Contract


From: Zhenchang Xing <zhenchangxing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yulin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchangxing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yulin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the .py files; and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl the data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by applying the RQE and NLI models to the test sets.

For .py files:

bm25.py, tfidf.py and word_embedding.py are three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface³.

load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.

For .ipynb files:

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab⁴.

⁴Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


Chapter 1

Introduction

In this section motivation is be introduced in terms of different existing data setsand the overfitting problem based on small medical data set In Section 12 the maincontributions is briefly discussed following by the outline of the thesis

11 Motivation

Question answering (QA) is a downstream task in information retrieval that thesystem can automatically provide a set of correct answers to the question askedby the users in natural language [Lende and Raghuwanshi 2016] This topic iswidely researched in the open-domain Medical domain questions are the secondmost searched thematic area in Google and the topics take 5 of all the others in2016 [Cocco et al 2018] Detecting the relations between patientsrsquo questions and ex-pertsrsquo answers has practical meaning since Pew Research Center1 showed that morethan 35 American adults had gone online to figure out what medical conditionsthey might have

However patient question answering is less studied This domain has many chal-lenges such as the quality and the size of the data set limited annotated data andif the content close to reality There are some existing data sets which are related tothis topic but they all contain some limitations Stanford Natural Language Inference(SNLI) data set [Gururangan et al 2018] is a large data set with high quality It servesa benchmark to evaluate natural language inference systems [Romanov and Shivade2018] whereas the data is consisted of short and simple sentences that cannot mimicthe human behaviour in reality Stanford Question Answering Dataset (SQuAD) [Ra-jpurkar et al 2016] consistes of more than 100000+ questions by crowdsourcing aset of Wikipedia articles where the answers are the segments of texts from the para-graphs Although this data set has high quality and leads to outstanding results itfocuses on the open-domain instead of a specific domain I2b2 is the domain-specificlarge-scale synthetic data set in the domain of electronic medical records [Pampariet al 2018] However the synthetic data set performs weaker than the handcrafteddata sets [Nguyen 2019]

1Pew Research Center httpswwwpewresearchorginternet20130115health-online-2013

1

2 Introduction

On the other hand the size of medical set is another concern since the smalldata set could cause overfitting The medical data sets tend to be locked due tothe reason of ethical and obligatory agreement Also cost constraints and lack ofthe domain experts for annotation are another reasons which cause the small dataset [Nguyen et al 2019] In natural language processing (NLP) field the modelis prone to overfitting by using the small data sets and lead to Out-Of-Vocabulary(OOV) problem [Luong et al 2014] OOV problem is that an unseen token en-countered in the fixed vocabulary language model but the model cannot handle itappropriately [Nguyen et al 2019] In NLP tasks there are high accuracy of unseenwords If the data set is small then the model cannot generalize very well due to theunseen words Then overfitting will occur

To address the above issues we use recognize question entailment (RQE) andnatural language inference (NLI) to generate and expand high-quality data set inorder to help the researcher to investigate patient question answering The dataset is based on Health 24 a medical specific forums so it can mimic real-worldbiomedical question answers The constructed data set can be used in the data-drivenapproaches [Nguyen 2019]

The current top-performing systems for RQENLI use transfer learning Basi-cally they do pre-training on the generic tasks by using supervised or unsupervisedmethods and then use fine-tune to the specific downstream tasks [Chien and Kalita2020] This approach is different to the older models which apply task-specific ar-chitecture on the task-specific labeled data Also transfer learning outperforms theolder models since the pre-training can support better generalization [Erhan et al2010]

In this thesis we used state-of-the-art language model BERT [Devlin et al 2018]to study question entailment and natural language inference in order to crowdsourcemore data Generally speaking providing the existing answers automatically if theoriginal questions are similar to the answered consumer medical questions

12 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in thedomain of patient question answering In general we use fine-tuned RQE model tocheck if the answered questions are similar to the unanswered questions in Health24Then we use fine-tuned NLI model to validate whether these similar answered ques-tionsrsquo answers can be inferred from their unanswered questions For example inHealth 24 through fine-tuned RQE model we know the unanswered question Ais similar to the answered question B Fine-tuned NLI model is to determine if theanswer of question B can be inferred from question A Since we do not have groundtruth we use the idea of pseudo-relevance and significance test to evaluate thesesteps

Our contributions can be summarized as follows

bull A collection of 25721 medical question-answer pairs (includes unanswered

sect13 Outline 3

questions) collected from the trusted source such as Health24 websites

bull A study of BERT model apply to RQE and NLI Using Consumer health ques-tions (CHQs) and MedNLI to fine-tune RQE and NLI model respectively

bull In order to reduce the running time when checking if the answered questionsare similar to unanswered questions we apply BM25 TF-IDF and Word2Vec tofilter some dissimilar questions pairs before applying RQE model

bull Set thresholds in the output of fine-tuned RQE and NLI models when usingthe models to locate the similar questions and validate the answers If themaximum probability is lower than this threshold the data will be rejectedOtherwise we will accept them This can help us to deal with some ambiguouscases and reduce the misclassification rates

bull An evaluation of RQE and NLI model is demonstrated by using Pseudo-relevancefeedback and significance tests The results show our approach is feasible inRQE task However this method is more limited on the NLI task

13 Outline

The thesis is organized as followsIn Chapter 2 we provide some background about different models about word

embedding and introduce BERT modelIn Chapter 3 related work about QRE NLI and augmenting data set are demon-

stratedIn Chapter 4 the motivation of the data set and the analysis is explainedIn Chapter 5 the methods and the implementation of fine-tuning are explained

Besides we introduce three different scoring methods to filter the data Evaluationis illustrated as well

In Chapter 6 we demonstrate and discuss the result of fine-tuned BERT modelIn Chapter 7 this is the conclusion chapter that we summarize the achievements

as well as the limitations and show the future work

4 Introduction

Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic modelsIn terms of the project different types of word embedding architectures are explainedin order to help the readers to understand this thesis Besides the meaning and theimportance of RQE and NLI are introduced as well so that the readers can have asimple idea about the project

21 Background on NLP

The core knowledge about this thesis is natural language processing (NLP) NLP isa sub-field of linguistics and artificial intelligence Its aim is to program the com-puters in order to analyze and process the large amounts of natural language dataIn other words NLP is how computer can understand human language or modellanguage NLP is a broad area and it includes many branches here we list the mostcommonly researched tasks syntax semantics discourse speech and dialogue Forthis project we mainly focus on semantics Meanwhile language models are thecrucial part in NLP because they describe the relationships between the words Weintroduce three different language model architectures to help the readers have abetter understanding of word embedding in the next section

211 Neural Network Language Model

Neural Network Language Model (NNLM) is the central model in NLP and manyother models are based on this Basically the goal of NNLM is to predict the prob-ability of next word by given the previous words in a text [Bengio et al 2003] Themodel is composed of four layers They are input projection hidden and outputlayers respectively At the input layer N previous words are encoded by using one-hot encoding Each word has V dimension where V is the size of the vocabularyThe input layer is projected into the projection layer by using a shared projectionmatrix where the size is Vtimes m and m is the feature vectors associated with eachword After projection each word is a row vector with the dimensionality of m Thiscan be showed as C(wt-n+1)C(wt-2) andC(wt-1) in Figure 21

5

6 Background

Figure 21 NNLM consists of four layers from bottom to to they are input layerprojection layer hidden layer and output layer [Bengio et al 2003]

As for the hidden layer it is the concatenation of word features from the previouslayer

x = (C(wt-1) C(wt-2) C(wt-n+1))

and the vector goes through the hyperbolic tangent tanh Following by that softmaxin the output layer is used to calculate the probability of the next words given by theprevious words

P(wt|wtminus1 wtminusn+1) =eywt

sumi eyi

Function yi is the unnormalized log-probabilities for each output word i

y = b + Wx + Utanh(d + Hx)

where b and d are the biases and H is the hidden layer weight [Bengio et al 2003]

sect21 Background on NLP 7

212 RNN and LSTM

Instead of the traditional feed-forward neural network where the information ispassed forward and the cycle is forbidden the recurrent neural network (RNN) al-lows cycles and the neurons get information at time-step t depends on the output attime-step t-1 In other words at each step the output of the current time-step willdepend on the previous one RNN has sequential inputs and ideally it can learnnot only the current state of the node but also the previous state However it hasthe tendency that the gradient will vanish or explode over many steps The reasonis that RNN is trained by backpropagation through the timea When the gradientspass back the gradients are calculated by continuous multiplications of derivativesIf the values of derivatives are too small or too large the continuous multiplicationsmay cayse the gradient to explore or vanish Besides there are some evidences showthat it is difficult for the traditional RNN to store the information for a long time inpractice

Long short term memory (LSTM) is a special kind of RNN architecture It hasconsiderable performance in the NLP fields such as sentiment classification textsynthesis Compared with the standard RNN it introduces some special hiddenunits to store the information for a long time They are memory cells They connectwith each other at the next time-step with particular weights [Graves et al 2013]Figure 22 shows a memory cell Here are the composite functions

it = σ(Wxixt + Whihtminus1 + Wcictminus1 + bi)

ft = σ(Wx f xt + Wh f htminus1 + Wc f ctminus1 + b f )

ct = ftctminus1 + ittanh(Wxcxt + Whchtminus1 + bc)

ot = σ(Wxoxt + Whohtminus1 + Wcoct + bo)

ht = ottanh(ct)

where σ is the sigmoid function A single memory cell is consisted of input gateforget gate output gate and cell activation vectors and they can be represented by if o and c respectively The W weight matrices from the cell to the gate vectors arediagonal so the input from element m can only be received by element m in eachgate vector [Graves et al 2013]

Within a memory cell LSTM can be divided into four steps

bull Based on the current input and htminus1 ft is to determine what information shouldforget If the output of sigmoid function is 0 that means the information willbe forgotten otherwise the information will be remembered

bull The next step is to determine what new information we store in the cell stateThere are two parts First it determines what information we will update Anew ct is created and added it to the state

bull Update ct by using the value in ft old state the value in it and the current

8 Background

Figure 22 A single memory cell [Graves et al 2013]

state Namely updating ftctminus1 + it ct

bull As for output gate use sigmoid function to determine what part of the cell statewill be outputted Then applyinh tanh function to output the decided part

213 Transformer

RNN and LSTM are the traditional approaches in sequence modeling and transduc-tion problems such as language translation Transformer is another sequence trans-duction model that allows parallelization It has high performance with significantlyless training time [Vaswani et al 2017]

The model architecture is shown in Figure 23 From the figure we can see themodel is composed of two parts one is encoder and the other one is decoder As forencoder it consists of a stack of N=6 identical layers Each layer has two sub-layersone is multi-head attention and the other one is position-wise fully connected feed-forward network So the output of each sub-layer is LayerNorm(x + Sublayer(x))where Sublayer(x) is output of the function implemented in the sub-layer itself Asfor decoder the structure is the same as encoder where it has a stack of N=6 identical


Instead of the two sub-layers in the encoder, however, each decoder layer has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer architecture, consisting of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017]

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and to find out which tokens should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying learned weights with the input embedding. The attention function is

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to d_k, d_k and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

where W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v} and W^O ∈ R^{hd_v×d_model}. The benefit is that multi-head attention allows the model to learn subspace representations and helps it focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
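As an illustration, the following is a minimal NumPy sketch of single-head scaled dot-product attention; the matrices are random placeholders, and the linear projection step of multi-head attention is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Toy sequence of 5 tokens with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)   # shape (5, 8)
```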

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by computing word embeddings for their own data sets. Word embeddings can be separated into two kinds: static word embeddings and dynamic word embeddings. In the remaining part, we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

A static word embedding generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. The proposed models achieve not only that the vectors of similar words are close to each other, but also that words can have multiple degrees of similarity.

The two architectures, which can be used as part of Word2Vec to learn word embeddings, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns an embedding by predicting the current word based on its context; the latter learns by predicting the surrounding words given the current word. These two models are both similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, whose target is to predict the current word based on the context, Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all words; thus all words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model predict from the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both architectures are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013]
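For illustration, here is a minimal sketch of training Skip-gram embeddings with the gensim library on a toy corpus; the corpus and hyperparameters are placeholders chosen for the example, not the project's settings.

```python
from gensim.models import Word2Vec

# Toy corpus: each document is a list of pre-tokenized words.
corpus = [
    ["patient", "asks", "about", "allergy", "medication"],
    ["doctor", "answers", "question", "about", "allergy"],
    ["patient", "reports", "headache", "and", "fever"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

vector = model.wv["allergy"]                 # 100-dimensional word vector
similar = model.wv.most_similar("allergy")   # nearest words by cosine similarity
```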

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does well on analogy tasks, it only trains word embeddings on separate local context windows instead of on global co-occurrence counts, and this may fail to exploit the huge amount of repetition in the data. GloVe, in contrast, captures global corpus statistics directly.

The main intuition of GloVe is the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_ij is the number of times word j occurs in the context of word i within the specified context windows. X_i is the number of times any word occurs in the context of word i, that is, Σ_k X_ik. Besides, P_ij = P(j|i) = X_ij / X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam.

Figure 2.5: The measurement of co-occurrence probabilities for selected context words and the target words ice and steam [Pennington et al., 2014]

We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationships between these words. We can see that solid co-occurs more frequently with ice than with steam; similarly, gas co-occurs more frequently with steam than with ice. Based on these observations, GloVe proposes a new weighted least squares regression model with the cost function

J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2

where V is the size of the vocabulary, f(X_ij) is a weighting function, w_i is a word vector, w̃_j is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another word, namely X_ij = 0. Second, the function should be non-decreasing, in order to guarantee that frequent co-occurrences are weighted higher than rare co-occurrences. Third, it should not overweight very frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

f(x) = (x / x_max)^α  if x < x_max,  and 1 otherwise

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
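A minimal sketch of this weighting function in Python, using the paper's suggested constants:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x) for a co-occurrence count x."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

# Rare co-occurrences get small weights; frequent ones are capped at 1.
print(glove_weight(1))    # ~0.032
print(glove_weight(50))   # ~0.595
print(glove_weight(500))  # 1.0
```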


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in NLP, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?) since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. As a consequence, the word vectors cannot be adjusted based on context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding, which considers word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we introduce two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly described, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions when these deep architectures are applied [Erhan et al., 2010]. Erhan et al. [2010] also suggest that training deep architectures with standard schemes (random initialization) tends to give the parameters poor generalization, whereas the improvement from a pre-training strategy is impressive for deep models. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus of labeled or unlabeled data. The pre-trained model is then fine-tuned on downstream tasks with a small number of labeled examples.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model needs to consider not only the complex characteristics of word use but also that words have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show that ELMo works extremely well in practice, with up to 20% relative error reductions compared with other state-of-the-art models.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it. The backward pass is similar, except that the state contains information about the word itself and the context after it. Each pair of forward and backward states produces an intermediate word vector, which represents the word's meaning but also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht]

In addition, those forward and backward passes form a multilayer LSTM, as shown in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers.


For each token t_k, an L-layer biLM computes 2L + 1 representations:

R_k = { x_k^LM, h→_{k,j}^LM, h←_{k,j}^LM | j = 1, …, L }

For the final word representation used in downstream NLP tasks, ELMo combines all layers in R_k into a single vector with task-specific weights:

ELMo_k^task = E(R_k; Θ^task) = γ^task Σ_{j=0}^{L} s_j^task h_{k,j}^LM

where s^task are softmax-normalized weights, γ^task is a scalar parameter, and h_{k,j}^LM = [h→_{k,j}^LM; h←_{k,j}^LM].

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information; the representation still cannot simultaneously take advantage of both left and right contexts. By contrast, the Transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows a bidirectional pre-trained model, a downside is the mismatch created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step.


To mitigate this, BERT replaces the selected token with [MASK] 80% of the time, with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this is hard to capture from language modeling directly without knowing the meaning of the two sentences. Pre-training with NSP is very beneficial for such tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project; how it is fine-tuned is described in the following chapters.
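A minimal sketch of the 80/10/10 corruption scheme for masked LM, assuming a whitespace-tokenized sentence and a toy vocabulary; real BERT implementations operate on WordPiece token ids rather than raw words.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Corrupt tokens for masked LM; returns corrupted tokens and prediction targets."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            targets.append(tok)                          # model must predict the original
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))   # 10%: random token
            else:
                corrupted.append(tok)                    # 10%: keep the original token
        else:
            corrupted.append(tok)
            targets.append(None)                         # position is not predicted
    return corrupted, targets

vocab = ["pain", "bladder", "urine", "doctor", "infection"]
tokens = "my urine is burning and my bladder feels full".split()
print(mask_tokens(tokens, vocab))
```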

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) addresses a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. In plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B (entailment)

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes type 2. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the answered questions that are similar to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multi-class classification problem with three labels: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes for my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the work related to our purpose and methods, and explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that applies RQE and NLI to medical data sets in order to find relations between sentences. In Section 3.3 we discuss work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques for inference and entailment in the medical domain; such applications can help improve domain-specific information retrieval and question answering systems. The competition includes three tasks targeting the medical domain: NLI, RQE and question answering (QA). In this section and the next, we discuss some of the approaches the competing teams applied to RQE and NLI.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find that the results do not compare well with pre-trained language models. They argue one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider context. Zhu et al. [2019] also find that among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the training set lacks strong negative examples. To address this issue, synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain came when Romanov and Shivade [2018] introduced MedNLI, a new public data set intended to reduce the gap for specialized and knowledge-intensive domains. They establish baselines with a feature-based system, a bag-of-words model (BOW), an InferSent model and an ESIM model; among them, InferSent performs best. In the competition, as with the RQE task, the common approach is to train models from the MT-DNN and BERT families and then use an ensemble as the final system [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI whose basic model consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use a Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they concatenate the outputs of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the test data set.

3.3 Augmenting Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general, one expert is not enough, since individual judgments can be too subjective and affect the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision methods, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model to estimate the accuracy of the different functions. The accuracies are used to re-weight and combine the labels, producing probabilistic labels.


On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to produce data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we explain the motivation for choosing the data set and demonstrate how the data were collected. We also analyze the collected data from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering, and the existing medical query data sets have some limitations. For example, the data may be too simple to mimic real human behavior, or a lexical gap may exist between patients and experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes it easier to convince others. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data must come from the real world. In other words, the questions should be asked by patients, rather than found or extracted from professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data have good quality and can be used in data-driven approaches.

• The data set should be large. As mentioned before, if the data set is moderate or small, the model can easily overfit, which affects accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses or phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24¹ to be a good choice for construction. It is South Africa's leading consumer health website.

¹ Health24: https://www.health24.com



It provides world-class information for a healthy lifestyle. Besides, there is an interactive tool through which patients can ask their questions and experts answer them. These questions and answers are both public and are continuously updated; hence we use them to construct our data set.

4.2 Collecting the Data Set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 broad categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. From the crawled data we remove sensitive information, such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (age left blank), and converting "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths are given in Table 4.1. From the table we can see the average lengths of the three different types are close together, all around 85 to 95 words. However, the maximum lengths differ hugely. Checking the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered, since the experts' information cannot solve the patients' queries.

² Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions          95.40            5417
Unanswered Questions        85.02            1890
Answers                     92.58            6441

Table 4.1: The length of questions and answers (in words)


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. We also explain the details of fine-tuning and the three scoring methods for filtering similar questions, and describe the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be represented either by abbreviations or by multi-word expressions, while standing for the same meaning. To expand the abbreviations and make the sentences as informative as possible (e.g. expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase, then remove punctuation. Following that, we tokenize the words to split the sentences into individual units, remove stop words, and apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a more suitable form for the following tasks.
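A minimal sketch of this cleaning pipeline using NLTK; the exact tokenizer, stop word list and stemmer used in the project are not specified, so these choices are assumptions.

```python
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumes the NLTK 'punkt' and 'stopwords' data have been downloaded.
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(question: str) -> list:
    text = question.lower()                                            # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = word_tokenize(text)                                       # tokenize
    tokens = [t for t in tokens if t not in stop_words]                # remove stop words
    return [stemmer.stem(t) for t in tokens]                           # stem

print(preprocess("Is it safe to take Allergex while pregnant?"))
# e.g. ['safe', 'take', 'allergex', 'pregnant']
```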

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so for a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get results, instead of summing all the word representations together.



Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

For fine-tuning the model on the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. Analyzing these data, we find a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 pairs respectively for training, validation and test, and the labels are contradiction, entailment and neutral. As before, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a model pre-trained on scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning, we follow the suggestions from Devlin et al. [2018], setting the batch size to 16, gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. We set the number of epochs to 1 and 2 for the RQE and NLI tasks respectively. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface¹.
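The following is a minimal sketch of this fine-tuning setup with the huggingface transformers library; the SciBERT checkpoint name and the toy data are assumptions, and the actual project modifies run_classifier.py rather than using this loop.

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# allenai/scibert_scivocab_uncased is the publicly released SciBERT checkpoint.
name = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)  # 2 for RQE, 3 for NLI

optimizer = AdamW(model.parameters(), lr=2e-5)
accumulation_steps = 4   # gradients are accumulated over 4 mini-batches

# `pairs` would hold (question_a, question_b, label) tuples from the CHQs set.
pairs = [("Is Allergex safe in pregnancy?", "Can I take Allergex while pregnant?", 1)]
for step, (q1, q2, label) in enumerate(pairs):
    enc = tokenizer(q1, q2, truncation=True, padding=True, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([label])).loss
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```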

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from the unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain: under the assumption that the top-ranked documents contain many useful terms that help discriminate relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round.

¹ Huggingface GitHub: https://github.com/huggingface/transformers


The overall performance can thus be improved, since some relevant documents missed in the first round can be retrieved in the second [Cao et al., 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; with 10 unanswered questions and 100 answered questions, this would give 1,000 question pairs, most of which are not similar. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions and only keep the top five question pairs. This step reduces the number of pairs in the test set by filtering obviously dissimilar pairs; detailed information is given in Section 5.4. One thing to note is that the added data must be balanced, otherwise the model will be biased toward the majority class and the results will not be accurate. The NLI task follows similar steps, under the following assumption: if an unanswered question and an answered question are similar, then the answer to the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set for the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set, we set thresholds on the output [Bishop, 2006]. Instead of assigning the label with maximum probability, we use 0.6 and 0.4 as thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps resolve ambiguous cases and reduces the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise it is accepted. In general, rejected data would be classified by experts; however, due to the limited time of the project and the absence of experts, rejected data are removed from the data set.
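A small sketch of this rejection rule, assuming the model outputs a softmax probability vector per pair:

```python
def accept_label(probs, threshold):
    """Return the predicted label index, or None if the pair should be rejected."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None

# RQE (binary): threshold 0.6; NLI (three labels): threshold 0.4.
print(accept_label([0.55, 0.45], threshold=0.6))         # None -> rejected
print(accept_label([0.72, 0.28], threshold=0.6))         # 0    -> accepted
print(accept_label([0.35, 0.38, 0.27], threshold=0.4))   # None -> rejected
```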

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. Term frequency measures the number of times the word occurs in a particular document.


The other part, inverse document frequency, calculates how much information the word provides. If the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count({d ∈ D : t ∈ d}) )

where t is the word of interest, d is its document, D is the set of documents and N is the total number of documents. TF-IDF only measures a single word; to calculate the similarity between two sentences, we use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors; in this case, each element of A and B represents one word's TF-IDF value.
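A minimal sketch of this filtering step with scikit-learn, assuming pre-processed question strings; the top-5 cutoff matches the procedure described in Section 5.3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["allergex tablets safe during pregnancy",
            "flu shot child egg allergy",
            "honey sugar diabetes type 2"]
unanswered = ["safe take allergex weeks pregnant allergies"]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(answered)     # TF-IDF vectors of answered questions
U = vectorizer.transform(unanswered)       # same vocabulary for unanswered ones

scores = cosine_similarity(U, A)[0]        # one score per answered question
top5 = scores.argsort()[::-1][:5]          # keep the five most similar questions
print([(answered[i], round(scores[i], 3)) for i in top5])
```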

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find documents similar to a particular document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i) · [ f(q_i, D) · (k_1 + 1) ] / [ f(q_i, D) + k_1 · (1 − b + b · |D| / d_avg) ]

where (a small implementation follows the list):

• D is the document, Q is the query and q_i is the i-th term in Q;

• f(q_i, D) is the number of times q_i occurs in document D;

• |D| is the length of the document;

• d_avg is the average length of a document;

• k_1 and b are hyper-parameters; in general, k_1 = 2 and b = 0.75.
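A minimal implementation of this scoring function, assuming pre-tokenized documents and one common smoothed IDF variant; libraries such as rank_bm25 provide a production version.

```python
import math

def bm25_score(query, doc, docs, k1=2.0, b=0.75):
    """Score one tokenized document against a tokenized query with BM25."""
    d_avg = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    score = 0.0
    for q in query:
        df = sum(1 for d in docs if q in d)               # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        f = doc.count(q)                                  # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

docs = [["allergex", "pregnancy", "safe"], ["flu", "shot", "egg", "allergy"]]
print(bm25_score(["allergex", "pregnant"], docs[0], docs))
```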

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this module, 2.2 million 200-dimensional word vectors were trained on PubMed [Tilbury, 2018]. For each sentence, we add the word embedding vectors together, and then use cosine similarity to calculate the similarity between sentences.
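A minimal sketch of this sentence-level similarity, assuming the pre-trained vectors have been loaded into a gensim KeyedVectors object (the file name is a placeholder):

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; Chiu et al. distribute their PubMed vectors in word2vec format.
wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary words; out-of-vocabulary words are skipped.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector(["bladder", "pain", "burning", "urine"])
s2 = sentence_vector(["urinary", "tract", "infection"])
print(cosine(s1, s2))
```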

5.5 Evaluation Metrics

For evaluation we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test set whose true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns the actual class. TP counts instances correctly assigned to the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite: the true class is 1 and the predicted class is 0. TN counts instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct, with formula accuracy = (TP + TN) / total, while recall measures the percentage of relevant results correctly classified, with formula recall = TP / (TP + FN). For multi-class classification, the confusion matrix is more complex, but the idea is the same.

                        Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)          TP                    FP
Predicted Negative (0)          FN                    TN

Table 5.1: The confusion matrix


Figure 5.1: The process of the RQE and NLI tasks; the upper diagram is RQE and the lower one is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potentially similar question) pair; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to the CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except the test inputs are question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate p-values, which help determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that they can. Before obtaining a p-value, a t-test needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula:

t = (x̄_1 − x̄_2) / √( s^2 (1/n_1 + 1/n_2) )

where

s^2 = [ Σ_{i=1}^{n_1} (x_i − x̄_1)^2 + Σ_{j=1}^{n_2} (x_j − x̄_2)^2 ] / (n_1 + n_2 − 2)
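This statistic, with the pooled variance above, is what scipy computes under equal-variance settings. A sketch using the accuracy columns of Table 6.1 (the one-sided alternative reflects the hypothesis that the added data increase accuracy; the exact p-value may differ from the reported one depending on rounding):

```python
from scipy import stats

chqs  = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]  # original data (Table 6.1)
tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF

# Unpaired (independent two-sample) t-test with pooled variance.
t, p = stats.ttest_ind(tfidf, chqs, equal_var=True, alternative="greater")
print(t, p)
```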

From Table 6.2 we can see that only the results of TF-IDF and BM25 are less than 0.05. In those cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use with Word2Vec was created by training on PubMed, whose source comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by laypeople, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label the RQE model generates is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data may not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes, at 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682      0.9709          0.9693         0.9710
Iteration 2  0.9655      0.9731          0.9708         0.9699
Iteration 3  0.9676      0.9742          0.9747         0.9710
Iteration 4  0.9682      0.9692          0.9709         0.9726
Iteration 5  0.9731      0.9775          0.9743         0.9715
Average      0.9685      0.9720          0.9746         0.9712

Table 6.1: Accuracy of the different methods on the RQE task

           Added data number   p-value
TF-IDF           8084           0.0380
BM25             7390           0.0499
Word2Vec         1830           0.0514

Table 6.2: The number of added examples and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest amount is only 1,287, whereas in the RQE task more than 8,000 pairs were added. Two conditions apply when selecting added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferable from their relevant unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they bias the model. We use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating the added data for the NLI task. In addition, NLI is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods. Combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287), so adding the extra data to the original data effectively adds novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts for k-fold cross-validation, so we obtain five accuracies per method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527       0.7601            0.7562           0.7453
Iteration 2  0.7522       0.7426            0.7426           0.7515
Iteration 3  0.7504       0.7469            0.7554           0.7540
Iteration 4  0.7402       0.7593            0.7536           0.7430
Iteration 5  0.7640       0.7657            0.7586           0.7569
Average      0.7523       0.7550            0.7533           0.7501

Table 6.3: Accuracy of the different methods on the NLI task

           Added data number   p-value
TF-IDF           1287           0.3340
BM25             1149           0.4191
Word2Vec          459           0.6661

Table 6.4: The number of added examples and p-values for the NLI task


                      Standard deviation of loss
MedNLI                       0.01392
MedNLI + TF-IDF              0.02166
MedNLI + BM25                0.01726
MedNLI + Word2Vec            0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter we conclude the project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, faces many challenges: the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; within it there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered ones. Intuitively, to decide whether two questions are similar, we could pair each unanswered question with all answered questions, but this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from the corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Since we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected.


Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are several places to improve:

• In this project we only have 1,950 unanswered questions, and after the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model and easily overfits on small data sets. So, as a next step, we could expand our data; it would then be more persuasive to validate our idea.

• As discussed in the previous chapter, the added data for the NLI task differ from the original NLI data. A next step is to find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. A next step is to optimize the way Word2Vec is used.

• Fine-tuning and label generation were done using Colab. Due to limited hardware and time, we did not find the best hyperparameters for these models. Next steps are to do hyperparameter optimization and to increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B., Shivade, C., and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I., Lo, K., and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A., Rungta, R., Route, J., Nyberg, E., and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G., Nie, J.-Y., Gao, J., and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)


Chiu, B., Crichton, G., Korhonen, A., and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M., Zordan, R., Taylor, D. M., Weiland, T. J., Dilley, S. J., Kant, J., Dombagolla, M., Hendarto, A., Lai, F., and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A., Mohamed, A.-r., and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B., Srinivasan, A., Chaudhary, A., Route, J., Mitamura, T., and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T., Chen, K., Corrado, G., and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M., King, D., Beltagy, I., and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V., Karimi, S., and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A., Raghavan, P., Liang, J., and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J., Socher, R., and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M., Ruder, S., and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., Gatford, M., et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices


Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:

Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au


Appendix 2 Artefact Description

The project consists of three parts: one is the folders, which store the source data and the generated data; one is the .py files; and the other one is the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy (https://scrapy.org).

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf (https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py).

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and the processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface (https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py).

load_file.py converts the raw source data to dataframes; pre_processed_data.py pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

Figure 1: The tree structure of the health24 folder.

The .py files run in PyCharm; the .ipynb files run in Google Colab (https://colab.research.google.com).


Appendix 3 Readme File

For this project there are two kinds of files: one is the .py files and the other one is the .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:

pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to dataframes (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the models:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


List of Tables

4.1 The length of questions and answers . . . 25

5.1 Table for the confusion matrix . . . 31

6.1 Shows the accuracy in different methods for RQE task . . . 34
6.2 Shows the added data number and p-values for RQE task . . . 34
6.3 Shows the accuracy in different methods for NLI task . . . 35
6.4 Shows the added data number and p-values for NLI task . . . 35
6.5 Shows the standard deviation of loss for four methods . . . 36

Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of the different existing data sets and the overfitting problem caused by small medical data sets. In Section 1.2, the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which the system automatically provides a set of correct answers to a question asked by users in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions are the second most searched thematic area in Google, and such topics took 5% of all the others in 2016 [Cocco et al 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center 1 showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and size of the data sets, limited annotated data, and whether the content is close to reality. There are some existing data sets related to this topic, but they all contain some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], whereas the data consist of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al 2016] consists of more than 100,000 questions created by crowdsourcing over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific large-scale synthetic data set in the domain of electronic medical records [Pampari et al 2018]. However, synthetic data sets perform weaker than handcrafted data sets [Nguyen 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013


On the other hand, the size of the medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked due to ethical reasons and obligatory agreements. Also, cost constraints and the lack of domain experts for annotation are other reasons which cause small data sets [Nguyen et al 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting and runs into the Out-Of-Vocabulary (OOV) problem [Luong et al 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al 2019]. In NLP tasks, unseen words occur frequently; if the data set is small, the model cannot generalize very well due to the unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set in order to help researchers investigate patient question answering. The data set is based on Health24, a medicine-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks by using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach is different from the older models, which apply a task-specific architecture to task-specific labeled data. Also, transfer learning outperforms the older models, since pre-training can support better generalization [Erhan et al 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide existing answers automatically if the original questions are similar to already-answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check if the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we learn that the unanswered question A is similar to the answered question B; the fine-tuned NLI model then determines if the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking if the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the output of the fine-tuned RQE and NLI models when using the models to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the pair will be rejected; otherwise we will accept it. This can help us to deal with some ambiguous cases and reduce the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide some background about different word embedding models and introduce the BERT model.

In Chapter 3, related work about RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. The evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7, we conclude by summarizing the achievements as well as the limitations, and we show the future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a basic idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; the most commonly researched tasks are syntax, semantics, discourse, speech and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al 2003]. The model is composed of four layers: the input, projection, hidden and output layers. At the input layer, N previous words are encoded by using one-hot encoding; each word has V dimensions, where V is the size of the vocabulary. The input layer is projected into the projection layer by using a shared projection matrix whose size is V × m, where m is the dimensionality of the feature vector associated with each word. After projection, each word is a row vector with dimensionality m. This is shown as C(w_{t-n+1}), ..., C(w_{t-2}) and C(w_{t-1}) in Figure 2.1.


Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al 2003].

As for the hidden layer, it is the concatenation of the word features from the previous layer:

$$x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))$$

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

The function $y_i$ gives the unnormalized log-probability for each output word $i$:

$$y = b + Wx + U \tanh(d + Hx)$$

where $b$ and $d$ are the biases and $H$ is the hidden layer weight [Bengio et al 2003].
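To make the four layers concrete, here is a minimal PyTorch sketch of this forward pass (not code from this project); the sizes V, m, n and h are illustrative placeholders rather than values from Bengio et al [2003].

import torch
import torch.nn as nn

class NNLM(nn.Module):
    def __init__(self, V=5000, m=100, n=4, h=60):
        super().__init__()
        self.C = nn.Embedding(V, m)              # shared projection matrix (V x m)
        self.hidden = nn.Linear((n - 1) * m, h)  # d + Hx
        self.U = nn.Linear(h, V, bias=False)     # U tanh(d + Hx)
        self.W = nn.Linear((n - 1) * m, V)       # direct connection b + Wx

    def forward(self, context):                  # context: (batch, n-1) word ids
        x = self.C(context).flatten(1)           # concatenated feature vectors
        y = self.W(x) + self.U(torch.tanh(self.hidden(x)))  # y = b + Wx + U tanh(d + Hx)
        return torch.log_softmax(y, dim=-1)      # log P(w_t | previous words)

model = NNLM()
log_probs = model(torch.randint(0, 5000, (8, 3)))  # a batch of 8 contexts of 3 words
print(log_probs.shape)                             # torch.Size([8, 5000])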


2.1.2 RNN and LSTM

Instead of the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons get information at time-step t that depends on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces some special hidden units, called memory cells, to store information for a long time. They connect with each other at the next time-step with particular weights [Graves et al 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the sigmoid function. A single memory cell consists of the input gate, forget gate, output gate and cell activation vectors, represented by $i$, $f$, $o$ and $c$ respectively. The $W$ weight matrices from the cell to the gate vectors are diagonal, so the input from element $m$ can only be received by element $m$ in each gate vector [Graves et al 2013].

Within a memory cell, LSTM can be divided into four steps (a short code sketch follows the list):

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, $i_t$ determines what information we will update; then a new candidate cell state is created and added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$ and the candidate state; namely updating $c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$.

• As for the output gate, use the sigmoid function to determine what part of the cell state will be outputted. Then apply the tanh function to output the decided part.

Figure 2.2: A single memory cell [Graves et al 2013].
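The sketch below is a direct NumPy transcription of the composite functions above; since the cell-to-gate weight matrices are diagonal, they appear as elementwise products. The dimension and the random weights are placeholders for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d = 16                                                   # hidden/cell size (assumed)
rng = np.random.default_rng(0)
Wx = {g: 0.1 * rng.normal(size=(d, d)) for g in "ifco"}  # input weights
Wh = {g: 0.1 * rng.normal(size=(d, d)) for g in "ifco"}  # recurrent weights
Wc = {g: 0.1 * rng.normal(size=d) for g in "ifo"}        # diagonal cell-to-gate weights
b = {g: np.zeros(d) for g in "ifco"}

def lstm_step(x_t, h_prev, c_prev):
    i = sigmoid(Wx["i"] @ x_t + Wh["i"] @ h_prev + Wc["i"] * c_prev + b["i"])  # input gate
    f = sigmoid(Wx["f"] @ x_t + Wh["f"] @ h_prev + Wc["f"] * c_prev + b["f"])  # forget gate
    c = f * c_prev + i * np.tanh(Wx["c"] @ x_t + Wh["c"] @ h_prev + b["c"])    # cell update
    o = sigmoid(Wx["o"] @ x_t + Wh["o"] @ h_prev + Wc["o"] * c + b["o"])       # output gate
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.normal(size=d), np.zeros(d), np.zeros(d))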

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization; it achieves high performance with significantly less training time [Vaswani et al 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: one is the encoder and the other one is the decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other one is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented by the sub-layer itself. As for the decoder, the structure is the same as the encoder, with a stack of N = 6 identical layers. Instead of two sub-layers as in the encoder, the decoder has three; the extra one is masked multi-head attention, and this can prevent the model from seeing the content after the word being predicted [Vaswani et al 2017].

Figure 2.3: The Transformer architecture consists of the encoder (left part) and the decoder (right part) [Vaswani et al 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and to find out which tokens should be paid more attention. For each token, three vectors (query, key and value) are generated during training by multiplying learned weight matrices with the input embedding. The attention function is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$

$$\text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.

The benefit is that it allows the model to do subspace representation learning and helps the model focus on different positions; the final embedding is not just the word anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
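The scaled dot-product attention above fits in a few lines of NumPy; this illustrative sketch omits the multi-head projections and the decoder's masking.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # relatedness between every pair of tokens
    return softmax(scores) @ V               # weighted sum of the values

seq_len, d_k = 5, 8                          # toy sizes
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out = attention(Q, K, V)                     # one context vector per token: (5, 8)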

2.2 Word Embedding

When dealing with natural language, there is always a problem: the text is composed of words and characters, but computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by computing word embeddings for their own data sets in their own way. Generally, word embedding can be separated into two kinds: one is static word embedding and the other one is dynamic word embedding. In the remaining part, we introduce several different word embedding approaches according to their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two architectures, which can be used as part of Word2Vec to learn the word embedding, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target. This can help the model to predict based on the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The figure is in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al 2013].
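Both architectures are available in the gensim library; the sketch below is illustrative only, with an invented toy corpus, and uses the gensim 4.x API, where the sg flag switches between CBOW (0) and Skip-gram (1).

from gensim.models import Word2Vec

corpus = [
    ["patient", "reports", "severe", "allergy", "symptoms"],
    ["doctor", "recommends", "antihistamines", "for", "allergy"],
]
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram
print(cbow.wv.most_similar("allergy", topn=2))  # nearest neighbours in vector space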

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. As for GloVe, the global corpus statistics can be captured directly by its novel approach.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. $X$ represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$ within the specific context window. $X_i$ is the number of times any word occurs in the context of word $i$, i.e. $\sum_k X_{ik}$. Besides, $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word $j$ appears in the context of word $i$.

Figure 2.5: The measurement of co-occurrence probabilities for selected context words and the target words ice and steam [Pennington et al 2014].

In Figure 2.5, suppose $i$ = ice and $j$ = steam; we can study the ratio of their co-occurrence probabilities with different context words $k$ to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $V$ is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i$ is a word vector, $\tilde{w}_j$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, $f(0) = 0$ if one word does not co-occur with other words, namely $X_{ij} = 0$. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max}$ is 100 and $\alpha$ is 3/4. More detailed information can be found in Pennington et al [2014].
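As a small illustration of how $f$ enters the objective, the following NumPy sketch evaluates $J$ on a toy co-occurrence matrix with randomly initialised parameters; all sizes are placeholders, not values from the paper.

import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0  # the piecewise weighting

rng = np.random.default_rng(2)
V, d = 10, 5                                        # toy vocabulary and vector sizes
X = rng.integers(0, 50, size=(V, V)).astype(float)  # toy co-occurrence counts
w = rng.normal(size=(V, d))                         # word vectors
w_ctx = rng.normal(size=(V, d))                     # context word vectors
b, b_ctx = np.zeros(V), np.zeros(V)                 # biases

J = sum(
    f(X[i, j]) * (w[i] @ w_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])) ** 2
    for i in range(V) for j in range(V)
    if X[i, j] > 0                                  # f(0) = 0, so zero counts drop out
)
print(J)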


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding, which considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: one is ELMo and the other one is BERT. Before introducing these two methods, pre-training will be briefly mentioned, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures such as neural networks with many hidden layers. The reason is that complicated functions representing high-level abstractions can be learned by applying these deep architectures [Erhan et al 2010]. Also, Erhan et al [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, while the improvement from a pre-training strategy is impressive for deep models. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al 2019]. In the case of NLP, the basic idea of pre-training is that the language model learns contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small number of labeled examples.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model not only needs to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al 2018]. In 2018, Peters et al [2018] proposed a novel method, called Embeddings from Language Models (ELMo), to address the mentioned challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, with up to 20% error reductions compared with the other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding for each word; in other words, each token is assigned a function of the entire input sentence [Peters et al 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it. The backward pass is similar, except that it contains information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with some particular weights [Joshi].

In addition, those forward passes and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, there are $2L+1$ representations in an $L$-layer biLM:

$$R_k = \left\{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \right\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R$ into a single vector based on some particular weights:

$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: one is the feature-based approach and the other one is the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al 2019]. Instead of using unidirectional language models, Devlin et al [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the current fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer was introduced in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT adds [CLS] at the start of every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other one is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens at random and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another, which is hard to capture from language modeling directly without knowing the relation between the two sentences. By using NSP, the pre-trained model becomes very beneficial for these tasks. The introduced pre-trained BERT model will be applied to the RQE and NLI tasks in this project. As for how to fine-tune, we will introduce that in the following chapters.
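The 15% selection and the 80/10/10 replacement rule can be sketched as follows; the function is illustrative only and works on plain token lists rather than real WordPiece ids.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:            # select 15% of the input tokens
            targets.append(tok)                    # the model must predict this token
            r = random.random()
            if r < 0.8:
                corrupted.append("[MASK]")         # 80% of the time: [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))  # 10%: a random token
            else:
                corrupted.append(tok)              # 10%: keep the original token
        else:
            corrupted.append(tok)
            targets.append(None)                   # not a prediction target
    return corrupted, targets

vocab = ["allergy", "pregnant", "doctor", "fever", "tablet"]
print(mask_tokens("is it safe to take allergex while pregnant".split(), vocab))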

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. Or, in plain English, it is to determine if question A and question B are similar questions. This task is binary classification: it includes positive (two questions are similar) and negative (two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take Allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to Allergex tablets. I have allergies and have managed to control them using Allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using Allergex preconceptionally and/or during pregnancy.

• Question A entails Question B.

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 type. Which is better, honey or sugar?

• Question A does not entail Question B.

RQE is the first step in our project, and using it here can help us to find the similar questions related to the unanswered questions. Also, this step can provide some useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sentences, not just between questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from the given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification (three-class) problem: one class is entailment, one is neutral and the other one is contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and get muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it is obtained it is advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3, we demonstrate some work about how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, the application could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks: NLI, RQE and question answering (QA), with the medical domain as the target. In this section and the next, we discuss some work applied to RQE and NLI among the competition teams.

The initial work for RQE used lexical and semantic features as input to supervised machine learning approaches, then used support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to find the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al 2019], [Zhu et al 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al [2019] also find that, among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism based on BERT models. So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap and the non-entailing examples are completely unrelated, so these cannot serve as strong negative examples in the training set. To address this issue, some synthetic data which are similar to the original data were created [Kumar et al 2019]. In addition, Bhaskar et al [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish baselines; among them, InferSent performs best compared with the other three methods. In the competition, same as for the RQE task, the common approach for training is to use the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al 2019]. On the other hand, Wu et al [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. As for the text encoder, they use MT-DNN, which allows more powerful and general word representations. As for the syntax encoder, they use Tree-LSTM to have a better linguistic understanding. For the feature encoder, they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it not only costs money but also takes time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results are sometimes too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al 2019] is a novel way to label data sets by using weak supervision methods. The results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model more easily overfits, which affects the accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses and phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. This is South Africa's leading consumer health website. It provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients, which helps to answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

1 Health24: https://www.health24.com

4.2 Collecting Data set

For collecting the data from the Health24 website, we use Scrapy 2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and users' answers.
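A spider along these lines might look like the sketch below; the start URL and the CSS selectors are placeholders (Health24's page markup is not reproduced here), and the real spider in health.py differs in detail.

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/experts"]  # placeholder URL

    def parse(self, response):
        # One item per question-answer block; the selectors are assumptions.
        for qa in response.css("div.question-block"):
            yield {
                "question": qa.css("p.question::text").get(),
                "answer": qa.css("p.answer::text").get(),  # None if unanswered
            }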

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together: they are all around 85 to 95 words per sentence. However, the maximum length shows a huge difference. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                         Average Length   Maximum Length
Answered Questions            95.40            5417
Unanswered Questions          85.02            1890
Answers                       92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also mention the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and make the sentences contain as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their bases or roots. Through these steps, the raw questions can be transformed into a form more suitable for the following tasks.
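These steps correspond to a standard NLTK pipeline; the sketch below is illustrative of the order described above.

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download("punkt"); nltk.download("stopwords")  # one-off downloads
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question):
    text = question.lower()                                           # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = word_tokenize(text)                                      # tokenize
    tokens = [t for t in tokens if t not in stop_words]               # remove stop words
    return [stemmer.stem(t) for t in tokens]                          # stem

print(preprocess("Is it safe to take Allergex while pregnant?"))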

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is a binary classification and the other one is a three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, while the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do the fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1,395 and 1,422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. Same as the previous one, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface 1.

1 Huggingface GitHub: https://github.com/huggingface/transformers
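A condensed sketch of this setup with the current huggingface transformers API is shown below (our own code follows the older run_classifier.py instead); the training loop and the variable train_pairs are placeholders, and the checkpoint name is the released SciBERT model on the huggingface hub.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

checkpoint = "allenai/scibert_scivocab_uncased"
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  # 2 for RQE, 3 for NLI
tokenizer = BertTokenizer.from_pretrained(checkpoint)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate 2e-5

accumulation_steps = 4  # with batch size 16, an effective batch of 64
model.train()
for step, (q1, q2, label) in enumerate(train_pairs):  # train_pairs: your question pairs
    enc = tokenizer(q1, q2, return_tensors="pt", truncation=True, padding=True)
    loss = model(**enc, labels=torch.tensor([label])).loss  # [CLS] fed to the classifier head
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()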

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potential similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions. We only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferred from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of assigning labels by the maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us to handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, the rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
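
A small sketch of this rejection rule (the helper is our own packaging, not the project code):

    import torch

    def accept_label(logits, threshold):
        # Accept the pair only when the maximum class probability clears
        # the threshold (0.6 for binary RQE, 0.4 for three-label NLI).
        probs = torch.softmax(logits, dim=-1)
        max_prob, label = probs.max(dim=-1)
        return label.item() if max_prob.item() >= threshold else None

    rqe_label = accept_label(torch.tensor([1.2, 0.3]), threshold=0.6)   # accepted
    nli_label = accept_label(torch.tensor([0.5, 0.4, 0.3]), threshold=0.4)  # rejected -> None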

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document; the other is inverse document frequency, which calculates how much information the word provides. If the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = N / count(d ∈ D : t ∈ d)

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.
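
For instance, TF-IDF cosine similarity between questions can be computed with scikit-learn; the sample questions below are invented for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    answered = ["what causes high blood pressure",
                "how can i treat a migraine"]
    unanswered = ["why is my blood pressure so high"]

    vectors = TfidfVectorizer().fit_transform(answered + unanswered)
    scores = cosine_similarity(vectors[-1], vectors[:-1])   # query vs. answered
    top5 = scores[0].argsort()[::-1][:5]   # keep only the five most similar questions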

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i, D) · [f(q_i, D) · (k_1 + 1)] / [f(q_i, D) + k_1 · (1 - b + b · |D| / d_avg)]

where:

• D is the document and Q is the query; q_i is a term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.
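
A direct transcription of the formula into Python is sketched below; since the text leaves the exact IDF variant unspecified, the standard Okapi IDF is assumed here.

    import math

    def bm25(query, doc, corpus, k1=2.0, b=0.75):
        # query and doc are token lists; corpus is a list of token lists
        d_avg = sum(len(d) for d in corpus) / len(corpus)
        score = 0.0
        for q in query:
            n_q = sum(1 for d in corpus if q in d)      # document frequency of q
            idf = math.log((len(corpus) - n_q + 0.5) / (n_q + 0.5) + 1)
            f = doc.count(q)                            # f(q_i, D)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
        return score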

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
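
A sketch of this sentence-level similarity with gensim follows; the embedding file name is an assumption, not the actual file used in the project.

    import numpy as np
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

    def sentence_vector(tokens):
        # Words absent from the pre-trained vocabulary are skipped, which is
        # exactly the information loss discussed in Chapter 6.
        vecs = [vectors[t] for t in tokens if t in vectors]
        return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))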

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). As for multi-category classification, the confusion matrix is more complex, but it follows the same idea.

                          Actual Positive (1)    Actual Negative (0)
Predicted Positive (1)    TP                     FP
Predicted Negative (0)    FN                     TN

Table 5.1: Table for the binary confusion matrix.
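
These metrics can be computed with scikit-learn, as in the short sketch below (the labels are invented for illustration):

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    accuracy = accuracy_score(y_true, y_pred)   # (TP + TN) / total
    recall = recall_score(y_true, y_pred)       # TP / (TP + FN)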


Figure 5.1: The diagram shows how the RQE and NLI tasks work. The upper part is RQE and the bottom part is NLI. For the RQE task: Step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model by using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potential similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. As for NLI, it follows the same process, except the input of the testing step is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result of the original data. In order to validate our approach, we calculate the p-value. This method helps us to determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since the value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = (x̄_1 - x̄_2) / sqrt( s² (1/n_1 + 1/n_2) )

where

s² = [ Σ_{i=1}^{n_1} (x_i - x̄_1)² + Σ_{j=1}^{n_2} (x_j - x̄_2)² ] / (n_1 + n_2 - 2)
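
This corresponds to scipy's independent two-sample t-test with equal variances; a sketch using the accuracy values of Table 6.1, halving the two-sided p-value for the one-sided alternative when the t statistic is positive:

    from scipy import stats

    chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]          # Table 6.1, CHQs
    chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]    # CHQs + TF-IDF

    t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
    p_one_sided = p_two_sided / 2   # alternative: the added data improve accuracy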

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis, i.e., the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method's perspective: the pre-trained embedding model which we use for Word2Vec is created by training on PubMed. The source in PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors will lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority label the RQE model generates is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected; that is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs      CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682    0.9709          0.9693        0.9710
Iteration 2   0.9655    0.9731          0.9708        0.9699
Iteration 3   0.9676    0.9742          0.9747        0.9710
Iteration 4   0.9682    0.9692          0.9709        0.9726
Iteration 5   0.9731    0.9775          0.9743        0.9715
Average       0.9685    0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

              Added data number   p-value
TF-IDF        8084                0.0380
BM25          7390                0.0499
Word2Vec      1830                0.0514

Table 6.2: The number of added data and the p-values for the RQE task.

Table 6.4 shows the number of added data and the p-values for the NLI task. We find all the values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data are more than 8,000. There are two conditions when we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider this case.

• Second, the added data must be balanced; otherwise, they will bias the model. In this case, we use the smallest count among the classes in the test results as the base for selecting the added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number for each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class does not have many data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 demonstrates the standard deviations of the loss over the four methods. Combining Table 6.4 with this, we find that the more added data there are, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data into the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, the small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI    MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527    0.7601            0.7562          0.7453
Iteration 2   0.7522    0.7426            0.7426          0.7515
Iteration 3   0.7504    0.7469            0.7554          0.7540
Iteration 4   0.7402    0.7593            0.7536          0.7430
Iteration 5   0.7640    0.7657            0.7586          0.7569
Average       0.7523    0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task.

              Added data number   p-value
TF-IDF        1287                0.3340
BM25          1149                0.4191
Word2Vec      459                 0.6661

Table 6.4: The number of added data and the p-values for the NLI task.

                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, poses many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us to roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data to train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions without answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to cause overfitting by using a small data set. So for the next step, we can expand our data; then it will be more persuasive to validate our idea.

• As noted in the previous section, the added data for the NLI task differ from the original NLI data. For the next step, we can find more suitable data with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association.

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379.

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375.

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137-1155.

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250.

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE.

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174.

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054.

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625-660.

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018).

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text.

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019).

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE.

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014).

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63.

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171.

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019).

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109.

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018).

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018).

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018).

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020).

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426.

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388.

Appendices


Appendix 1: Project Description and Contract

From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For the folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy.¹
raw_nli_data and raw_rqe_data: store the source data.
result: stores the labels generated by using the RQE and NLI test models.

For the .py files:
bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.²
rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.
convert_examples_to_features.py and tool.py are the files that store the feature functions and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.³
load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.

For the .ipynb files:
train_bert.ipynb and test.ipynb: the former is used to fine-tune the BERT model and do k-fold cross-validation; the latter is used to generate labels for the test sets with the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab.⁴

Figure 1: The tree structure of the health24 folder.

¹ Scrapy: https://scrapy.org
² ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³ run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
⁴ Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.


Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of different existing data sets and the overfitting problem caused by small medical data sets. In Section 1.2, the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which a system automatically provides a set of correct answers to the question asked by users in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions were the second most searched thematic area in Google, and the topic took 5% of all searches in 2016 [Cocco et al. 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center¹ showed that more than 35% of American adults had gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and the size of the data sets, limited annotated data, and whether the content is close to reality. There are some existing data sets related to this topic, but they all have some limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al. 2018] is a large data set with high quality. It serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], whereas the data consist of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al. 2016] consists of more than 100,000 questions created by crowdsourcing a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific large-scale synthetic data set in the domain of electronic medical records [Pampari et al. 2018]. However, synthetic data sets perform worse than handcrafted data sets [Nguyen 2019].

¹ Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013/


On the other hand, the size of a medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away for ethical reasons and obligatory agreements. Also, cost constraints and the lack of domain experts for annotation are further reasons which cause small data sets [Nguyen et al. 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting, which also leads to the Out-Of-Vocabulary (OOV) problem [Luong et al. 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al. 2019]. In NLP tasks, unseen words occur frequently. If the data set is small, then the model cannot generalize very well due to the unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks by using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach differs from older models, which apply a task-specific architecture to task-specific labeled data. Also, transfer learning outperforms the older models, since pre-training supports better generalization [Erhan et al. 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al. 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide existing answers automatically when the original questions are similar to already-answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check whether the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model, we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines whether the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF, and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the outputs of the fine-tuned RQE and NLI models when using the models to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the data are rejected; otherwise, we accept them. This helps us to deal with some ambiguous cases and reduce the misclassification rates.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide some background about different word embedding models and introduce the BERT model.

In Chapter 3, related work about RQE, NLI, and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. The evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT models.

In Chapter 7, we conclude, summarizing the achievements as well as the limitations, and show the future work.

Chapter 2

Background

In this chapter, we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a basic idea of the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; the most commonly researched tasks are syntax, semantics, discourse, speech, and dialogue. For this project, we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of the NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al. 2003]. The model is composed of four layers: input, projection, hidden, and output layers. At the input layer, the N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer using a shared projection matrix of size V × m, where m is the number of features associated with each word. After projection, each word is a row vector with dimensionality m. This is shown as C(w_{t-n+1}), ..., C(w_{t-2}), and C(w_{t-1}) in Figure 2.1.

Figure 2.1: The NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer, and output layer [Bengio et al. 2003].

As for the hidden layer, it operates on the concatenation of the word features from the previous layer,

x = (C(w_{t-1}), C(w_{t-2}), ..., C(w_{t-n+1})),

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

P(w_t | w_{t-1}, ..., w_{t-n+1}) = e^{y_{w_t}} / Σ_i e^{y_i}

where y_i is the unnormalized log-probability for each output word i, computed as

y = b + Wx + U tanh(d + Hx)

where b and d are the biases and H is the hidden layer weight matrix [Bengio et al. 2003].
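
As a small illustration, a toy forward pass of this output layer with random parameters (all sizes invented):

    import numpy as np

    V, h, nm = 10, 8, 6                       # vocabulary, hidden, feature sizes
    x = np.random.randn(nm)                   # concatenated C(w) feature vector
    b, d = np.random.randn(V), np.random.randn(h)
    W, U, H = np.random.randn(V, nm), np.random.randn(V, h), np.random.randn(h, nm)

    y = b + W @ x + U @ np.tanh(d + H @ x)    # unnormalized log-probabilities
    p = np.exp(y) / np.exp(y).sum()           # softmax over the vocabulary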


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons receive information at time-step t that depends on the output at time-step t-1. In other words, at each step, the output of the current time-step depends on the previous one. An RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that an RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives. If the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units to store information for a long time: the memory cells. They connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)

f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)

c_t = f_t c_{t-1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)

h_t = o_t tanh(c_t)

where σ is the sigmoid function. A single memory cell is composed of an input gate, a forget gate, an output gate, and cell activation vectors, represented by i, f, o, and c respectively. The weight matrices W from the cell to the gate vectors are diagonal, so element m in each gate vector only receives input from element m of the cell vector [Graves et al. 2013].
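
These composite functions transcribe directly into code; a minimal numpy sketch of one time-step (the parameter dictionary p is our own packaging, and the diagonal cell-to-gate matrices are applied as element-wise products):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
        f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
        c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
        o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
        h = o * np.tanh(c)
        return h, c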

Within a memory cell, the LSTM computation can be divided into four steps:

• Based on the current input and h_{t-1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise, the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, it determines what information we will update; then a new candidate c̃_t is created and added to the state.

• Update c_t by using the value in f_t, the old state, the value in i_t, and the current candidate; namely, updating c_t = f_t c_{t-1} + i_t c̃_t.

• As for the output gate, use the sigmoid function to determine what part of the cell state will be output, then apply the tanh function to output the decided part.

Figure 2.2: A single memory cell [Graves et al. 2013].

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model that allows parallelization; it achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure, we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention, and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder, with a stack of N = 6 identical layers. Instead of the two sub-layers of each encoder layer, each decoder layer has three; the extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al. 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should be paid more attention. For each token, the three parameters (query, key, and value) are generated during training by multiplying weights with the input embedding. The attention function is

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys, and values with learned linear projections to d_k, d_k, and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q ∈ R^{d_model × d_k}, W_i^K ∈ R^{d_model × d_k}, W_i^V ∈ R^{d_model × d_v}, and W^O ∈ R^{h·d_v × d_model}. The benefit is that this allows the model to learn from different representation subspaces and helps the model focus on different positions. The final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
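
For clarity, scaled dot-product attention and a single attention head can be written in a few lines of numpy (the shapes and sizes are illustrative):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        d_k = Q.shape[-1]
        return softmax(Q @ K.T / np.sqrt(d_k)) @ V

    seq, d_model, d_k = 4, 16, 8
    X = np.random.randn(seq, d_model)                 # token embeddings
    WQ, WK, WV = (np.random.randn(d_model, d_k) for _ in range(3))
    head = attention(X @ WQ, X @ WK, X @ WV)          # one head of multi-head attention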

2.2 Word Embedding

When dealing with natural language, there is always a problem: text is composed of words and characters, but computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own methods to calculate word embeddings for their own data sets. Word embedding methods can be separated into two groups: static word embedding and dynamic word embedding. In the remaining part, we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two introduced architectures, which can be used as part of Word2Vec to learn word embeddings, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared by all the words; thus, all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target. This helps the model to predict based on the whole context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The figure is in Figure 2.4.
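
Both architectures are available in gensim, where the sg flag switches between them; a toy sketch, assuming gensim >= 4:

    from gensim.models import Word2Vec

    sentences = [["patient", "asks", "about", "blood", "pressure"],
                 ["expert", "answers", "the", "medical", "question"]]

    cbow = Word2Vec(sentences, vector_size=50, window=2, sg=0, min_count=1)      # CBOW
    skipgram = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)  # Skip-gram
    print(cbow.wv.most_similar("blood", topn=2))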

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013].

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embeddings on separate local context windows instead of on the global co-occurrence counts; this may fail to exploit the huge amount of repetition in the data. With GloVe, the global corpus statistics are captured directly by its novel method.

Figure 2.5: The measurement of co-occurrence probabilities for selected context words and the target words ice and steam [Pennington et al. 2014].

The main intuition of GloVe comes from the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established: X represents the matrix of word-word co-occurrence counts, where the entry X_{ij} is the number of times word j occurs in the context of word i within a specific context window; X_i is the number of times any word occurs in the context of word i, i.e., X_i = Σ_k X_{ik}; and P_{ij} = P(j|i) = X_{ij}/X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam: we can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than with steam; similarly, gas co-occurs more frequently with steam than with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, whose cost function is

J = Σ_{i,j=1}^{V} f(X_{ij}) (w_i^T w̃_j + b_i + b̃_j - log X_{ij})²

where V is the size of the vocabulary, f(X_{ij}) is a weighting function, w_i is a word vector, w̃_j is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another word, namely X_{ij} = 0. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

f(x) = (x/x_max)^α   if x < x_max
f(x) = 1             otherwise

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output from unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section, we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing these two methods, pre-training is briefly discussed, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer to use deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al. 2010]. Also, Erhan et al. [2010] suggest that training deep architectures with standard schemes (random initialization) tends to make the parameters generalize poorly, whereas the improvement from the pre-training strategy is impressive for deep models. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In the case of NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since not only does the model need to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice: it can achieve up to 20% error reduction compared with other state-of-the-art models.

Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a representation that is a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word. The backward pass is similar, except it contains information about itself and the context after it. Each pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, those forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, an $L$-layer biLM computes $2L + 1$ representations:

\[
R_k = \left\{ x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \right\}
\]

For the final word representation used in downstream NLP tasks, ELMo combines all layers in $R_k$ into a single vector using task-specific weights:

\[
\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}
\]

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.
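To make the combination concrete, here is a minimal numpy sketch of the weighted layer combination; the layer states, weights and dimensions are toy values for illustration, not the actual biLM outputs.

```python
import numpy as np

def elmo_combine(layer_states, s_logits, gamma):
    """Collapse the per-layer representations of one token into a single
    task-specific vector: ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""
    s = np.exp(s_logits - np.max(s_logits))
    s = s / s.sum()                        # softmax-normalised layer weights
    return gamma * sum(w * h for w, h in zip(s, layer_states))

# Toy example: token layer plus L = 2 biLM layers, 4-dimensional states.
layers = [np.random.rand(4) for _ in range(3)]   # h_{k,0}, h_{k,1}, h_{k,2}
vector = elmo_combine(layers, s_logits=np.zeros(3), gamma=1.0)
```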

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al 2019]. Both approaches share the same objective functions during pre-training, and they perform similarly, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al 2019]. Instead of using unidirectional language models, Devlin et al [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it simply concatenates the left-to-right and right-to-left information; the representation still cannot simultaneously take advantage of both the left and right contexts. In contrast, the Transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT keeps only the encoder part. BERT adds a [CLS] token to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of its token embedding, segment embedding and position embedding.
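As an illustration of this input format, the sketch below uses the Huggingface tokenizer to show the [CLS]/[SEP] markup and the segment ids; the checkpoint name and the two sentences are only examples, not the project's actual inputs.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer.encode_plus("Is Allergex safe during pregnancy?",
                            "Can I take Allergex while pregnant?")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'is', ..., '[SEP]', 'can', ..., '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0 = sentence A, 1 = sentence B
```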

Devlin et al [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, there is a downside: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To mitigate this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the remaining 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this relation is hard to capture from language modeling directly, without knowing the meaning of the two sentences. By using NSP, the pre-trained model becomes very beneficial for these tasks. The pre-trained BERT model introduced here will be applied to the RQE and NLI tasks in this project. How to fine-tune it is covered in the following chapters.
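A schematic implementation of this masking strategy might look as follows; the 15%/80%/10%/10% splits are from the paper, while the token-level details are simplified for illustration.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: pick ~15% of positions; of those, 80% -> [MASK],
    10% -> a random token, 10% -> kept unchanged. Returns (inputs, labels),
    where labels hold the original token at masked positions, else None."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                     # the model must predict this token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"
        elif r < 0.9:
            inputs[i] = random.choice(vocab)
        # else: keep the original token
    return inputs, labels
```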

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) concerns a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. Or, in plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better: Honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us to find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multi-class (three-label) classification problem: entailment, neutral, and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRT's. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3 we describe work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; such applications could also help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), all targeting the medical domain. In this section and the next we discuss some of the work on RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to obtain the results [Abacha and Demner-Fushman 2016]. In the competition, the highlight was that using deep networks achieves better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al 2019], [Zhu et al 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al [2019] find that the results do not perform very well compared with pre-trained language models; they think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al [2019] also find that among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models across different tasks. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap, while the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al 2019]. In addition, Bhaskar et al [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain is by Romanov and Shivade [2018], who introduce MedNLI, a new public data set intended to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, an InferSent model and an ESIM model to establish baselines; among them, InferSent performs best. In the competition, as with the RQE task, the common approach was to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al 2019]. On the other hand, Wu et al [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use a Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they concatenate the results of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the test data set.

3.3 Augmenting Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general one expert is not enough, since the results can otherwise be too subjective, which affects the quality of the data [Romanov and Shivade 2018]. Many techniques exist for generating and labeling the required data sets. Snorkel DryBell [Bach et al 2019] is a novel way to label data sets using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes the work easier to trust. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data can have good quality, and they can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is prone to overfitting, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find that Health24 1 is a good choice for construction. Health24 is South Africa's leading consumer health website.

1 Health24: https://www.health24.com


It provides world-class information for a healthy lifestyle. Besides, there is an interactive tool that helps patients get their questions answered. These questions and answers are both public, and they are continuously updated. Hence we use these questions and answers to construct our data set.

4.2 Collecting the Data set

To collect the data from the Health24 website, we use Scrapy 2, a fast and powerful collaborative framework for extracting data from websites. There are 22 broad categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; there may be mismatches or misleading content between questions and other users' answers.

4.3 Analysis of the Data set

A total of 25705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23755 questions have answers and 1950 questions have not been answered yet. The raw data set is stored in result.csv; in this file the first column is the answer and the second column is the question. From the crawled data we remove sensitive information, such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see that the average lengths of the three different types are close together; they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we checked the original website, we found that some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to the patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                        Average Length   Maximum Length
Answered Questions      95.40            5417
Unanswered Questions    85.02            1890
Answers                 92.58            6441

Table 4.1: The length of questions and answers.


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. Detailed information about fine-tuning and the three scoring methods for filtering similar questions is given as well. We also describe the evaluation metrics.

5.1 Pre-processing

The sentences contain many medical concepts, which can appear either as abbreviations or as multi-word expressions standing for the same meaning. To expand the abbreviations and make the sentences as informative as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase. Then we remove all punctuation. Following that, we apply word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps the raw questions are transformed into a form more suitable for the following tasks.
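A minimal sketch of this pipeline using NLTK (assuming the punkt and stopwords resources have been downloaded) might look like this:

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-off setup: nltk.download('punkt'); nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question: str):
    text = question.lower()                                    # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)                               # tokenize
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

preprocess("Is it safe to take Allergex while pregnant?")
# -> ['safe', 'take', 'allergex', 'pregnant']
```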

5.2 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing classification we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the word representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

For fine-tuning the model on the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019]. There are respectively 8890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11232, 1395 and 1422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with RQE, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 for the RQE and NLI tasks respectively. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface 1.
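In outline, a fine-tuning loop with these hyperparameters might look like the sketch below; the checkpoint path, the data loader and the head size are assumptions for illustration (num_labels would be 2 for RQE and 3 for NLI), not the project's exact code.

```python
import torch
from transformers import BertForSequenceClassification

# Assumed local path to the SciBERT checkpoint; train_loader is assumed to
# yield batches of 16 dicts with input_ids, attention_mask,
# token_type_ids and labels.
model = BertForSequenceClassification.from_pretrained(
    "scibert_scivocab_uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

accum_steps = 4                 # gradient accumulation: effective batch 16 * 4
model.train()
for step, batch in enumerate(train_loader):
    loss = model(**batch).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```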

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones: it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al 2008].

1 Huggingface GitHub: https://github.com/huggingface/transformers


In our project we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set, feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels to the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; with 10 unanswered questions and 100 answered questions, there would be 1000 question pairs. However, not every pair is similar. To reduce the running time and time complexity and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced, otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to RQE, under the following assumption: if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred from the unanswered question. Thus the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set, we set thresholds on the output [Bishop 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
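The rejection rule itself is simple; a sketch over a batch of model logits:

```python
import torch

def predict_with_rejection(logits, threshold):
    """Keep a prediction only when the top class probability clears the
    threshold (0.6 for binary RQE, 0.4 for three-label NLI);
    None marks a rejected pair."""
    probs = torch.softmax(logits, dim=-1)
    conf, labels = probs.max(dim=-1)
    return [label.item() if c >= threshold else None
            for c, label in zip(conf, labels)]
```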

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a corpus [Rajaraman and Ullman 2011]. It consists of two parts: term frequency, which measures the number of times the word occurs in a particular document, and inverse document frequency, which measures how much information the word provides (a value close to 0 indicates a common word, and vice versa). These two concepts form TF-IDF:

\[
\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)
\]
\[
\mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d))
\]
\[
\mathrm{IDF}(t, D) = \frac{N}{\mathrm{count}(d \in D : t \in d)}
\]

where $t$ is the word we want to score, $d$ is its document, $D$ is the set of documents, and $N$ is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences we can use cosine similarity:

\[
\mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}
\]

where $A$ and $B$ are both vectors; in this case each element of $A$ and $B$ is the TF-IDF weight of one word.
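In practice this filtering step can be done with scikit-learn; a small sketch (the question strings and variable names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["first answered question ...", "second answered question ..."]
unanswered = ["an unanswered question ..."]

vectorizer = TfidfVectorizer()
answered_tfidf = vectorizer.fit_transform(answered)   # one vector per question
query_tfidf = vectorizer.transform(unanswered)
scores = cosine_similarity(query_tfidf, answered_tfidf)[0]
top_five = scores.argsort()[::-1][:5]                 # best candidate indices
```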

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find similar documents for a given document or query. Here is the formula:

\[
\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot |D| / d_{avg})}
\]

where:

• D is the document, Q is the query, and q_i is a term in the query Q

• f(q_i, D) is the number of times q_i occurs in the document D

• |D| is the length of the document

• d_{avg} is the average length of a document

• k_1 and b are both hyper-parameters; in general k_1 = 2 and b = 0.75
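Since BM25 is easy to implement directly, here is a small self-contained sketch; note that the smoothed IDF used below is one common variant and may differ slightly from the IDF in the thesis formula above.

```python
import math

def bm25(query_terms, doc_terms, docs, k1=2.0, b=0.75):
    """Score one document (doc_terms, a token list) against a query.
    docs is the whole collection (list of token lists), used for the
    IDF statistics and the average document length."""
    n = len(docs)
    d_avg = sum(len(d) for d in docs) / n
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in docs if q in d)           # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = doc_terms.count(q)                        # term frequency in doc
        score += idf * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc_terms) / d_avg))
    return score
```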

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al [2016] to convert the words into numeric representations. In this model, 22 million 200-dimensional word vectors were trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between sentences.
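A sketch of this sentence-level comparison with gensim follows; the vector file name is an assumption, and out-of-vocabulary tokens are simply skipped, which is exactly where the colloquial Health24 wording can lose information (see Chapter 6).

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed file name for the 200-dimensional PubMed vectors of Chiu et al. [2016].
word_vectors = KeyedVectors.load_word2vec_format("pubmed_200d.bin", binary=True)

def sentence_vector(tokens):
    """Sum the embeddings of in-vocabulary tokens; OOV tokens are dropped."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(word_vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```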

5.5 Evaluation Metrics

For evaluation we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align with the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 but the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct, with the formula accuracy = (TP + TN) / total. Recall measures the percentage of the total relevant results that are correctly classified; its formula is TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix.
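These quantities can be checked with scikit-learn on a toy example; note that sklearn's confusion matrix puts actual classes on the rows and predicted classes on the columns, the transpose of Table 5.1.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0]                 # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0]
print(confusion_matrix(y_true, y_pred))  # [[2 0]   TN FP
                                         #  [1 2]]  FN TP
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 4/5 = 0.8
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 2/3 ~= 0.667
```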


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potentially similar question) pair; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to the CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input consists of question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate p-values. This helps us determine the significance of the results in relation to the null hypothesis: if the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
\]

where

\[
s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
\]
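The same test is available in SciPy; for example, one could compare the CHQs baseline accuracies against the TF-IDF-augmented ones from Table 6.1 (equal_var=True gives the classic pooled-variance unpaired test written above):

```python
from scipy import stats

baseline  = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # CHQs accuracies
augmented = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]   # CHQs + TF-IDF

t_stat, p_value = stats.ttest_ind(augmented, baseline, equal_var=True)
# p_value is two-sided by default
```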

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, where colloquial words and typos cannot be avoided. There is thus a mismatch between the words in the pre-trained model and the words in the real questions, and some sentence vectors lack important information because words in the real questions cannot find corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are labeled not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, added data of such a small size might not help fine-tune the model as much as larger additions would.

On the other hand, recall measures the percentage of the total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods for the RQE task.

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: Added data numbers and p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all the p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task the added data numbered more than 8000. There are two conditions we apply when selecting added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferable from their relevant unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. Here we use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task. Moreover, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So when we add the extra data to the original data, it effectively adds novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. Such a small sample size cannot represent the p-value very accurately, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods for the NLI task.

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: Added data numbers and p-values for the NLI task.


                      Standard deviation of loss
MedNLI                0.01392
MedNLI + TF-IDF       0.02166
MedNLI + BM25         0.01726
MedNLI + word2vec     0.01351

Table 6.5: Standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them into the biomedical domain, such as patient question answering, brings many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set that can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; within it there are 23771 questions with answers and 1950 questions without answers. Our goal is to find answers among the existing answers for these 1950 questions. Basically, we use RQE to locate answered questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback. More specifically, we insert the constructed data into the official data and train the model again; if the accuracy increases, it reflects that the constructed data have high accuracy, and otherwise they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that could be improved:

• For this project we only have 1950 questions without answers, and based on the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit with a small amount of data. So, as a next step, we could expand our data; it would then be more persuasive to validate our idea.

• In the previous section we argued that the added data for the NLI task differ from the original NLI data. As a next step, we could find more suitable data of a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could do hyperparameter optimization; we could also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association.

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379.

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375.

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137–1155.

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250.

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE.

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174.

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209(8), 342–347.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054.

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625–660.

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136.

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE.

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669.

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63.

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171.

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109.

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752.

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681.

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652.

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426.

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388.

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: folders which store the source data and the generated data, .py files, and .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. items.py, middlewares.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy.1

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.2

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.3

load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.

For .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is used to fine-tune the BERT model and do k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab.4

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; the .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


Chapter 1

Introduction

In this chapter, the motivation is introduced in terms of the different existing data sets and the overfitting problem on small medical data sets. In Section 1.2 the main contributions are briefly discussed, followed by the outline of the thesis.

1.1 Motivation

Question answering (QA) is a downstream task in information retrieval in which the system automatically provides a set of correct answers to a question asked by users in natural language [Lende and Raghuwanshi 2016]. This topic is widely researched in the open domain. Medical questions are the second most searched thematic area on Google, and these topics accounted for 5% of all searches in 2016 [Cocco et al 2018]. Detecting the relations between patients' questions and experts' answers has practical meaning, since the Pew Research Center 1 showed that more than 35% of American adults have gone online to figure out what medical conditions they might have.

However, patient question answering is less studied. This domain has many challenges, such as the quality and size of the data sets, limited annotated data, and whether the content is close to reality. Some existing data sets relate to this topic, but they all have limitations. The Stanford Natural Language Inference (SNLI) data set [Gururangan et al 2018] is a large data set with high quality; it serves as a benchmark to evaluate natural language inference systems [Romanov and Shivade 2018], but it consists of short and simple sentences that cannot mimic human behaviour in reality. The Stanford Question Answering Dataset (SQuAD) [Rajpurkar et al 2016] consists of 100,000+ questions crowdsourced over a set of Wikipedia articles, where the answers are segments of text from the paragraphs. Although this data set has high quality and leads to outstanding results, it focuses on the open domain instead of a specific domain. I2b2 is a domain-specific large-scale synthetic data set in the domain of electronic medical records [Pampari et al 2018]; however, synthetic data sets perform worse than handcrafted data sets [Nguyen 2019].

1 Pew Research Center: https://www.pewresearch.org/internet/2013/01/15/health-online-2013/



On the other hand, the size of a medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked away for ethical reasons and obligatory agreements. Cost constraints and the lack of domain experts for annotation are other reasons for small data sets [Nguyen et al. 2019]. In the natural language processing (NLP) field, a model trained on a small data set is prone to overfitting and runs into the Out-Of-Vocabulary (OOV) problem [Luong et al. 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al. 2019]. In NLP tasks, unseen words occur frequently. If the data set is small, the model cannot generalize very well due to the unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks using supervised or unsupervised methods and then fine-tune on the specific downstream tasks [Chien and Kalita 2020]. This approach differs from older models, which apply a task-specific architecture to task-specific labeled data. Transfer learning also outperforms the older models, since pre-training supports better generalization [Erhan et al. 2010].

In this thesis, we use the state-of-the-art language model BERT [Devlin et al. 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we automatically provide existing answers when the original questions are similar to already-answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use a fine-tuned RQE model to check whether the answered questions are similar to the unanswered questions in Health24. Then we use a fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we know the unanswered question A is similar to the answered question B. The fine-tuned NLI model then determines whether the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using Consumer Health Questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the output of the fine-tuned RQE and NLI models when using them to locate similar questions and validate the answers. If the maximum probability is lower than the threshold, the data are rejected; otherwise we accept them. This helps us deal with ambiguous cases and reduce the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows:

In Chapter 2, we provide some background about different word embedding models and introduce the BERT model.

In Chapter 3, related work about RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. The evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT model.

In Chapter 7, we summarize the achievements as well as the limitations, and show the future work.


Chapter 2

Background

In this chapter, we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a basic idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech and dialogue. For this project, we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures in the next sections to help the readers better understand word embedding.

2.1.1 Neural Network Language Model

Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al. 2003]. The model is composed of four layers: input, projection, hidden and output layers. At the input layer, N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer using a shared projection matrix of size V × m, where m is the dimensionality of the feature vector associated with each word. After projection, each word is a row vector of dimensionality m. This is shown as C(w_{t-n+1}), ..., C(w_{t-2}) and C(w_{t-1}) in Figure 2.1.



Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al. 2003]

As for the hidden layer, it is the concatenation of the word features from the previous layer:

x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1}))

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}

The function y_i gives the unnormalized log-probability for each output word i:

y = b + Wx + U \tanh(d + Hx)

where b and d are the biases and H is the hidden layer weight [Bengio et al. 2003].


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the information a neuron receives at time-step t depends on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives. If the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units, called memory cells, to store information for a long time. They connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)

c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)

h_t = o_t \tanh(c_t)

where σ is the sigmoid function. A single memory cell consists of input gate, forget gate, output gate and cell activation vectors, represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so the input from element m can only be received by element m in each gate vector [Graves et al. 2013].

Within a memory cell, an LSTM step can be divided into four parts (a code sketch follows this list):

• Based on the current input and h_{t-1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, i_t determines what information we will update; then a new candidate state c̃_t is created to be added to the state.

• Update c_t by using the value in f_t, the old state, the value in i_t, and the candidate state; namely, updating c_t = f_t c_{t-1} + i_t c̃_t.

• As for the output gate, a sigmoid function determines what part of the cell state will be outputted. Then a tanh function is applied to output the decided part.

Figure 2.2: A single memory cell [Graves et al. 2013]
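To make the gate equations above concrete, here is a minimal NumPy sketch of a single LSTM step. The weight names mirror the equations; the dictionary layout and shapes are illustrative assumptions rather than the notation of Graves et al. [2013].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM time-step following the composite functions above.
    W maps names like 'xi' to weight arrays; the cell-to-gate weights
    'ci', 'cf', 'co' are diagonal, so they act element-wise."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + W["bi"])
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + W["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + W["bc"])
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c_t + W["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```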

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N=6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. The decoder has the same structure as the encoder, with a stack of N=6 identical layers. Instead of the two sub-layers in the encoder, the decoder has three; the extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al. 2017].

Figure 2.3: The Transformer architecture consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017]

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should receive more attention. For each token, three parameters (query, key and value) are generated during training by multiplying weights with the input embedding. The attention function is

Attention(Q, K, V) = softmax(QK^T / \sqrt{d_k}) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to d_k, d_k and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, \dots, head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

with W_i^Q \in R^{d_model × d_k}, W_i^K \in R^{d_model × d_k}, W_i^V \in R^{d_model × d_v} and W^O \in R^{h d_v × d_model}. The benefit is that it allows the model to learn subspace representations and helps the model focus on different positions; the final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of a token, because the model uses no recurrence or convolution.
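As a minimal illustration of the attention function above, here is a NumPy sketch of single-head scaled dot-product attention (the multi-head projections are omitted):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for arrays of
    shape (seq_len, d_k) for Q, K and (seq_len, d_v) for V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # token-token relatedness
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```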

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters; however, computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own ways to calculate word embeddings for their own data sets. Generally, word embedding can be separated into two kinds: static word embedding and dynamic word embedding. In the remaining sections, we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

Two architectures are introduced that can be used as part of Word2Vec to learn word embeddings: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared by all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target. This helps the model to predict based on the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both models are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013]

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, and this may fail to exploit the huge amount of repetition in the data. GloVe captures the global corpus statistics directly in a novel way.

The main intuition of GloVe comes from the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_{ij} is the number of times word j occurs in the context of word i within the specific context windows. X_i is the number of times any word occurs in the context of word i, equally \sum_k X_{ik}. Besides, P_{ij} = P(j|i) = X_{ij}/X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam: we can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than it does with ice.

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al. 2014]

Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is

J = \sum_{i,j=1}^{V} f(X_{ij}) (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2

where V is the size of the vocabulary, f(X_{ij}) is a weighting function, w_i is a word vector, \tilde{w}_j is a context word vector, and b_i and \tilde{b}_j are biases. The weighting function has the following properties. First, f(0) = 0, so that a word that never co-occurs with another word (i.e., X_{ij} = 0) contributes nothing. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

f(x) = (x / x_max)^α   if x < x_max
       1               otherwise

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
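As a small illustration, the weighting function can be written directly in Python (a sketch; the cutoff and exponent are the values quoted above):

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): zero for non-co-occurring pairs, non-decreasing,
    and capped at 1 so frequent co-occurrences are not overweighted."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```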


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. As a result, the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding, which considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly mentioned, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions through these deep architectures [Erhan et al. 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to give the parameters poor generalization, while the improvement from a pre-training strategy is impressive for deep models. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In the case of NLP, the basic idea of pre-training is that language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model needs to consider not only the complex characteristics of word use but also how words take different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, with up to 20% error reductions compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before this word. The backward pass is similar, except the state contains information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token t_k, there are 2L+1 representations in an L-layer biLM:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \}

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in R into a single vector based on some particular weights:

ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, \gamma^{task} is a scalar parameter, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}].

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer was introduced in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, segment embedding and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is masked language modeling (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this relationship is hard to capture from language modeling directly without knowing the meaning of the two sentences. With NSP, the pre-trained model is very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project. How to fine-tune it is introduced in the following chapters.

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. Or, in plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take Allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to Allergex tablets. I have allergies and have managed to control them using Allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using Allergex preconceptionally and/or during pregnancy.

• Question A → Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 type. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification (three classes) problem: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter, we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce some work using RQE and NLI on medical data sets to find relations between sentences. In Section 3.3, we demonstrate some work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, such applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), all targeting the medical domain. In this section and the next, we discuss some work on RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively [Abacha and Demner-Fushman 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019], [Zhu et al. 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that, among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap, and the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, an InferSent model and an ESIM model to establish baselines; among them, InferSent performs best. In the competition, as with the RQE task, the common approach is to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder, they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder, they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder, they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results are sometimes too subjective, which affects the quality of the data [Romanov and Shivade 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data sets using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced from the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes the work easier to convince others. Also, a reliable data set is crucial for ideas and models because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model overfits more easily, which affects the accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health241 is a good choice for construction. This is South Africa's leading consumer health website. It can provide

1 Health24: https://www.health24.com



world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional, while some mismatch or misleading content may exist between questions and other users' answers.
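A minimal Scrapy spider along these lines might look as follows. This is only an illustrative sketch: the start URL and CSS selectors are hypothetical placeholders, since the actual Health24 page structure is not documented here.

```python
import scrapy

class Health24Spider(scrapy.Spider):
    """Hypothetical spider; selectors and URL are illustrative only."""
    name = "health24_experts"
    start_urls = ["https://www.health24.com/experts/allergies"]  # one of the 22 categories

    def parse(self, response):
        for post in response.css("div.question-post"):
            yield {
                "question": post.css("div.question-text::text").get(),
                "answer": post.css("div.expert-answer::text").get(),  # None if unanswered
            }
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```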

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together, all around 85 to 95 words. However, the maximum length shows a huge difference. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                        Average Length   Maximum Length
Answered Questions           95.40            5417
Unanswered Questions         85.02            1890
Answers                      92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can appear either as abbreviations or as multi-word expressions, but they stand for the same meaning. To expand the abbreviations and make the sentences as informative as possible (e.g., expanding "BP" to "blood pressure"), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all words to lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
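A minimal sketch of this pipeline, assuming the NLTK library (the thesis does not name the exact toolkit, so the specific functions here are an assumption):

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires nltk.download("punkt") and nltk.download("stopwords") once.
STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(question):
    text = question.lower()                                            # 1. lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # 2. remove punctuation
    tokens = word_tokenize(text)                                       # 3. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]                # 4. drop stop words
    return [STEMMER.stem(t) for t in tokens]                           # 5. stem
```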

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence. So when solving a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, while the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased towards a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since the model improves performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface1.
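The essential structure of such a fine-tuning run can be sketched with the Hugging Face transformers library and the public SciBERT checkpoint; this is only an assumed outline (the thesis itself modifies run_classifier.py), but it shows the hyper-parameters quoted above in context:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # num_labels=3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def fine_tune(sentence_pairs, labels, epochs=1, batch_size=16, accum_steps=4):
    # Sentence pairs are tokenised jointly; BERT inserts [CLS] and [SEP].
    enc = tokenizer([a for a, _ in sentence_pairs],
                    [b for _, b in sentence_pairs],
                    truncation=True, padding=True, return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                            torch.tensor(labels))
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, (ids, mask, y) in enumerate(loader):
            loss = model(input_ids=ids, attention_mask=mask, labels=y).loss
            (loss / accum_steps).backward()        # gradient accumulation
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
```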

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback, which is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms that can help to discriminate relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance

1 Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the testing set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not all of them are similar. To reduce the running time as well as time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced, otherwise the model will be biased towards the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if an unanswered question and an answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method can help us handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
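A minimal sketch of this rejection rule over a batch of model outputs (assuming PyTorch logits, as in the fine-tuning sketch earlier):

```python
import torch.nn.functional as F

def predict_with_rejection(logits, threshold):
    """Apply the acceptance thresholds described above: 0.6 for the
    binary RQE task and 0.4 for three-label NLI. Pairs whose top
    softmax probability falls below the threshold are rejected."""
    probs = F.softmax(logits, dim=-1)
    top_prob, label = probs.max(dim=-1)
    return [int(l) if float(p) >= threshold else None   # None marks a rejected pair
            for p, l in zip(top_prob, label)]
```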

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman 2011]. It consists of two parts: term frequency, which measures the number of times the word occurs in a particular document, and inverse document frequency, which measures how much information the word provides. If the result is close to 0, it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = \log(1 + freq(t, d))

IDF(t, D) = \frac{N}{count(d \in D : t \in d)}

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = \frac{A \cdot B}{|A| \times |B|}

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = \sum_{i=1}^{n} IDF(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot |D| / d_{avg})}

where (a code sketch follows this list):

• D is the document and Q is the query; q_i is a term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_{avg} is the average length of a document.

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.
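A direct Python sketch of the scoring formula over tokenised documents. The smoothed IDF variant below is a common Robertson-style choice and an assumption on our part, since the thesis writes IDF(q_i, D) without spelling out its exact form:

```python
import math

def bm25(query, document, collection, k1=2.0, b=0.75):
    """Score a tokenised `document` against tokenised `query`;
    `collection` is the full list of tokenised documents, used for
    the IDF statistics and the average document length d_avg."""
    n = len(collection)
    d_avg = sum(len(d) for d in collection) / n
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in collection if q in d)   # documents containing q
        if n_q == 0:
            continue
        idf = math.log((n - n_q + 0.5) / (n_q + 0.5) + 1.0)
        freq = document.count(q)                     # f(q_i, D)
        score += idf * freq * (k1 + 1) / (
            freq + k1 * (1 - b + b * len(document) / d_avg))
    return score
```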

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a way of static word embedding. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between sentences.
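A minimal sketch with gensim; the file name is illustrative, assuming the Chiu et al. [2016] PubMed vectors are available locally in word2vec binary format:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical local path to the pre-trained PubMed embeddings.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    """Sum the embeddings of in-vocabulary tokens; OOV tokens are skipped,
    which is exactly the information loss discussed in Chapter 6."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0
```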

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align with the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). For multi-category classification the confusion matrix is more complex, but it follows the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)           TP                    FP
Predicted Negative (0)           FN                    TN

Table 5.1: The confusion matrix
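These metrics can be computed with scikit-learn (a sketch with made-up labels; note that scikit-learn's confusion_matrix puts the actual class on the rows, i.e. the transpose of the layout in Table 5.1):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # illustrative labels, not project data
y_pred = [1, 0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))   # (TP + TN) / total
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(confusion_matrix(y_true, y_pred)) # rows = actual, columns = predicted
```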


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the bottom part is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potential similar question; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to the CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the input of the testing step is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

where

s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
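In practice this pooled-variance test can be run with SciPy; a small sketch using the per-iteration accuracies from Table 6.1 (the one-sided correction reflects the directional hypothesis that the added data improve accuracy):

```python
from scipy import stats

# Per-iteration accuracies from Table 6.1: original CHQs vs CHQs + TF-IDF.
chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# equal_var=True gives the pooled-variance (unpaired) t-test above.
t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)
```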

From Table 6.2, we can see only the results of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use for Word2Vec is trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch exists between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors will lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority of labels generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not introduce much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number is more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider this case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest count among the classes in the result of the testing part as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of each class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.

Those are the reasons why the added data for the NLI task are small. Table 6.5 demonstrates the standard deviations of the loss over the four methods.

Combining Table 6.4 with Table 6.5, we find that the more data are added, the larger the standard deviation is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287 pairs), so when we add the extra data into the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is therefore affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, this small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7519   0.7549            0.7533          0.7501

Table 6.3: Accuracy of the different methods for the NLI task.

          Number of added pairs   p-value
TF-IDF    1,287                   0.3340
BM25      1,149                   0.4191
Word2Vec  459                     0.6661

Table 6.4: The number of added question pairs and the p-values for the NLI task.

                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve this project.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending these open domains into the biomedical domain, such as patient question answering, has many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; within this data set there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find if two questions are similar, we can pair one unanswered question with all the answered questions; however, this can generate a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us to roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data to train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through implementing the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, the added data still play a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits when using a small data set. So for the next step we can expand our data; then it will be more persuasive to validate our idea.

• In the previous chapter we argued that the added data for the NLI task are different from the original NLI data. For the next step we can find a more suitable data set with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step we can do hyperparameter optimization; also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: it includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data store the source data. result stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface [3].

load_file.py converts the raw source data to a dataframe; pre_processed_data.py pre-processes the sentences.

For .ipynb files:

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab [4].

[4] Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the models:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


On the other hand, the size of the medical data set is another concern, since a small data set can cause overfitting. Medical data sets tend to be locked due to ethical and obligatory agreements. Also, cost constraints and the lack of domain experts for annotation are other reasons which cause the small data sets [Nguyen et al., 2019]. In the natural language processing (NLP) field, a model is prone to overfitting when using small data sets, which leads to the Out-Of-Vocabulary (OOV) problem [Luong et al., 2014]. The OOV problem arises when an unseen token is encountered by a fixed-vocabulary language model and the model cannot handle it appropriately [Nguyen et al., 2019]. In NLP tasks there is a high occurrence of unseen words; if the data set is small, the model cannot generalize very well due to the unseen words, and overfitting will occur.

To address the above issues, we use recognizing question entailment (RQE) and natural language inference (NLI) to generate and expand a high-quality data set, in order to help researchers investigate patient question answering. The data set is based on Health24, a medical-specific forum, so it can mimic real-world biomedical question answering. The constructed data set can be used in data-driven approaches [Nguyen, 2019].

The current top-performing systems for RQE/NLI use transfer learning. Basically, they pre-train on generic tasks by using supervised or unsupervised methods, and then fine-tune on the specific downstream tasks [Chien and Kalita, 2020]. This approach differs from older models, which apply a task-specific architecture on the task-specific labeled data. Also, transfer learning outperforms the older models, since pre-training supports better generalization [Erhan et al., 2010].

In this thesis we use the state-of-the-art language model BERT [Devlin et al., 2018] to study question entailment and natural language inference in order to crowdsource more data. Generally speaking, we provide the existing answers automatically if the original questions are similar to the answered consumer medical questions.

1.2 Main Contribution

Our goal is to create and expand question-answer pairs which can be used in the domain of patient question answering. In general, we use the fine-tuned RQE model to check if the answered questions are similar to the unanswered questions in Health24. Then we use the fine-tuned NLI model to validate whether these similar answered questions' answers can be inferred from their unanswered questions. For example, in Health24, through the fine-tuned RQE model we know the unanswered question A is similar to the answered question B; the fine-tuned NLI model then determines if the answer of question B can be inferred from question A. Since we do not have ground truth, we use the idea of pseudo-relevance feedback and significance tests to evaluate these steps.

Our contributions can be summarized as follows:

• A collection of 25,721 medical question-answer pairs (including unanswered questions) collected from a trusted source, the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking if the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF and Word2Vec to filter out some dissimilar question pairs before applying the RQE model.

• Setting thresholds on the output of the fine-tuned RQE and NLI models when using the models to locate the similar questions and validate the answers. If the maximum probability is lower than the threshold, the pair is rejected; otherwise we accept it. This helps us deal with ambiguous cases and reduces the misclassification rate (see the sketch after this list).

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.
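The threshold rule in the fourth point can be illustrated with a small, hypothetical sketch; the 0.6 value corresponds to the RQE threshold mentioned later in the thesis, and the function name is ours:

import numpy as np

def accept_prediction(probs: np.ndarray, threshold: float = 0.6):
    """Return the predicted label, or None if the pair should be rejected."""
    label = int(np.argmax(probs))
    return label if probs[label] >= threshold else None

print(accept_prediction(np.array([0.55, 0.45])))  # None: too ambiguous, rejected
print(accept_prediction(np.array([0.81, 0.19])))  # 0: confident, accepted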

1.3 Outline

The thesis is organized as follows.

In Chapter 2 we provide some background on different word embedding models and introduce the BERT model.

In Chapter 3, related work on RQE, NLI and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. Evaluation is illustrated as well.

In Chapter 6 we demonstrate and discuss the results of the fine-tuned BERT models.

In Chapter 7 we conclude, summarizing the achievements as well as the limitations, and show the future work.


Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and importance of RQE and NLI are introduced as well, so that the readers can have a basic idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence; its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how computers can understand human language or model language. NLP is a broad area and includes many branches; the most commonly researched tasks are syntax, semantics, discourse, speech and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: input, projection, hidden and output layers. At the input layer, N previous words are encoded using one-hot encoding; each word has V dimensions, where V is the size of the vocabulary. The input layer is projected into the projection layer by using a shared projection matrix of size $V \times m$, where $m$ is the dimension of the feature vector associated with each word. After projection, each word is a row vector with dimensionality $m$. This is shown as $C(w_{t-n+1}), \dots, C(w_{t-2})$ and $C(w_{t-1})$ in Figure 2.1.

Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003].

As for the hidden layer, it is the concatenation of the word features from the previous layer:

$$x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1}))$$

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$$P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

where $y_i$ is the unnormalized log-probability for each output word $i$:

$$y = b + Wx + U \tanh(d + Hx)$$

where $b$ and $d$ are the biases and $H$ is the hidden layer weight matrix [Bengio et al., 2003].
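To make the computation concrete, here is a small numpy sketch of the NNLM forward pass described above; the dimensions are toy values chosen for illustration, not values from the original paper:

import numpy as np

V, m, h, n = 1000, 50, 100, 4        # vocabulary, embedding, hidden sizes; context length
C = np.random.randn(V, m)            # shared projection matrix (word features)
H = np.random.randn(h, (n - 1) * m)  # hidden layer weights
U = np.random.randn(V, h)            # hidden-to-output weights
W = np.random.randn(V, (n - 1) * m)  # direct input-to-output weights
b, d = np.zeros(V), np.zeros(h)      # output and hidden biases

def next_word_probs(context_ids):
    """P(w_t | w_{t-1}, ..., w_{t-n+1}) for a context of n-1 word ids."""
    x = np.concatenate([C[i] for i in context_ids])   # concatenated word features
    y = b + W @ x + U @ np.tanh(d + H @ x)            # unnormalized log-probabilities
    e = np.exp(y - y.max())                           # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([3, 17, 42])  # three previous word ids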


2.1.2 RNN and LSTM

Instead of the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles: the neurons get information at time-step t that depends on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradients vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, these continuous multiplications may cause the gradients to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has achieved considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units, the memory cells, to store information for a long time. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$$i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate and a cell activation vector, represented by $i$, $f$, $o$ and $c$ respectively. The $W$ weight matrices from the cell to the gate vectors are diagonal, so the input from element $m$ can only be received by element $m$ in each gate vector [Graves et al., 2013].

Figure 2.2: A single memory cell [Graves et al., 2013].

Within a memory cell, LSTM can be divided into four steps:

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, $i_t$ determines what information we will update; then a new candidate $\tilde{c}_t$ is created and added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$ and the current candidate state; namely, updating $c_t = f_t c_{t-1} + i_t \tilde{c}_t$.

• As for the output gate, a sigmoid function is used to determine what part of the cell state will be outputted. Then a tanh function is applied to output the decided part.
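As an illustration of these composite functions, a single LSTM step can be sketched in numpy as follows; the dimensions are toy values, and the diagonal peephole weights W_ci, W_cf, W_co are represented as vectors:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with peephole connections, following the equations above."""
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d_h, d_in)) for k in ("Wxi", "Wxf", "Wxc", "Wxo")}
p.update({k: rng.normal(size=(d_h, d_h)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.normal(size=d_h) for k in ("wci", "wcf", "wco", "bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)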

2.1.3 Transformer

RNN and LSTM are the traditional approaches for sequence modeling and transduction problems, such as language translation. The Transformer is another sequence transduction model, and it allows parallelization; it achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: the encoder and the decoder. The encoder consists of a stack of N = 6 identical layers, and each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented by the sub-layer itself. As for the decoder, the structure is the same as the encoder, except that it has three sub-layers instead of two. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which tokens should receive more attention. For each token, the three parameters (query, key and value) are generated during training by multiplying the weights with the input embedding. The attention function is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. The benefit is that it allows the model to do subspace representation learning and helps the model focus on different positions, so the final embedding is not just the word anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model does not use recurrence or convolution.
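The scaled dot-product attention above can be sketched directly in numpy (single head, toy sizes; the row-wise softmax is written out explicitly):

import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise over the sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

seq_len, d_k, d_v = 5, 64, 64
rng = np.random.default_rng(0)
out = attention(rng.normal(size=(seq_len, d_k)),
                rng.normal(size=(seq_len, d_k)),
                rng.normal(size=(seq_len, d_v)))   # shape (5, 64)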

2.2 Word Embedding

When dealing with natural language, it is always problematic that text is composed of words and characters: computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation can impact the model performance. Word embedding is the dominating approach to this problem; it is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own ways to calculate the word embeddings for their own data sets. Generally, word embedding can be separated into two kinds: static word embedding and dynamic word embedding. In the remainder of this chapter we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

Two architectures are introduced that can be used as part of Word2Vec to learn the word embedding: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, whose target is to predict the current word based on the context, Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared by all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target, which helps the model predict based on the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two models are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013].
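In practice, both architectures are available in libraries such as gensim. A minimal sketch with a toy corpus is shown below, where the sg flag switches between CBOW and Skip-gram; the corpus and hyperparameters are illustrative only:

from gensim.models import Word2Vec

corpus = [["patient", "asks", "about", "allergy"],
          ["doctor", "answers", "allergy", "question"]]

cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # Skip-gram
print(cbow.wv.most_similar("allergy", topn=2))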

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers think that although Word2Vec does a good job on analogy tasks, it only trains the word embeddings on separate local context windows instead of on the global co-occurrence counts; this may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures the global corpus statistics directly.

The main intuition of GloVe is the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. $X$ represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$ within the specific context windows. $X_i$ is the number of times any word occurs in the context of word $i$, equal to $\sum_k X_{ik}$. Besides, $P_{ij} = P(j \mid i) = X_{ij}/X_i$ is the probability that word $j$ appears in the context of word $i$. In Figure 2.5, suppose $i$ = ice and $j$ = steam; we can study the ratio of their co-occurrence probabilities with different context words $k$ to examine the relationship of these words. We can see that solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice.

Figure 2.5: The measurement of co-occurrence probabilities for the selected context words and the target words ice and steam [Pennington et al., 2014].

Based on these observations, GloVe proposes a new weighted least squares regression model, with the cost function

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

where $V$ is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i$ is a word vector, $\tilde{w}_j$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, $f(0) = 0$, for when one word does not co-occur with another word, i.e. $X_{ij} = 0$. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max}$ is 100 and $\alpha$ is 3/4. More detailed information can be found in Pennington et al. [2014].
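The weighting function and a single term of the cost can be sketched as follows; the word and context vectors here are toy values for illustration:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): ramps up for rare pairs, capped at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def cost_term(w_i, w_j, b_i, b_j, x_ij):
    """One (i, j) term of the GloVe weighted least-squares objective."""
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
print(cost_term(rng.normal(size=50), rng.normal(size=50), 0.0, 0.0, x_ij=12.0))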


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding. Dynamic word embedding considers the word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we introduce two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing these two methods, pre-training is briefly discussed, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture high-level information as much as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to give the parameters poor generalization, while the improvement is impressive for deep models using a pre-training strategy. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that the language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small number of labeled examples.

2.4.2 ELMo

Learning high-quality word representations is challenging, since not only does the model need to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method, called Embeddings from Language Models (ELMo), to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice: it can give up to 20% error reductions compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word; in other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together, and each of them has two passes: forward and backward. As for the forward pass, the internal state of a certain word contains information about itself and the context before this word. The backward pass is similar to the forward one, except that it contains information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector; this intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Josht].

In addition, those forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, there are $2L+1$ representations in an $L$-layer biLM:

$$R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R$ into a single vector based on some particular weights:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer was given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT adds [CLS] at the start of every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to get a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time (see the sketch below). On the other hand, BERT pre-trains for a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another, and this relationship is hard to capture from language modeling directly without knowing the meaning of the two sentences. By using NSP, the pre-trained model becomes very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
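The 80/10/10 masking rule can be sketched as follows; the token ids, vocabulary size and [MASK] id below are illustrative, not BERT's actual implementation:

import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """Pick ~15% of positions; 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    masked, targets = list(token_ids), [-1] * len(token_ids)  # -1 = not predicted
    for pos in range(len(token_ids)):
        if random.random() < mask_prob:
            targets[pos] = token_ids[pos]          # the model must predict this token
            r = random.random()
            if r < 0.8:
                masked[pos] = mask_id              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[pos] = random.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return masked, targets

masked, targets = mask_tokens([12, 55, 7, 901, 33], vocab_size=30522, mask_id=103)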

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. In plain English, it is to determine if question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me; my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes Type 2. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar questions related to the unanswered questions. Also, this step can provide useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from the given premise (P) [Romanov and Shivade, 2018]. This is a multilabel classification (three classes) problem: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it is obtained it is advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the similar answered questions' answers can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3 we demonstrate some work about how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; meanwhile, the applications could help improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), and the target is the medical domain. In this section and the next we discuss some work applied to RQE and NLI among the competition teams.

The initial work for RQE used lexical and semantic features as the input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to obtain the results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] found the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also found that, among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism based on BERT models. So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap, while the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurred when Romanov and Shivade [2018] introduced MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They used a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish baselines; among them, InferSent performed better than the other three methods. In the competition, as with the RQE task, the common approach was to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets by using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018]: they use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way of data collection. We also analyze the collected data set from different perspectives, such as the length and the content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations; for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes it easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is more prone to overfitting, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address, or phone number. Otherwise it will violate patients' privacy.

Based on the features described above, we find Health24¹ is a good choice for construction. This is South Africa's leading consumer health website, providing world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients that helps answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

¹Health24: https://www.health24.com

4.2 Collecting the Data Set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 top-level categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional, whereas mismatches or misleading content may exist between questions and other users' answers.

²Scrapy: https://scrapy.org
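
The crawler itself lives in the health24 folder described in Appendix 2; a minimal sketch of such a spider is shown below. The start URL and CSS selectors are hypothetical placeholders, since the real page structure has to be inspected before use:

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/experts"]  # hypothetical entry page

    def parse(self, response):
        # Each Q&A block is assumed to expose a question and an expert answer.
        for qa in response.css("div.qa-item"):          # hypothetical selector
            yield {
                "question": qa.css("div.question::text").get(),
                "answer": qa.css("div.answer::text").get(),
            }
        # Follow pagination, if present (hypothetical selector).
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)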

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My  month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see the average lengths of the three different types are close together: they are all around 85 to 95 words per sentence. However, the maximum lengths show a huge difference. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' replies cannot solve the patients' queries.



                         Average Length   Maximum Length
Answered Questions            95.40            5417
Unanswered Questions          85.02            1890
Answers                       92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is given as well. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but they all stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing approach for processing biomedical, scientific, or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
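
A minimal sketch of this expansion step, using ScispaCy's abbreviation detector as it was documented for the 0.2.x releases (the example sentence is hypothetical; the detector needs the long form to appear at least once in the text):

import spacy
from scispacy.abbreviation import AbbreviationDetector

# Load the biomedical model and attach the abbreviation detector.
nlp = spacy.load("en_ner_bionlp13cg_md")
nlp.add_pipe(AbbreviationDetector(nlp))

text = "The patient's BP was elevated. BP (blood pressure) should be monitored."
doc = nlp(text)

# Replace each detected short form with its long form.
expanded = text
for abrv in doc._.abbreviations:
    expanded = expanded.replace(str(abrv), str(abrv._.long_form))
print(expanded)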

When we use TF-IDF, BM25, and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their bases or roots. Through these steps, the raw questions are transformed into a more suitable form for the following tasks.
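
A minimal sketch of this pipeline, assuming NLTK's tokenizer, stop-word list, and Porter stemmer:

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time setup: nltk.download("punkt"); nltk.download("stopwords")

def preprocess(question):
    # Lowercase, then strip punctuation characters.
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    stops = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Tokenize, drop stop words, and stem the remaining tokens.
    return [stemmer.stem(tok) for tok in word_tokenize(text) if tok not in stops]

print(preprocess("My son was diagnosed with some allergies a few weeks ago"))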

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when solving a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems. The only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.
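
A sketch of this setup with the Hugging Face transformers library; the checkpoint name is the SciBERT model published by AllenAI, and the question pair is hypothetical:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = BertForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 3 for the NLI task

# The two questions are encoded as one sequence; the classification head
# sits on top of the pooled [CLS] representation.
inputs = tokenizer("Can allergies be cured?",
                   "Is there a cure for allergies?",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
probs = torch.softmax(logits, dim=-1)  # probabilities over the two labels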

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward the majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.

¹Huggingface GitHub: https://github.com/huggingface/transformers
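
A sketch of how these hyperparameters combine in a training loop (gradient accumulation over 4 micro-batches of size 16); the tensors below are random stand-ins for the encoded question pairs, not the real training data:

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Random stand-ins for tokenized CHQs question pairs and binary labels.
input_ids = torch.randint(0, 30000, (64, 32))
labels = torch.randint(0, 2, (64,))
loader = DataLoader(TensorDataset(input_ids, labels), batch_size=16)

grad_accum_steps = 4
model.train()
for step, (ids, y) in enumerate(loader):
    loss = model(input_ids=ids, labels=y)[0]   # first output is the loss
    (loss / grad_accum_steps).backward()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()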

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from the irrelevant ones. Pseudo-relevance feedback formulates new queries for a second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the testing set is to pair each unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we only take the top five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the details are introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferred from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.
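
A sketch of the candidate-filtering step, assuming a generic scoring function such as one of the similarity measures in Section 5.4; the question lists are hypothetical:

import heapq

def top_k_candidates(unanswered, answered, score, k=5):
    """For each unanswered question, keep only the k highest-scoring
    answered questions as candidate pairs for the RQE model."""
    pairs = []
    for uq in unanswered:
        best = heapq.nlargest(k, answered, key=lambda aq: score(uq, aq))
        pairs.extend((uq, aq) for aq in best)
    return pairs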

When using the fine-tuned model to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning labels based on the maximum probabilities alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle some ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
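
A minimal sketch of this rejection rule, assuming softmax probabilities from the classifier:

import torch

def label_with_rejection(logits, threshold):
    """Return the predicted label, or None if the most probable class
    falls below the rejection threshold (0.6 for RQE, 0.4 for NLI)."""
    probs = torch.softmax(logits, dim=-1)
    conf, label = probs.max(dim=-1)
    return label.item() if conf.item() >= threshold else None

# Example: an ambiguous binary prediction is rejected.
print(label_with_rejection(torch.tensor([0.2, 0.3]), threshold=0.6))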

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is the term frequency, which measures the number of times the word occurs in a particular document. The other is the inverse document frequency, which measures how much information the word provides. If the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

\[ \mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D) \]
\[ \mathrm{TF}(t, d) = \log(1 + freq(t, d)) \]
\[ \mathrm{IDF}(t, D) = \log \frac{N}{count(d \in D : t \in d)} \]

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

\[ similarity(A, B) = \frac{A \cdot B}{|A| \times |B|} \]

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.
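
A sketch of this computation with scikit-learn, which handles both the TF-IDF weighting and the cosine similarity; the two questions are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = ["my son was diagnosed with allergies",
             "can allergies in children be cured"]
vectors = TfidfVectorizer().fit_transform(questions)

# Cosine similarity between the two TF-IDF vectors.
print(cosine_similarity(vectors[0], vectors[1])[0, 0])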

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

\[ \mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot |D| / d_{avg})} \]

where:

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k_1 and b are both hyper-parameters. In general, k_1 = 2 and b = 0.75.
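
A self-contained sketch of this scoring function over a small tokenized corpus; the IDF below uses a common Okapi variant, and the example documents are hypothetical:

import math

def bm25_score(query, doc, corpus, k1=2.0, b=0.75):
    """Score one tokenized document against a tokenized query."""
    d_avg = sum(len(d) for d in corpus) / len(corpus)
    n = len(corpus)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)      # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = doc.count(term)                           # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

corpus = [["allergy", "symptoms", "in", "children"],
          ["blood", "pressure", "medication"]]
print(bm25_score(["allergy", "children"], corpus[0], corpus))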

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.8 billion tokens) [Tilbury, 2018]. For each sentence, we add all the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
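
A sketch of this sentence-similarity computation with gensim, assuming the Chiu et al. embeddings have been downloaded locally (the filename is a placeholder):

import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained biomedical vectors (path is hypothetical).
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary words; OOV words are skipped.
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(cosine(sentence_vector(["allergy", "child"]),
             sentence_vector(["allergy", "son"])))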

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances that align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances that align with the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). As for multi-category classification, the confusion matrix is more complex, but it follows the same idea.
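
A sketch of these metrics with scikit-learn on hypothetical predictions (note that scikit-learn's confusion matrix puts true classes on the rows, the transpose of Table 5.1):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))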

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)            TP                     FP
Predicted Negative (0)            FN                     TN

Table 5.1: The confusion matrix


Figure 5.1: The diagram shows how the RQE and NLI tasks work. The upper part is RQE and the bottom part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potential similar question) pair. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the testing input consists of question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This method helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \]

where

\[ s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2} \]
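
A sketch of the same test with SciPy, using the CHQs and CHQs + TF-IDF accuracy columns of Table 6.1 as the two samples; scipy's ttest_ind performs the pooled-variance unpaired test (its p-value is two-sided, so this snippet illustrates the computation rather than reproducing the exact reported values):

from scipy import stats

chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# Pooled-variance (unpaired) two-sample t-test.
t_stat, p_value = stats.ttest_ind(chqs_tfidf, chqs)
print(t_stat, p_value)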

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis, namely that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value of Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec is trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because the words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label the RQE model generates is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected. That is the reason why the added data for Word2Vec are the smallest. Furthermore, small added data might not help the model fine-tune as well as large added data would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

                 CHQs    CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1     0.9682      0.9709          0.9693         0.9710
Iteration 2     0.9655      0.9731          0.9708         0.9699
Iteration 3     0.9676      0.9742          0.9747         0.9710
Iteration 4     0.9682      0.9692          0.9709         0.9726
Iteration 5     0.9731      0.9775          0.9743         0.9715
Average         0.9685      0.9720          0.9746         0.9712

Table 6.1: The accuracy of the different methods for the RQE task

             Added data number   p-value
TF-IDF             8084           0.0380
BM25               7390           0.0499
Word2Vec           1830           0.0514

Table 6.2: The added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all the p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data are more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number for each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining Table 6.4 with this, we find that the more added data there are, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, a small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

                MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1     0.7527       0.7601            0.7562           0.7453
Iteration 2     0.7522       0.7426            0.7426           0.7515
Iteration 3     0.7504       0.7469            0.7554           0.7540
Iteration 4     0.7402       0.7593            0.7536           0.7430
Iteration 5     0.7640       0.7657            0.7586           0.7569
Average         0.7523       0.7550            0.7533           0.7501

Table 6.3: The accuracy of the different methods for the NLI task

             Added data number   p-value
TF-IDF             1287           0.3340
BM25               1149           0.4191
Word2Vec            459           0.6661

Table 6.4: The added data numbers and p-values for the NLI task

36 Discussion on Results

                       Standard deviation of loss
MedNLI                        0.01392
MedNLI + TF-IDF               0.02166
MedNLI + BM25                 0.01726
MedNLI + Word2Vec             0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude the project and raise some ideas that future work can pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, brings many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair each unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Since we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data to train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role there.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes smaller. Besides, BERT is a complex model; it easily overfits on a small data set. So for the next step, we can expand our data. Then it will be more persuasive to validate our idea.

• In the previous section, we noted that the added data for the NLI task differ from the original NLI data. For the next step, we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done in Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for these projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. items.py, middlewares.py, pipelines.py, settings.py, and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by testing the RQE and NLI models.

For py files:

bm25.py, tfidf.py, and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface³.

load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.

For ipynb files:

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab⁴.

⁴Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:

pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell. If you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.



questions) collected from a trusted source, such as the Health24 website.

• A study of the BERT model applied to RQE and NLI, using consumer health questions (CHQs) and MedNLI to fine-tune the RQE and NLI models respectively.

• In order to reduce the running time when checking whether the answered questions are similar to the unanswered questions, we apply BM25, TF-IDF, and Word2Vec to filter some dissimilar question pairs before applying the RQE model.

• We set thresholds on the outputs of the fine-tuned RQE and NLI models when using them to locate similar questions and validate the answers. If the maximum probability is lower than the threshold, the data are rejected; otherwise, we accept them. This helps us deal with some ambiguous cases and reduce the misclassification rate.

• An evaluation of the RQE and NLI models using pseudo-relevance feedback and significance tests. The results show our approach is feasible for the RQE task; however, the method is more limited on the NLI task.

1.3 Outline

The thesis is organized as follows.

In Chapter 2, we provide some background on different word embedding models and introduce the BERT model.

In Chapter 3, related work about RQE, NLI, and augmenting data sets is demonstrated.

In Chapter 4, the motivation for the data set and its analysis are explained.

In Chapter 5, the methods and the implementation of fine-tuning are explained. Besides, we introduce three different scoring methods to filter the data. The evaluation is illustrated as well.

In Chapter 6, we demonstrate and discuss the results of the fine-tuned BERT models.

In Chapter 7, we conclude by summarizing the achievements as well as the limitations, and we outline future work.


Chapter 2

Background

In this chapter, we briefly introduce the background of NLP with some basic models. For the purposes of this project, different types of word embedding architectures are explained in order to help readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that readers can have a basic idea of the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; the most commonly researched tasks are syntax, semantics, discourse, speech, and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP, because they describe the relationships between words. We introduce three different language model architectures to help readers better understand word embedding in the next section.

2.1.1 Neural Network Language Model

The Neural Network Language Model (NNLM) is a central model in NLP, and many other models are based on it. Basically, the goal of the NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: input, projection, hidden, and output layers. At the input layer, the N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer using a shared projection matrix of size V × m, where m is the number of features associated with each word. After projection, each word is a row vector with dimensionality m. This can be seen as C(w_{t-n+1}), ..., C(w_{t-2}), and C(w_{t-1}) in Figure 2.1.



Figure 2.1: The NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer, and output layer [Bengio et al., 2003].

As for the hidden layer, it is fed the concatenation of the word feature vectors from the previous layer:

\[ x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})) \]

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

\[ P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}} \]

The function y_i gives the unnormalized log-probability for each output word i:

\[ y = b + Wx + U \tanh(d + Hx) \]

where b and d are the biases and H is the hidden layer weight matrix [Bengio et al., 2003].
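
A compact numpy sketch of this forward pass, with hypothetical sizes (vocabulary V, embedding size m, context length n−1, hidden size h) and randomly initialized parameters standing in for trained ones:

import numpy as np

V, m, n_ctx, h = 1000, 50, 3, 64      # vocab, embedding, context words, hidden
C = np.random.randn(V, m)             # shared projection (embedding) matrix
H = np.random.randn(h, n_ctx * m)     # input-to-hidden weights
d = np.zeros(h)                       # hidden bias
U = np.random.randn(V, h)             # hidden-to-output weights
W = np.random.randn(V, n_ctx * m)     # direct input-to-output weights
b = np.zeros(V)                       # output bias

context = [5, 42, 7]                  # indices of the n-1 previous words
x = np.concatenate([C[w] for w in context])
y = b + W @ x + U @ np.tanh(d + H @ x)
p_next = np.exp(y - y.max()) / np.exp(y - y.max()).sum()  # softmax over vocab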


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons' information at time-step t depends on the output at time-step t−1. In other words, at each step, the output of the current time-step depends on the previous one. An RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that an RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives. If the values of the derivatives are too small or too large, the continuous multiplications may cause the gradients to explode or vanish. Besides, there is some evidence showing that it is difficult for a traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It shows considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units to store information for a long time: memory cells. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

\[ i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \]
\[ f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \]
\[ c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \]
\[ o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \]
\[ h_t = o_t \tanh(c_t) \]

where σ is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate, and cell activation vectors, represented by i, f, o, and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so element m of each gate vector only receives input from element m of the cell vector [Graves et al., 2013].

Within a memory cell, an LSTM step can be divided into four parts (a minimal sketch follows the list):

• Based on the current input and h_{t−1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise, the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, the input gate determines what information we will update; then a new candidate state is created and added to the cell state.

• Update c_t by using the value of f_t, the old state, the value of i_t, and the current candidate state; namely, updating to f_t c_{t-1} plus i_t times the candidate state.

• As for the output gate, a sigmoid function determines what part of the cell state will be outputted. Then a tanh function is applied to output the decided part.

Figure 2.2: A single memory cell [Graves et al., 2013]
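
A minimal numpy sketch of one LSTM step under these equations; the diagonal peephole weights W_ci, W_cf, and W_co are represented as vectors applied elementwise, and all parameters are random stand-ins for trained ones:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p["Wxi"] @ x + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x + p["Who"] @ h_prev + p["wco"] * c + p["bo"])
    return o * np.tanh(c), c

d = 4  # hidden/cell size (hypothetical)
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d, d))
     for k in ["Wxi", "Whi", "Wxf", "Whf", "Wxc", "Whc", "Wxo", "Who"]}
p.update({k: rng.normal(size=d)
          for k in ["wci", "wcf", "wco", "bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.normal(size=d), np.zeros(d), np.zeros(d), p)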

2.1.3 Transformer

RNNs and LSTMs are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure, we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention, and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder, with a stack of N = 6 identical layers, except that each layer has three sub-layers instead of two. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017].

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and to find out which one should receive more attention. For each token, three parameters (query, key, and value) are generated during training by multiplying weights with the input embedding. The attention function is

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{QK^T}{\sqrt{d_k}} \right) V \]

where d_k is the dimension of the queries and keys. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys, and values with learned linear projections to d_k, d_k, and d_v dimensions respectively:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O \]
\[ \text{where } \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \]

where \( W_i^Q \in \mathbb{R}^{d_{model} \times d_k} \), \( W_i^K \in \mathbb{R}^{d_{model} \times d_k} \), \( W_i^V \in \mathbb{R}^{d_{model} \times d_v} \), and \( W^O \in \mathbb{R}^{h d_v \times d_{model}} \). The benefit is that this allows the model to learn representations in different subspaces and helps it focus on different positions, so the final embedding is no longer just the word itself. Positional encoding is the way to inject information about the relative or absolute position of each token, since the model uses no recurrence or convolution.
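
A small numpy sketch of scaled dot-product attention over hypothetical matrices:

import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))    # 3 query tokens, d_k = 8
K = rng.normal(size=(5, 8))    # 5 key tokens
V = rng.normal(size=(5, 16))   # values, d_v = 16
print(attention(Q, K, V).shape)  # (3, 16)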

2.2 Word Embedding

When dealing with natural language, one persistent problem is that text is composed of words and characters; computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation can impact model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. Generally, when starting an NLP project, people begin by downloading a pre-trained embedding or by calculating word embeddings for their own data sets. Word embedding approaches can be separated into two groups: static word embedding and dynamic word embedding. In the remainder of this chapter, we introduce several different word embedding approaches in order of complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two architectures, which can be used as part of Word2Vec to learn word embeddings, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural network language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected to the same position. Also, in the input layer, CBOW adds words from the future rather than only words before the target, which helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The architectures are shown in Figure 2.4.

Figure 2.4: The two architectures that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013].

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embeddings on separate local context windows instead of on global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures global corpus statistics directly.

The main intuition of GloVe comes from the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_ij is the number of times word j occurs in the context of word i within the specified context windows. X_i is the number of times any word occurs in the context of word i, equal to Σ_k X_ik. Besides, P_ij = P(j|i) = X_ij / X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can

Figure 2.5: The co-occurrence probabilities for the target words ice and steam with selected context words [Pennington et al., 2014].

study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than with steam; similarly, gas co-occurs more frequently with steam than with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, with the cost function

\[ J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2 \]

where V is the size of the vocabulary, f(X_ij) is a weighting function, w_i is a word vector, w̃_j is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another word, namely X_ij = 0. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

\[ f(x) = \begin{cases} (x / x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases} \]

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
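
A small sketch of this weighting function on hypothetical co-occurrence counts:

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows as (x / x_max)^alpha, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in [1, 10, 100, 1000]:
    print(count, glove_weight(count))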


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in NLP, they fail to capture polysemy (for example, does the word "bank" mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This means the word vectors cannot be adjusted based on context. To address these issues, there is a new trend going from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section, we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly discussed, since both methods use pre-training to learn general parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures such as neural networks with many hidden layers. The reason is that complicated functions representing high-level abstractions can be learned by applying these deep architectures [Erhan et al., 2010]. Erhan et al. [2010] also suggest that training deep architectures with standard schemes (random initialization) tends to yield parameters that generalize poorly, whereas the improvement from a pre-training strategy is impressive. Nowadays, language model pre-training has substantially advanced the state of the art across many different types of natural language processing tasks [Dong et al., 2019]. In NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus of labeled or unlabeled data; the pre-trained model is then fine-tuned on a downstream task with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: the model needs to consider not only the complex characteristics of word use but also how word meanings vary across linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. Experiments show that ELMo works extremely well in practice, with up to 20% error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning each word its embedding; in other words, each token is assigned a representation that is a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), hence 'embeddings from language models'. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it. The backward pass is similar, except the state contains information about the word itself and the context after it. The pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The final word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, the forward and backward passes form a multilayer LSTM, as shown in Figure 2.6. The intermediate word vectors produced by the lower layer are fed into the higher layer. Basic syntactic information is captured in the lower layer, while semantic information is captured in the higher layer. For each


token $t_k$, an $L$-layer biLM computes $2L+1$ representations:

\[
R_k = \left\{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \right\}
\]

For the final word representation used in downstream NLP tasks, ELMo collapses all layers in $R_k$ into a single vector with task-specific weights:

\[
\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}
\]

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.
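As a concrete illustration, the sketch below (Python with numpy; the function and variable names are ours, not from the ELMo release) computes the task vector as the softmax-weighted sum of the layer representations:

```python
import numpy as np

def elmo_task_vector(layer_reps, s_task, gamma_task=1.0):
    # Collapse the L+1 layer representations h_{k,j} (j = 0..L) of one token
    # into a single task-specific vector: gamma^task * sum_j s_j^task * h_{k,j}
    s = np.exp(s_task) / np.exp(s_task).sum()  # softmax-normalize the weights
    return gamma_task * sum(w * h for w, h in zip(s, layer_reps))

# Toy usage: a 2-layer biLM yields 3 representations per token (x_k plus 2 layers)
layer_reps = [np.random.rand(1024) for _ in range(3)]
vector = elmo_task_vector(layer_reps, s_task=np.zeros(3))  # equal weights here
```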

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream task [Peters et al., 2019]. Both approaches share the same objective functions during pre-training and perform similarly, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT (Bidirectional Encoder Representations from Transformers) to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it simply concatenates the left-to-right and right-to-left information; the representation still cannot simultaneously take advantage of both left and right contexts. By contrast, the Transformer in BERT is deeply bidirectional, since it computes attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of its token embedding, segment embedding, and position embedding.
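The following snippet shows how a question pair is packed into this format. We use the public bert-base-uncased checkpoint only as a stand-in here (the project itself uses SciBERT); the question texts are illustrative.

```python
from transformers import BertTokenizer

# Tokenize a question pair the way BERT expects: [CLS] q1 [SEP] q2 [SEP]
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("Is allergex safe during pregnancy?",
                "Can I take allergex tablets while pregnant?")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
```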

Devlin et al. [2018] propose two unsupervised tasks for pre-training: masked language modeling (masked LM) and next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens at random and then predicts these masked tokens. Although this allows a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To mitigate this, BERT replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and keeps the original token the remaining 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this relationship is hard to capture from language modeling directly without modeling the pair of sentences. With NSP, the pre-trained model is very beneficial for these tasks. The pre-trained BERT model is applied to the RQE and NLI tasks in this project; how it is fine-tuned is introduced in the following chapters.
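A minimal sketch of the 80/10/10 masking scheme, assuming a pre-tokenized sentence and a vocabulary list (both hypothetical here):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # BERT-style masking: of the ~15% selected positions, 80% become [MASK],
    # 10% become a random token, and 10% keep the original token
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                   # position the model must predict
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)  # random replacement
            # else: keep the original token unchanged
    return out, targets

masked, targets = mask_tokens(["my", "face", "is", "swollen"],
                              vocab=["my", "face", "is", "swollen", "pain"])
print(masked, targets)
```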

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) captures a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. In plain English, it determines whether question A and question B are similar questions. The task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better, Honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just between questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), contradicted (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multilabel classification problem with three classes: entailment, neutral, and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy, and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once its obtained its advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task validates whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the work related to our goals and methods. We also explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that applies RQE and NLI to medical data sets in order to find relations between sentences. In Section 3.3 we present work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques for inference and entailment in the medical domain; such applications could help improve domain-specific information retrieval and question answering systems. The competition includes three tasks targeting the medical domain: NLI, RQE, and question answering (QA). In this section and the next, we discuss some of the approaches the competing teams applied to RQE and NLI.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes, and J48 respectively [Abacha and Demner-Fushman, 2016]. In the competition, the highlight was that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. After trying different traditional deep learning methods, Zhu et al. [2019] found their results do not perform very well compared with pre-trained language models. They argue one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider context. Zhu et al. [2019] also found that among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set that reduces the gap for specialized, knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, the InferSent model, and the ESIM model to establish baselines; among them, InferSent performs best. In the competition, as with the RQE task, the common approach is to train models from the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. Its basic model consists of three encoders (a syntax encoder, a text encoder, and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they concatenate the outputs of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general one expert is not enough, since a single annotator's judgments can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many techniques exist for generating and labeling the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced from the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to produce data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we explain the motivation for choosing the data set and demonstrate how the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering, and the existing medical query data sets have some limitations: for example, the data are too simple to mimic real human behavior, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it sets the standard in this domain and makes the work easier to trust. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients rather than found or extracted from relevant professional materials, and the answers should come from certified doctors. In this way the constructed data have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model more easily overfits, which affects accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses, or phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find that Health24¹ is a good choice for construction. It is South Africa's leading consumer health website and provides

¹Health24: https://www.health24.com



world-class information for a healthy lifestyle. Besides, there is an interactive tool that helps patients get their questions answered. The questions and answers are both public and are continuously updated. Hence we use these questions and answers to construct our data set.

4.2 Collecting the Data Set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 top-level categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.
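A minimal sketch of what such a Scrapy spider looks like. The spider name matches the `scrapy crawl health` command in Appendix 3, but the start URL and CSS selectors below are placeholders, not Health24's real page structure.

```python
import scrapy

class HealthSpider(scrapy.Spider):
    """Crawl question/expert-answer pairs (hypothetical selectors)."""
    name = "health"
    start_urls = ["https://www.health24.com/experts"]  # hypothetical entry page

    def parse(self, response):
        for qa in response.css("div.question-block"):   # hypothetical selector
            yield {
                "question": qa.css("p.question::text").get(),
                "answer": qa.css("p.expert-answer::text").get(),
            }
```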

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under the 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file the first column is the answer and the second column is the question. For the crawled data we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the position blank, since substitute words might affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. The average lengths of the three different types are close together, all around 85 to 95 words per sentence; the maximum lengths, however, differ hugely. Checking the original website, we find that some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide step-by-step answers corresponding to the patients' questions, telling them what they should and should not do. Thus we keep these informative sentences in the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

²Scrapy: https://scrapy.org


                          Average Length   Maximum Length
    Answered Questions         95.40            5417
    Unanswered Questions       85.02            1890
    Answers                    92.58            6441

Table 4.1: The length of questions and answers.


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. We also give detailed information about fine-tuning and the three scoring methods used to filter similar questions, and we describe the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be represented either as abbreviations or as multi-word expressions that stand for the same meaning. To expand the abbreviations and make the sentences as informative as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25, and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words into lowercase and then remove any punctuation. Following that, we tokenize the words to split the sentences into individual units. We also remove the stop words and apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
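A minimal sketch of this pipeline using nltk (the function name is ours):

```python
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires: nltk.download("punkt") and nltk.download("stopwords")
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(question):
    # lowercase -> strip punctuation -> tokenize -> drop stop words -> stem
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(t) for t in word_tokenize(text) if t not in stop_words]

print(preprocess("Is it safe to take Allergex while pregnant?"))
```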

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (for 'classification') to the start of each sequence, so for a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the result, instead of summing all the word representations. Besides, RQE and NLI are both



classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

For fine-tuning the model on the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets when fine-tuning.

The data set used for the NLI task is MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. It includes 11,232, 1,395, and 1,422 sentence pairs in the training, validation, and test sets respectively, and the labels are contradiction, entailment, and neutral. As with RQE, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

We use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we follow the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we train for 1 epoch for RQE and 2 epochs for NLI. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface¹.
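The following sketch shows how these hyperparameters fit together in a transformers-based training step; it is a simplified illustration of the setup, not the project's actual run_classifier.py code, and the folder path is the unpacked SciBERT model from Appendix 3.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hyperparameters from the thesis: batch size 16, gradient accumulation 4, lr 2e-5
model = BertForSequenceClassification.from_pretrained(
    "scibert_scivocab_uncased", num_labels=2)          # num_labels=3 for NLI
tokenizer = BertTokenizer.from_pretrained("scibert_scivocab_uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 4

def training_step(step, question_a, question_b, label):
    # One gradient-accumulation step over a single question pair
    batch = tokenizer(question_a, question_b, return_tensors="pt")
    loss = model(**batch, labels=torch.tensor([label])).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```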

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether an answered question is similar to an unanswered question in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms that can help discriminate relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance

¹Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with every answered question; with 10 unanswered questions and 100 answered questions, this would yield 1,000 question pairs, and not every pair is plausible. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions and take only the top five question pairs. This step reduces the number of pairs in the test set by filtering out obviously dissimilar pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced, otherwise the model will be biased toward the majority class and the result will not be accurate. The NLI task follows similar steps, with one assumption: if an unanswered question and an answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts, but due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
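The rejection rule itself is simple; a sketch (function name ours):

```python
def accept_label(probs, threshold):
    # probs: softmax output of the fine-tuned model for one pair
    # threshold: 0.6 for the binary RQE task, 0.4 for the three-label NLI task
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None  # None = rejected pair

print(accept_label([0.55, 0.45], 0.6))  # ambiguous -> None (rejected)
print(accept_label([0.72, 0.28], 0.6))  # confident -> label 0 (accepted)
```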

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: the term frequency measures the number of times the word occurs in a particular document, and the inverse document frequency calculates how much information the word provides. If the latter is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF:

\[
\text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)
\]
\[
TF(t, d) = \log(1 + freq(t, d))
\]
\[
IDF(t, D) = \log \frac{N}{count(d \in D : t \in d)}
\]

where $t$ is the word we want to score, $d$ is its document, $D$ is the set of documents, and $N$ is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences, we use cosine similarity:

\[
\text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \times \lVert B \rVert}
\]

where $A$ and $B$ are both vectors; in this case each element of $A$ and $B$ is one word's TF-IDF value.

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find the documents similar to a particular given document or query:

\[
BM25(D, Q) = \sum_{i=1}^{n} IDF(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left( 1 - b + b \cdot \frac{|D|}{d_{avg}} \right)}
\]

where:

• $D$ is the document, $Q$ is the query, and $q_i$ is the $i$-th term in the query $Q$;

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document $D$;

• $|D|$ is the length of the document;

• $d_{avg}$ is the average document length;

• $k_1$ and $b$ are hyper-parameters; in general $k_1 = 2$ and $b = 0.75$.
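A direct Python implementation of the formula over tokenized documents, using the common smoothed IDF variant (the exact IDF form is our choice here):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=2.0, b=0.75):
    # Score one tokenized document `doc` against a tokenized query
    n = len(corpus)
    d_avg = sum(len(d) for d in corpus) / n
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)            # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # smoothed BM25 IDF
        f = doc.count(q)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

corpus = [["allergex", "tablets", "pregnancy"], ["honey", "sugar", "diabetes"]]
print(bm25_score(["allergex", "pregnant"], corpus[0], corpus))
```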

5.4.3 Word2Vec

We introduced Word2Vec, a static word embedding method, in Section 2.3.1. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations; in this model, about 2.2 million 200-dimensional word vectors are trained on PubMed [Tilbury, 2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between the sentences.
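A sketch of this step with gensim; "pubmed_w2v.bin" is a placeholder path for the downloaded Chiu et al. [2016] vectors, and the helper names are ours:

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path for the pre-trained PubMed Word2Vec embeddings
wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    # Sum the 200-d vectors of every token found in the vocabulary
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(sentence_vector(["allergex", "pregnant"]),
             sentence_vector(["allergex", "pregnancy"])))
```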

5.5 Evaluation Metrics

For evaluation we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP counts the instances correctly assigned to the positive class. FP means the true class is 0 but the predicted class is 1; FN is the opposite, where the true class is 1 and the predicted class is 0. TN counts the instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct: accuracy = (TP + TN)/total. Recall measures the percentage of all relevant results that are correctly classified: recall = TP/(TP + FN). For multi-class classification the confusion matrix is larger, but the idea is the same.

                             Actual Positive (1)   Actual Negative (0)
    Predicted Positive (1)           TP                    FP
    Predicted Negative (0)           FN                    TN

Table 5.1: The confusion matrix.
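These metrics are available directly in scikit-learn; note that sklearn's confusion matrix is oriented with rows as the actual class (transposed relative to Table 5.1). The toy labels below are illustrative.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
```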


Figure 5.1: The process of the RQE and NLI tasks. The upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, evaluate the model on CHQs with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar question. Step 4, attach the generated labels to the question pairs. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test inputs are question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate p-values, which help determine the significance of the results in relation to the null hypothesis. If a p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that they can. Before obtaining a p-value, a t-test needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration:

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
\]

where

\[
s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
\]
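This pooled-variance unpaired t-test is what scipy's ttest_ind computes; as a sanity check, one can reproduce the TF-IDF comparison from Tables 6.1 and 6.2 (scipy returns a two-sided p-value, which is halved for the one-sided hypothesis):

```python
from scipy import stats

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]  # Table 6.1, CHQs column
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # Table 6.1, CHQs+TF-IDF

t, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs)  # pooled variance by default
print(t, p_two_sided / 2)  # one-sided p-value, close to the 0.0380 in Table 6.2
```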

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use with Word2Vec was trained on PubMed, whose sources comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch exists between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec are the smallest. Furthermore, the small amount of added data might not help fine-tune the model as well as a larger amount would.

On the other hand, recall measures the percentage of all relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

                  CHQs    CHQs+TF-IDF   CHQs+BM25   CHQs+Word2Vec
    Iteration 1   0.9682     0.9709       0.9693        0.9710
    Iteration 2   0.9655     0.9731       0.9708        0.9699
    Iteration 3   0.9676     0.9742       0.9747        0.9710
    Iteration 4   0.9682     0.9692       0.9709        0.9726
    Iteration 5   0.9731     0.9775       0.9743        0.9715
    Average       0.9685     0.9720       0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task.

                Added data count   p-value
    TF-IDF            8084          0.0380
    BM25              7390          0.0499
    Word2Vec          1830          0.0514

Table 6.2: The amount of added data and the p-values for the RQE task.

Table 6.4 shows the amount of added data and the p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest amount is only 1,287, whereas more than 8,000 pairs were added in the RQE task. There are two conditions governing how we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced, otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base when selecting added data. For example, if the results contain 10 pairs with label 0, 5 pairs with label 1, and 15 pairs with label 2, we take 5 as the number per class when generating added data for the NLI task. Moreover, NLI is a three-label classification problem, so compared with binary classification, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value becomes. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; the two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287), so adding the extra data to the original data effectively adds novel data, or noise, for the model. The estimator's ability is therefore affected, causing high variance.

On the other hand, due to limited hardware we only split the data into 5 parts in k-fold cross-validation, so we obtain five accuracies for each method. Such a small sample size cannot represent the p-value very accurately, because the more samples there are, the more accurate the p-value is.

                  MedNLI   MedNLI+TF-IDF   MedNLI+BM25   MedNLI+Word2Vec
    Iteration 1   0.7527       0.7601         0.7562          0.7453
    Iteration 2   0.7522       0.7426         0.7426          0.7515
    Iteration 3   0.7504       0.7469         0.7554          0.7540
    Iteration 4   0.7402       0.7593         0.7536          0.7430
    Iteration 5   0.7640       0.7657         0.7586          0.7569
    Average       0.7523       0.7550         0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task.

                Added data count   p-value
    TF-IDF            1287          0.3340
    BM25              1149          0.4191
    Word2Vec           459          0.6661

Table 6.4: The amount of added data and the p-values for the NLI task.


                        Standard deviation of loss
    MedNLI                       0.01392
    MedNLI+TF-IDF                0.02166
    MedNLI+BM25                  0.01726
    MedNLI+Word2Vec              0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges: the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; within it there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions similar to the unanswered questions. Intuitively, to decide whether two questions are similar, we could pair each unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we have no ground truth for evaluating the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find the proposed method can construct high-quality question pairs with the BM25 and



TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role there.

7.2 Future Work

Because of the limited time of the project, there are many places that could be improved:

• For this project we only have 1,950 questions without answers, and based on the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model that easily overfits on small data sets. So for the next step we can expand our data; the validation of our idea would then be more persuasive.

• In the previous chapter we argued that the added data for the NLI task differ in kind from the original NLI data. For the next step we can find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step we can optimize the way Word2Vec is used.

• Fine-tuning and label generation are done in Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. For the next step we can do hyperparameter optimization; we can also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix 1: Project Description and Contract

From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders that store the source data and the generated data, the .py files, and the .ipynb files.

Folders:
data: includes the original data files and the data files generated using RQE and NLI.
health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. The framework in this folder is provided by Scrapy¹.

raw_nli_data and raw_rqe_data store the source data; result stores the labels generated when testing the RQE and NLI models.

.py files:
bm25.py, tfidf.py and word_embedding.py are the three files that filter similar question pairs; their results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py contain the feature functions and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface³.

load_file.py converts the raw source data to dataframes; pre_processed_data pre-processes the sentences.

.ipynb files:

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and performs k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab⁴.

⁴Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.



Chapter 2

Background

In this chapter we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help readers understand this thesis. Besides, the meaning and importance of RQE and NLI are introduced as well, so that readers can have a basic idea of the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how a computer can understand human language or model language. NLP is a broad area and includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech and dialogue. For this project we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between words. We introduce three different language model architectures to help readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

Neural Network Language Model (NNLM) is a central model in NLP and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: input, projection, hidden and output layers. At the input layer, the N previous words are encoded using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected onto the projection layer by a shared projection matrix of size V × m, where m is the dimension of the feature vector associated with each word. After projection, each word is a row vector of dimensionality m. This is shown as $C(w_{t-n+1}), \ldots, C(w_{t-2})$ and $C(w_{t-1})$ in Figure 2.1.



Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003].

As for the hidden layer, it is the concatenation of the word features from the previous layer:

$$x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))$$

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

where $y_i$ is the unnormalized log-probability for each output word $i$:

$$y = b + Wx + U \tanh(d + Hx)$$

where $b$ and $d$ are the biases and $H$ is the hidden layer weight matrix [Bengio et al., 2003].


2.1.2 RNN and LSTM

Instead of the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the information a neuron receives at time-step t depends on the output at time-step t−1. In other words, at each step the output of the current time-step depends on the previous one. RNN takes sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradients vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients are passed back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, these continuous multiplications may cause the gradients to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units, called memory cells, to store information for a long time. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$$i_t = \sigma(W_{xi}x_t + W_{hi}h_{t-1} + W_{ci}c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf}x_t + W_{hf}h_{t-1} + W_{cf}c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc}x_t + W_{hc}h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo}x_t + W_{ho}h_{t-1} + W_{co}c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate and a cell activation vector, represented by $i$, $f$, $o$ and $c$ respectively. The weight matrices $W$ from the cell to the gate vectors are diagonal, so element $m$ in each gate vector only receives input from element $m$ of the cell vector [Graves et al., 2013].

Within a memory cell, an LSTM step can be divided into four parts (a small sketch of one full step follows the list):

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, $i_t$ determines what information we will update; then a new candidate value $\tilde{c}_t$ is created to be added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$ and the current candidate; namely, $c_t = f_t c_{t-1} + i_t \tilde{c}_t$.


Figure 2.2: A single memory cell [Graves et al., 2013].


• As for the output gate, the sigmoid function determines what part of the cell state will be output. Then the tanh function is applied to produce the decided part.
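To make the four steps concrete, here is a minimal NumPy sketch of one LSTM step implementing the composite functions above. All weights are random placeholders rather than trained parameters, and the peephole weights are kept as vectors since the cell-to-gate matrices are diagonal.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    # input, forget and output gates plus the cell update, as in the equations above
    i_t = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["wci"] * c_prev + p["bi"])
    f_t = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["wcf"] * c_prev + p["bf"])
    c_t = f_t * c_prev + i_t * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o_t = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["wco"] * c_t + p["bo"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(d_h, d_in)) for k in ["Wxi", "Wxf", "Wxc", "Wxo"]}
p.update({k: rng.normal(size=(d_h, d_h)) for k in ["Whi", "Whf", "Whc", "Who"]})
p.update({k: rng.normal(size=d_h) for k in ["wci", "wcf", "wco", "bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)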

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder, with a stack of N = 6 identical


layers. Instead of the two sub-layers in the encoder, the decoder has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer model consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should be paid more attention. For each token, three vectors (query, key and value) are generated during training by multiplying learned weights with the input embedding. The attention function is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the queries and keys. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O$$
$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

and $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. The

benefit is that it allows the model to learn subspace representations and helps the model focus on different positions. The final embedding is no longer just the word


anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model uses neither recurrence nor convolution.
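As an illustration of the attention function above, here is a small NumPy sketch of scaled dot-product attention; Q, K and V are random placeholders standing in for the learned projections of the input embeddings.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # relatedness between every pair of tokens
    return softmax(scores) @ V        # attention-weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 5, 8, 8
Q, K, V = (rng.normal(size=(seq_len, d)) for d in (d_k, d_k, d_v))
out = attention(Q, K, V)              # shape: (seq_len, d_v)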

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters. However, computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or calculating word embeddings for their own data sets in their own way. Generally, word embedding can be separated into two kinds: static word embedding and dynamic word embedding. In the remaining part we introduce several different word embedding approaches based on their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two introduced architectures, which can be used as part of Word2Vec to learn word embeddings, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected in the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target. This helps the model predict based on the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both models are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013].
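As a minimal sketch of how these two architectures are used in practice, the gensim library (version 4 or later is assumed here) exposes both through a single sg flag; the two-sentence corpus is a toy placeholder.

from gensim.models import Word2Vec

corpus = [["patient", "has", "allergies"],
          ["allergies", "cause", "swelling"]]
cbow = Word2Vec(corpus, vector_size=200, window=5, min_count=1, sg=0)  # sg=0: CBOW
skip = Word2Vec(corpus, vector_size=200, window=5, min_count=1, sg=1)  # sg=1: Skip-gram
print(cbow.wv["allergies"][:5])  # first entries of a 200-dimensional word vector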

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embeddings on separate local context windows instead of on global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures the global corpus statistics directly.

The main intuition of GloVe comes from the simple observation that, for encoding


some form of meaning, the ratios of word-word co-occurrence probabilities have potential. Before analyzing Figure 2.5, some notation is established. $X$ represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$ within the specific context windows. $X_i$ is the number of times any word occurs in the context of word $i$, i.e., $X_i = \sum_k X_{ik}$. Besides, $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word $j$ appears in the context of word $i$. In Figure 2.5, suppose $i$ = ice and $j$ = steam; we can

Figure 2.5: The co-occurrence probabilities for the target words ice and steam with selected context words [Pennington et al., 2014].

study the ratio of their co-occurrence probabilities with different context words $k$ to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model with the cost function

$$J = \sum_{i,j=1}^{V} f(X_{ij})\,(w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$$

where $V$ is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i$ is a word vector, $\tilde{w}_j$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, $f(0) = 0$ if one word does not co-occur with another word, namely $X_{ij} = 0$. Second, this function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max}$ is 100 and $\alpha$ is 3/4. More detailed information can be found in Pennington et al. [2014].


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output from unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. As a result, the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend of moving from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training will be briefly mentioned, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions in these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, while the improvement from using a pre-training strategy is impressive for deep models. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that the language model learns contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small number of labeled examples.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model not only needs to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] propose a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, with up to 20% error reductions compared with the other state of the art.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word. The backward pass is similar, except the state contains information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layer are fed into the higher layer. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token $t_k$, there are $2L+1$ representations in an $L$-layer biLM:

$$R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R$ into a single vector based on particular weights:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of the token embedding, segment embedding and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the


fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this relation is hard to capture from language modeling directly without knowing the meaning of the two sentences. By using NSP, the pre-trained model becomes very beneficial for these tasks. This pre-trained BERT model is applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
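As a short illustration of the input format described above, the huggingface transformers tokenizer packs a sentence pair with [CLS] and [SEP] and produces the segment ids that feed the segment embedding; the two questions are toy examples.

from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("is allergex safe in pregnancy", "can i take allergex while pregnant")
print(tok.convert_ids_to_tokens(enc["input_ids"]))  # [CLS] ... [SEP] ... [SEP]
print(enc["token_type_ids"])                        # segment ids: 0s for A, 1s for B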

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. Or, in plain English, it is to determine whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) or negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take Allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to Allergex tablets. I have allergies and have managed to control them using Allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using Allergex preconceptionally and/or during pregnancy.

• Question A → Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes type 2. Which is better, honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multilabel classification problem with three classes: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and get muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial, and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce some work using RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3 we demonstrate some work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, the applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), targeted at the medical domain. In this section and the next we discuss some work applied to RQE and NLI among the competition teams.

The initial work for RQE used lexical and semantic features as the input to supervised machine learning approaches, then applied support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to find the results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish the baselines; among them, InferSent performs best. In the competition, as in the RQE task, the common approach is to train MT-DNN and BERT family models and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to gain a better linguistic understanding. For the feature encoder they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it not only costs money but also takes time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results are sometimes too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used for re-weighting, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model is prone to overfitting, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health241 is a good choice for construction. This is South Africa's leading consumer health website. It can provide

1 Health24: https://www.health24.com



world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients that helps answer their questions. These questions and answers are both public and are continuously updated. Hence we use them to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and other users' answers.
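A minimal sketch of the kind of spider used for this crawl is shown below; the entry URL and the CSS selectors are illustrative assumptions, not the actual Health24 page markup.

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/experts"]  # hypothetical entry page

    def parse(self, response):
        # one question-answer block per item (selectors are assumptions)
        for qa in response.css("div.question-answer"):
            yield {
                "question": qa.css("div.question::text").get(),
                "answer": qa.css("div.answer::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)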

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file the first column is the answer and the second column is the question. For the crawled data we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together: they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to the patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1: The length of questions and answers.


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also mention the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g. expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
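The pipeline above can be sketched with NLTK as follows (the stop-word list, tokenizer and Porter stemmer are standard choices; the exact tools used in the project may differ).

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
stops = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question):
    # lowercase, strip punctuation, tokenize, drop stop words, stem
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(t) for t in word_tokenize(text) if t not in stops]

print(preprocess("Is it safe to take Allergex while pregnant?"))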

5.2 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both



classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets and use cross-validation. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface1.
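A minimal sketch of this setup with the transformers library is shown below. The hub model name for SciBERT is an assumption (the project uploads a local scibert_scivocab_uncased folder instead), and the training loop is only outlined in comments.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

model_name = "allenai/scibert_scivocab_uncased"  # assumed hub copy of SciBERT
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # 2 for RQE, 3 for NLI

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accumulation_steps = 4  # gradient accumulation, batch size 16 per step

# inside the training loop:
#   loss = model(**batch).loss / accumulation_steps
#   loss.backward()
#   step and zero the optimizer every `accumulation_steps` batches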

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help discriminate the relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for a second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance

1 Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels to the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of always taking the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
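The rejection rule can be sketched in a few lines; the probabilities here are toy softmax outputs of the fine-tuned models.

import numpy as np

def accept_label(probs, threshold):
    # keep the top label only if its probability clears the threshold
    top = int(np.argmax(probs))
    return (top, True) if probs[top] >= threshold else (top, False)

print(accept_label(np.array([0.55, 0.45]), threshold=0.6))        # rejected RQE pair
print(accept_label(np.array([0.50, 0.30, 0.20]), threshold=0.4))  # accepted NLI pair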

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other


one is inverse document frequency, which measures how much information the word provides. If the result is close to 0, the word is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$
$$\mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d))$$
$$\mathrm{IDF}(t, D) = \log\frac{N}{\mathrm{count}(d \in D : t \in d)}$$

where $t$ is the word of interest, $d$ is its document, $D$ is the set of documents and $N$ is the total number of documents. TF-IDF only measures one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}$$

where $A$ and $B$ are both vectors. In this case, each element in vectors $A$ and $B$ represents one word's TF-IDF value.
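The filtering step can be sketched with scikit-learn as below (sklearn's TF-IDF variant differs slightly from the formula above, but the idea is the same): score every answered question against one unanswered question and keep the top five.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["question about allergex in pregnancy",
            "question about flu shots and egg allergy"]  # toy corpus
unanswered = ["is allergex safe while pregnant"]

vec = TfidfVectorizer()
A = vec.fit_transform(answered)            # one TF-IDF vector per answered question
q = vec.transform(unanswered)
scores = cosine_similarity(q, A).ravel()
top5 = scores.argsort()[::-1][:5]          # indices of the best-matching questions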

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D)\, \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where:

• $D$ is the document and $Q$ is the query; $q_i$ is a term in the query $Q$;

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document $D$;

• $|D|$ is the length of the document;

• $d_{avg}$ is the average length of a document;

• $k_1$ and $b$ are both hyper-parameters; in general, $k_1 = 2$ and $b = 0.75$ (a small usage sketch follows this list).
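One common implementation is the rank_bm25 package; here is a small sketch with the hyper-parameters above and a pre-tokenized toy corpus (this is illustrative, not necessarily the exact code in bm25.py).

from rank_bm25 import BM25Okapi

answered = [["allergex", "pregnancy", "safety"],
            ["flu", "shot", "egg", "allergy"]]  # pre-tokenized toy corpus
bm25 = BM25Okapi(answered, k1=2.0, b=0.75)
scores = bm25.get_scores(["allergex", "safe", "pregnant"])
top5 = scores.argsort()[::-1][:5]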

5.4.3 Word2Vec

We introduced Word2Vec, a static word embedding method, in Section 2.3.1. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury,


2018]. For each sentence, we add the embedding vectors of all its words together and then use cosine similarity to calculate the similarity between the sentences.
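A sketch of this scoring step is shown below; the embedding file path is a placeholder for the Chiu et al. PubMed vectors, and out-of-vocabulary words are simply skipped (which is exactly the weakness discussed in Chapter 6).

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)  # placeholder path

def sentence_vector(tokens):
    vecs = [wv[t] for t in tokens if t in wv]  # skip out-of-vocabulary words
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

sim = cosine(sentence_vector(["bladder", "pain"]), sentence_vector(["urine", "burning"]))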

5.5 Evaluation Metrics

For the evaluation measures, we use accuracy, recall and the confusion matrix to assess the performance of the models. The confusion matrix is a table describing the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align to the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). As for multi-category classification, the confusion matrix is more complex, but it follows the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)   TP                    FP
Predicted Negative (0)   FN                    TN

Table 5.1: The confusion matrix.


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the question pairs with labels to CHQs and evaluate the model again, then compare the accuracy with the previous one. NLI follows the same process, except that the test inputs are question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This method helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-test needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
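As a sketch, the test can be run with scipy (version 1.6 or later for the alternative argument) on the per-iteration accuracies from Table 6.1; scipy's default is two-sided, so the directional hypothesis uses alternative="less".

from scipy.stats import ttest_ind

chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]
t, p = ttest_ind(chqs, chqs_tfidf, alternative="less")  # H1: added data raise accuracy
print(t, p)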

From Table 6.2 we can see only the results of TF-IDF and BM25 are less than 0.05. In this case we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the Word2Vec value is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use for Word2Vec is trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all



formal, with fewer typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods on the RQE task.

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added pairs and the p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data is more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their


relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show there are 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of each class for the added data in the NLI task. On the other hand, NLI is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 demonstrates the standard deviations of the loss for the four methods. When combining Table 6.4 with this, we find that the more added data there are, the larger the value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So as we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, the small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods on the NLI task.

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added pairs and the p-values for the NLI task.


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve this project.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set that can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; it contains 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and



TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data. However, the constructed data still play a role in the NLI task.

7.2 Future Work

Because of the limited time of the project, there are many aspects that could be improved:

• For this project, we only have 1950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes smaller. Besides, BERT is a complex model; it is easy to cause overfitting by using a small data set. So for the next step, we can expand our data. Then it will be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task differ from the original NLI data. For the next step, we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done by using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchangxing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yulin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchangxing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yulin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy 1.

raw_nli_data and raw_rqe_data store the source data. result stores the labels generated by using the RQE and NLI test models.

For .py files:
bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf 2.

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface 3.

load_file.py converts the raw source data to a dataframe; pre_processed_data pre-processes the sentences.

For .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1 The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab 4.

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second last cell. If you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. label.csv files are the results of the generated labels for pairs. They can also be found in the result folder.


Chapter 2

Background

In this chapter, we briefly introduce the background of NLP with some basic models. In terms of the project, different types of word embedding architectures are explained in order to help the readers understand this thesis. Besides, the meaning and the importance of RQE and NLI are introduced as well, so that the readers can have a basic idea about the project.

2.1 Background on NLP

The core knowledge behind this thesis is natural language processing (NLP). NLP is a sub-field of linguistics and artificial intelligence. Its aim is to program computers to analyze and process large amounts of natural language data. In other words, NLP is about how computers can understand human language or model language. NLP is a broad area and it includes many branches; here we list the most commonly researched tasks: syntax, semantics, discourse, speech and dialogue. For this project, we mainly focus on semantics. Meanwhile, language models are a crucial part of NLP because they describe the relationships between the words. We introduce three different language model architectures to help the readers have a better understanding of word embedding in the next section.

2.1.1 Neural Network Language Model

Neural Network Language Model (NNLM) is the central model in NLP and many other models are based on it. Basically, the goal of NNLM is to predict the probability of the next word given the previous words in a text [Bengio et al., 2003]. The model is composed of four layers: input, projection, hidden and output layers. At the input layer, N previous words are encoded by using one-hot encoding. Each word has V dimensions, where V is the size of the vocabulary. The input layer is projected into the projection layer by using a shared projection matrix, whose size is V × m, where m is the dimension of the feature vector associated with each word. After projection, each word is a row vector of dimensionality m. This is shown as C(w_{t-n+1}), ..., C(w_{t-2}) and C(w_{t-1}) in Figure 2.1.



Figure 2.1 NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer and output layer [Bengio et al., 2003]

As for the hidden layer, it is the concatenation of word features from the previous layer:

x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1}))

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

P(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}

The function y_i gives the unnormalized log-probabilities for each output word i:

y = b + Wx + U \tanh(d + Hx)

where b and d are the biases and H is the hidden layer weight [Bengio et al., 2003].


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where the information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons receive information at time-step t that depends on the output at time-step t-1. In other words, at each step, the output of the current time-step depends on the previous one. RNN has sequential inputs and, ideally, it can learn not only the current state of the node but also the previous state. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time. When the gradients pass back, they are calculated by continuous multiplications of derivatives. If the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It has considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units to store information for a long time: memory cells. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)

c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)

h_t = o_t \tanh(c_t)

where σ is the sigmoid function. A single memory cell consists of the input gate, forget gate, output gate and cell activation vectors, represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so the input from element m can only be received by element m in each gate vector [Graves et al., 2013].

Within a memory cell, the LSTM computation can be divided into four steps, sketched in code after this list:

• Based on the current input and h_{t-1}, f_t determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise, the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, it determines what information we will update; then a new candidate c̃_t is created to be added to the state.

• Update c_t by using the value in f_t, the old state, the value in i_t and the current state; namely, updating c_t = f_t c_{t-1} + i_t c̃_t.

• As for the output gate, use the sigmoid function to determine what part of the cell state will be outputted. Then apply the tanh function to output the decided part.

Figure 2.2 A single memory cell [Graves et al., 2013]
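To make the gate equations concrete, here is a minimal NumPy sketch of a single memory-cell update. The peephole terms (W_ci, W_cf, W_co) from the equations above are omitted for brevity, and the packed weight layout is an assumption of this sketch:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps [x_t; h_prev] to the four stacked gate pre-activations
    z = W @ np.concatenate([x_t, h_prev]) + b
    d = h_prev.size
    i_t = sigmoid(z[0*d:1*d])        # input gate
    f_t = sigmoid(z[1*d:2*d])        # forget gate
    g_t = np.tanh(z[2*d:3*d])        # candidate cell state (the new information)
    o_t = sigmoid(z[3*d:4*d])        # output gate
    c_t = f_t * c_prev + i_t * g_t   # forget part of the old state, add the new
    h_t = o_t * np.tanh(c_t)         # expose the decided part of the cell state
    return h_t, c_t

# Toy usage: input size 3, hidden size 2 (weights left at zero for illustration)
x_t, h, c = np.ones(3), np.zeros(2), np.zeros(2)
W, b = np.zeros((8, 5)), np.zeros(8)
h, c = lstm_step(x_t, h, c, W, b)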

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure, we can see the model is composed of two parts: the encoder and the decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other one is a position-wise fully connected feed-forward network. So the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder, with a stack of N = 6 identical layers. Instead of the two sub-layers in the encoder, the decoder has three. The extra one is masked multi-head attention, and this can prevent the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3 The Transformer consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017]

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and find out which one should receive more attention. For each token, the three parameters (query, key and value) are generated during training by multiplying the weights with the input embedding. The attention function is

Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to d_k, d_k and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, \dots, head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q \in R^{d_{model} \times d_k}, W_i^K \in R^{d_{model} \times d_k}, W_i^V \in R^{d_{model} \times d_v} and W^O \in R^{h d_v \times d_{model}}. The benefit is that it allows the model to do subspace representation learning and helps the model focus on different positions. The final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of the token, because the model does not use recurrence or convolution.
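To make the computation concrete, here is a minimal NumPy sketch of scaled dot-product attention matching the formula above (the toy shapes and random weights are illustrative only):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

# Toy example: 3 tokens with d_model = 4, one attention head
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = [rng.normal(size=(4, 4)) for _ in range(3)]
out = attention(X @ W_q, X @ W_k, X @ W_v)   # shape (3, 4)

Multi-head attention simply repeats this with h independent projection matrices and concatenates the head outputs before the final W^O projection.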

2.2 Word Embedding

When dealing with natural language, it is always problematic that the text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own ways to calculate the word embeddings for their own data sets. Generally, word embedding can be separated into two kinds: static word embedding and dynamic word embedding. For the remaining part, we introduce several different word embedding approaches based on their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two architectures, which can be used as part of Word2Vec to learn the word embedding, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target. This helps the model to predict based on the context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. Both models are shown in Figure 2.4.

Figure 2.4 Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013]
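As a short illustration of the two architectures, the gensim library (4.x API assumed; the toy sentences are invented) exposes both via a single flag:

from gensim.models import Word2Vec

sentences = [
    ["patient", "asks", "about", "allergy", "medication"],
    ["doctor", "recommends", "antihistamines", "for", "allergy"],
]

# sg=0 trains CBOW (predict the current word from its context);
# sg=1 trains Skip-gram (predict the context from the current word).
cbow     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["allergy"][:5])               # 50-dimensional word vector
print(skipgram.wv.most_similar("allergy"))  # nearest neighbours by cosine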

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on the global co-occurrence counts. This may fail to use the huge amount of repetition in the data. GloVe, by contrast, captures the global corpus statistics directly.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_{ij} is the number of times word j occurs in the context of word i within the specific context windows. X_i is the number of times any word occurs in the context of word i, namely \sum_k X_{ik}. Besides, P_{ij} = P(j|i) = X_{ij}/X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can

Figure 2.5 The measurement of co-occurrence probabilities for selected context words and the target words ice and steam [Pennington et al., 2014]

study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where V is the size of the vocabulary, f(X_{ij}) is a weighting function, w_i \in R^d is the word vector, \tilde{w}_j \in R^d is the context word vector, and b_i and \tilde{b}_j are biases. The weighting function has the following properties. First, f(0) = 0, so that a word that does not co-occur with another word (X_{ij} = 0) contributes nothing. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}

where x_{max} is 100 and \alpha is 3/4. More detailed information can be found in Pennington et al. [2014].
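The weighting function is simple enough to state directly in code; a small sketch with the values above:

def glove_weight(x, x_max=100, alpha=0.75):
    # f(0) = 0, non-decreasing below x_max, capped at 1 for frequent pairs
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in [0, 10, 50, 100, 500]:
    print(x, round(glove_weight(x), 3))   # 0.0, 0.178, 0.595, 1.0, 1.0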


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the contexts. Besides, these traditional word embedding methods usually use shallow models and only leverage the output from unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This means the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding. Dynamic word embedding considers the word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing these two methods, pre-training is briefly mentioned, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer to use deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, but the improvement is impressive for deep models when using a pre-training strategy. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since not only does the model need to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address the mentioned challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice; it can achieve up to 20% error reductions compared with other state-of-the-art models.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding for each word. In other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains the information about itself and the context before this word. The backward pass is similar, except it contains the information about the word itself and the context after it. The pair of forward and backward information produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6 ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation can be obtained by summing all the intermediate word vectors with some particular weights [Joshi]

In addition, these forward passes and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. The basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token t_k, there are 2L + 1 representations in an L-layer biLM:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \}

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in R into a single vector based on some particular weights:

ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, \gamma^{task} is a scalar parameter, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}].

2.4.3 BERT

Currently, there are two approaches to apply pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training and have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the current fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of the token embedding, segment embedding and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains for a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine if one sentence can be inferred from another. This is hard to capture from language modeling directly without knowing the relation between the two sentences. By using NSP, the pre-trained model is very beneficial for these tasks. The introduced pre-trained BERT model will be applied to the RQE and NLI tasks in this project. As for how to fine-tune, we introduce that in the following chapters.
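As an illustration of this input format, the following sketch packs a question pair into the [CLS] ... [SEP] ... [SEP] form with the huggingface transformers tokenizer (a recent transformers version is assumed; the example questions are invented):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
q_a = "Is it safe to take allergex during pregnancy?"
q_b = "What are the dangers of using allergex while pregnant?"

enc = tokenizer(q_a, q_b, padding="max_length", truncation=True, max_length=64)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"])[:12])
print(enc["token_type_ids"][:12])  # segment embedding: 0 = first sentence, 1 = second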

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. Or, in plain English, it is to determine if question A and question B are similar questions. This task is binary classification: it includes positive (two questions are similar) and negative (two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A entails Question B.

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better, Honey or sugar?

• Question A does not entail Question B.

RQE is the first step in our project, and using it here helps us to find the similar questions related to the unanswered questions. Also, this step provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sentences, not just between questions. The task is to predict if a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from the given premise (P) [Romanov and Shivade, 2018]. This is a multi-class classification (three classes) problem: one class is entailment, one is neutral, and the other one is contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained, it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter, we introduce the works which are related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3, we present some work about how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, the applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks: NLI, RQE and question answering (QA), and the target is the medical domain. In this section and the next section, we discuss some work applied to RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48, respectively, to find the results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models, such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map the words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism based on BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models across different tasks. A mismatch exists between the training and validation sets in the competition data sets; for example, the entailing examples have high lexical overlap and the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data which are similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain occurs when Romanov and Shivade [2018] introduce MedNLI, a new and public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish the baselines. Among them, InferSent performs better compared with the other three methods. In the competition, as with the RQE task, the common approach for training is to use the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder, they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder, they use Tree-LSTM to have a better linguistic understanding. For the feature encoder, they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medical domain, since it takes not only money but also time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results are sometimes so subjective that they affect the quality of the data [Romanov and Shivade, 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets by using weak supervision methods. The results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracy is used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge area


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations. For example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model more easily overfits, which affects the accuracy.

• The data set should not have any sensitive information, such as addresses, dates of birth, email addresses or phone numbers. Otherwise, it will violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. This is South Africa's leading consumer health website. It can provide

1 Health24: https://www.health24.com



world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public and continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data set

For collecting the data from the Health24 website, we use Scrapy 2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and other users' answers.
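A minimal sketch of such a spider follows; the entry URL and CSS selectors are hypothetical and do not reflect Health24's real markup:

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/Experts"]  # hypothetical entry page

    def parse(self, response):
        # hypothetical selectors for question-answer blocks
        for qa in response.css("div.question-answer"):
            yield {
                "question": qa.css("div.question::text").get(),
                "answer": qa.css("div.answer::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)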

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago", and converting "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see the average lengths of the three different types are close together: they are all around 85 to 95 words per item. However, the maximum length has a huge difference. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should do and not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1 The length of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but they all stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expand BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their bases or roots. Through these steps, the raw questions can be transformed into a form more suitable for the following tasks.
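A sketch of this pre-processing pipeline, assuming NLTK (which the project installs; the punkt and stopwords resources must be downloaded first):

# Requires: nltk.download('punkt'); nltk.download('stopwords')
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(question):
    text = question.lower()                                            # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = word_tokenize(text)                                       # tokenize
    tokens = [t for t in tokens if t not in stop_words]                # remove stop words
    return [stemmer.stem(t) for t in tokens]                           # stem to roots

print(preprocess("My son was diagnosed with some allergies a few weeks ago."))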

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sentence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both



classification problems. The only difference is that one is binary classification and the other one is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets; for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased towards a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since this model improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation step to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface 1.
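A condensed, illustrative sketch of one training step with these hyperparameters follows (adapted in spirit from huggingface's run_classifier.py; the toy batch and the use of BertForSequenceClassification are assumptions of this sketch, not the project's exact code):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = BertForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 2 labels for RQE, 3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy "batch" of question pairs with a binary entailment label
pairs = [("Is allergex safe in pregnancy?",
          "Can I take allergex while pregnant?")]
batch = tokenizer([p[0] for p in pairs], [p[1] for p in pairs],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1])

model.train()
outputs = model(**batch, labels=labels)
(outputs.loss / 4).backward()  # gradient accumulation step = 4
optimizer.step()
optimizer.zero_grad()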

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate if the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for the second round of retrieval by extracting expansion terms from the top-ranked documents in the first round. Through this step, the overall performance

1 Huggingface GitHub: https://github.com/huggingface/transformers


In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes or not. One intuitive way to build the test set is to pair each unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: we assume that if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferred from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning labels by the maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us to resolve some ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
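As an illustration, the rejection rule can be written in a few lines; the probability vectors here are assumed to come from a softmax over the fine-tuned model's logits.

import numpy as np

def accept_or_reject(probs, threshold):
    """Return the predicted label, or None to reject an ambiguous pair."""
    label = int(np.argmax(probs))
    return label if probs[label] >= threshold else None

# RQE (binary): threshold 0.6; NLI (three labels): threshold 0.4
print(accept_or_reject(np.array([0.55, 0.45]), 0.6))     # None (rejected)
print(accept_or_reject(np.array([0.2, 0.1, 0.7]), 0.4))  # 2 (accepted)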

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other is inverse document frequency, which calculates how much information the word provides; if the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF. Here are the formulas:

$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$

$$\text{TF}(t, d) = \log(1 + freq(t, d))$$

$$\text{IDF}(t, D) = \frac{N}{\text{count}(d \in D : t \in d)}$$

where $t$ is the word we want to calculate, $d$ is its document, $D$ is the set of documents, and $N$ is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

$$\text{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}$$

where $A$ and $B$ are both vectors. In this case, each element in vectors $A$ and $B$ represents one word's TF-IDF.
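A small sketch of this filtering step, using scikit-learn's TfidfVectorizer as an illustration (the project's tfidf.py is based on a gensim implementation instead); the question lists are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["question text 1", "question text 2", "question text 3"]  # placeholders
unanswered = ["an unanswered question"]

vectorizer = TfidfVectorizer()
m_answered = vectorizer.fit_transform(answered)     # one TF-IDF vector per question
m_unanswered = vectorizer.transform(unanswered)

sims = cosine_similarity(m_unanswered, m_answered)  # similarity(A, B) as above
top5 = sims.argsort(axis=1)[:, ::-1][:, :5]         # indices of the top-five matches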

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i, D) \, \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where

• $D$ is the document and $Q$ is the query; $q_i$ is a term in the query $Q$.

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document $D$.

• $|D|$ is the length of the document.

• $d_{avg}$ is the average length of a document.

• $k_1$ and $b$ are both hyper-parameters; in general, $k_1 = 2$ and $b = 0.75$ (a code sketch follows this list).
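Below is a self-contained sketch of BM25 scoring under the hyper-parameters above; note that it uses the common smoothed IDF variant of Okapi BM25, which may differ slightly from the IDF defined in Section 5.4.1.

import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, d_avg, k1=2.0, b=0.75):
    """Score one document against a query. doc_freq maps a term to the number
    of documents containing it; d_avg is the average document length."""
    tf = Counter(doc_terms)
    score = 0.0
    for q in query_terms:
        if q not in tf or q not in doc_freq:
            continue
        idf = math.log(1 + (n_docs - doc_freq[q] + 0.5) / (doc_freq[q] + 0.5))
        num = tf[q] * (k1 + 1)
        den = tf[q] + k1 * (1 - b + b * len(doc_terms) / d_avg)
        score += idf * num / den
    return score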

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
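A sketch of this sentence-similarity step, assuming the gensim 4 KeyedVectors API; the embedding file path is a placeholder for the Chiu et al. [2016] vectors.

import numpy as np
from gensim.models import KeyedVectors

# hypothetical local path to the pre-trained PubMed vectors in word2vec format
wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    """Sum the vectors of in-vocabulary words, as described above."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

sim = cosine(sentence_vector("is it safe in pregnancy".split()),
             sentence_vector("dangers of using it during pregnancy".split()))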

5.5 Evaluation Metrics

For the evaluation measures, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted classes and the columns represent the actual classes. TP means the instances that align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances that align with the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total, while recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). For multi-class classification, the confusion matrix is more complex, but it follows the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)   TP                    FP
Predicted Negative (0)   FN                    TN

Table 5.1: The confusion matrix.
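These metrics can be computed directly with scikit-learn, as in the sketch below (note that scikit-learn's confusion matrix puts actual classes on the rows and predicted classes on the columns, the transpose of Table 5.1); the labels are toy values.

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # toy ground-truth labels
y_pred = [1, 0, 0, 1, 1, 1]   # toy model predictions

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)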


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model by using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and its potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the question pairs with labels to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except the input for testing is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This helps us to determine the significance of the results in relation to the null hypothesis: if the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since the value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
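In practice, the unpaired t-test can be computed with SciPy, as sketched below using the CHQs and CHQs + TF-IDF accuracies from Table 6.1; halving the two-sided p-value gives the one-sided test implied by the alternative hypothesis.

from scipy import stats

chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]        # Table 6.1, CHQs
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # Table 6.1, CHQs + TF-IDF

t, p_two_sided = stats.ttest_ind(chqs, chqs_tfidf)     # unpaired, equal variances
p_one_sided = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2
print(t, p_one_sided)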

From Table 6.2, we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis, namely that the generated data can increase the accuracy. Since we mimic the pseudo-relevance feedback approach to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring methods perspective: the pre-trained embedding model which we use for Word2Vec was created by training on PubMed. The sources in PubMed comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors will lack important information, because the words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected; that is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not have much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The added data numbers and p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values for the NLI task. We find that all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number is more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number for each class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So as we add the extra data into the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability would be affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, the small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task.

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The added data numbers and p-values for the NLI task.


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending these techniques into the biomedical domain, such as patient question answering, has many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions that are similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step helps us to roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data to train the model again: if the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to cause overfitting with a small amount of data. So for the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section, we noted that the added data for the NLI task differ from the original NLI data. For the next step, we can find a more suitable data set with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For the folders:
data includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy.¹

raw_nli_data and raw_rqe_data store the source data. result stores the labels generated by using the RQE and NLI test models.

For the .py files:
bm25.py, tfidf.py, and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.²

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.³

load_file.py converts the raw source data to a dataframe; pre_processed_data.py pre-processes the sentences.

For the .ipynb files:

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab.⁴

⁴Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.



Figure 2.1: NNLM consists of four layers; from bottom to top they are the input layer, projection layer, hidden layer, and output layer [Bengio et al., 2003].

As for the hidden layer, it is the concatenation of the word features from the previous layer:

$$x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))$$

and the vector goes through the hyperbolic tangent tanh. Following that, a softmax in the output layer is used to calculate the probability of the next word given the previous words:

$$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

The function $y_i$ gives the unnormalized log-probability for each output word $i$:

$$y = b + Wx + U \tanh(d + Hx)$$

where $b$ and $d$ are the biases and $H$ contains the hidden layer weights [Bengio et al., 2003].
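The following minimal NumPy sketch makes the forward pass concrete; the dimensions and random parameters are illustrative assumptions, not trained values.

import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 1000, 50, 4, 60           # vocab size, feature dim, context size n, hidden units
C = rng.normal(size=(V, m))            # word feature matrix C
H = rng.normal(size=(h, (n - 1) * m))  # hidden layer weights
U = rng.normal(size=(V, h))            # hidden-to-output weights
W = rng.normal(size=(V, (n - 1) * m))  # direct input-to-output weights
b, d = np.zeros(V), np.zeros(h)

def next_word_probs(context_ids):
    x = np.concatenate([C[w] for w in context_ids])  # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + W @ x + U @ np.tanh(d + H @ x)           # unnormalized log-probabilities
    e = np.exp(y - y.max())                          # numerically stable softmax
    return e / e.sum()

probs = next_word_probs([3, 17, 42])  # ids of the three previous words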


2.1.2 RNN and LSTM

Instead of the traditional feed-forward neural network, where the information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons' output at time-step t depends on the output at time-step t-1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has the tendency that the gradient will vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time: when the gradients pass back, they are calculated by continuous multiplications of derivatives, and if the values of the derivatives are too small or too large, these multiplications may cause the gradients to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It performs considerably well in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces some special hidden units to store information for a long time: the memory cells. They connect with each other at the next time-step with particular weights [Graves et al., 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$

$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$

$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$

$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$

$$h_t = o_t \tanh(c_t)$$

where $\sigma$ is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate, and a cell activation vector, represented by $i$, $f$, $o$, and $c$ respectively. The $W$ weight matrices from the cell to the gate vectors are diagonal, so the input from element m can only be received by element m in each gate vector [Graves et al., 2013].

Within a memory cell, LSTM can be divided into four steps (a small code sketch follows this list):

• Based on the current input and $h_{t-1}$, $f_t$ determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise, the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, the input gate determines what information will be updated; then a new candidate cell value is created and added to the state.

• Update $c_t$ by using the value in $f_t$, the old state, the value in $i_t$, and the current candidate; namely, computing $c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$.

• As for the output gate, a sigmoid function determines what part of the cell state will be output. Then a tanh function is applied to output the decided part.

Figure 2.2: A single memory cell [Graves et al., 2013].
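A minimal NumPy sketch of one memory-cell step under the composite functions above; the parameter dictionary p holds the weight matrices and biases, and the diagonal cell-to-gate weights are stored as vectors, hence the elementwise products.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time-step; p maps names like 'Wxi' to arrays."""
    i = sigmoid(p["Wxi"] @ x_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wxf"] @ x_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    c = f * c_prev + i * np.tanh(p["Wxc"] @ x_t + p["Whc"] @ h_prev + p["bc"])
    o = sigmoid(p["Wxo"] @ x_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)
    return h, c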

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure, we can see the model is composed of two parts: the encoder and the decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. So the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented in the sub-layer itself. As for the decoder, the structure is the same as the encoder, with a stack of N = 6 identical layers, except that each layer has three sub-layers instead of two. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer architecture consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017].

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should receive more attention. For each token, three vectors (query, key, and value) are generated during training by multiplying weight matrices with the input embedding. The attention function is

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys, and values with learned linear projections to $d_k$, $d_k$, and $d_v$ dimensions respectively:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(head_1, \ldots, head_h) W^O$$

$$\text{where } head_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$.

The benefit is that it allows the model to do subspace representation learning and helps the model focus on different positions. The final embedding is not just the word anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model does not use recurrence or convolution.
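A small NumPy sketch of the scaled dot-product attention defined above, for a single head; the token count and dimensions are arbitrary.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # QK^T / sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))  # 5 tokens, d_k = d_v = 64
out = scaled_dot_product_attention(Q, K, V)             # shape (5, 64)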

2.2 Word Embedding

When dealing with natural language, a problem always arises: text is composed of words and characters, but computers and machine learning models cannot read and understand these symbols in the human sense, and the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact the model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or in text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own methods to calculate the word embeddings for their own data sets. Word embeddings can be separated into two kinds: static word embeddings and dynamic word embeddings. In the remainder of this section, we introduce several different word embedding approaches based on their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two architectures, which can be used as part of Word2Vec to learn the word embeddings, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM), which is introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus, all the words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model to predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two models are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013].

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), which is introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embeddings on separate local context windows instead of on the global co-occurrence counts, and this may fail to exploit the huge amount of repetition in the data. With GloVe, the global corpus statistics can be captured directly.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established: $X$ represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word j occurs in the context of word i within the specific context windows; $X_i$ is the number of times any word occurs in the context of word i, equally $\sum_k X_{ik}$; and $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word j appears in the context of word i.

Figure 2.5: The co-occurrence probabilities of the target words ice and steam with selected context words [Pennington et al., 2014].

In Figure 2.5, suppose i = ice and j = steam; we can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see that solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $V$ is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i$ is a word vector, $\tilde{w}_j$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, $f(0) = 0$, for when one word doesn't co-occur with another word, namely $X_{ij} = 0$. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max}$ is 100 and $\alpha$ is 3/4. More detailed information can be found in Pennington et al. [2014].
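For concreteness, here is a direct transcription of the weighting function and one term of the cost J; the vectors and counts are placeholders.

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # non-decreasing and capped, so frequent co-occurrences are not overweighted
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_cost_term(w_i, w_j, b_i, b_j, x_ij):
    """One (i, j) term of the weighted least squares cost J."""
    return glove_weight(x_ij) * (np.dot(w_i, w_j) + b_i + b_j - np.log(x_ij)) ** 2

rng = np.random.default_rng(0)
print(glove_cost_term(rng.normal(size=50), rng.normal(size=50), 0.0, 0.0, x_ij=12.0))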


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for the word 'bank', does it mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This causes the situation that the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding, which considers the word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section, we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly discussed, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, whereas the improvement from a pre-training strategy is impressive for deep models. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that a language model learns contextualized text representations on a large corpus of labeled or unlabeled data; the pre-trained model is then fine-tuned on downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: not only does the model need to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show that ELMo works extremely well in practice, achieving up to 20% error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning the embedding for each word. In other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. These two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word. The backward pass is similar, except that it contains information about the word itself and the context after it. Each pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, those forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. The basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token $t_k$, there are 2L+1 representations in an L-layer biLM:

$$R_k = \left\{ x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \;\middle|\; j = 1, \ldots, L \right\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R_k$ into a single vector based on particular weights:

$$\text{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.
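A NumPy sketch of this weighted combination for a single token; the layer representations are random placeholders for the biLM outputs.

import numpy as np

def elmo_combine(layer_reps, s_logits, gamma):
    """Combine the L+1 layer representations h_{k,j}^{LM} of one token.
    s_logits are softmax-normalized into s_j^{task}; gamma is gamma^{task}."""
    s = np.exp(s_logits - s_logits.max())
    s /= s.sum()
    return gamma * (s[:, None] * layer_reps).sum(axis=0)

reps = np.random.default_rng(0).normal(size=(3, 1024))  # token layer + 2 biLM layers
vec = elmo_combine(reps, np.zeros(3), gamma=1.0)        # equal weights initially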

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. In contrast, the Transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. The detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding, and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For the masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to get a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To solve this, BERT replaces a selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this relationship is hard to capture from language modeling directly without understanding both sentences. With NSP, the pre-trained model is very beneficial for these tasks. This pre-trained BERT model is applied to the RQE and NLI tasks in this project; how to fine-tune it is introduced in the following chapters.
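The 80/10/10 corruption rule can be sketched as follows; the tokenizer-level details of real BERT pre-training are omitted and the vocabulary is a placeholder list.

import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Apply the masked-LM corruption: of the selected 15% of positions,
    80% become [MASK], 10% a random token, and 10% stay unchanged."""
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok              # the model must predict the original token
            r = random.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = random.choice(vocab)
            # else: keep the original token
    return out, targets

corrupted, targets = mask_tokens("my dog is hairy".split(), vocab=["the", "cat", "ran"])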

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. In plain English, it determines whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take Allergex? I'm weeks pregnant and allergies are killing me, my face is swollen, pls.

• (Answered) Question B: Hi, my question is with regards to Allergex tablets. I have allergies and have managed to control them using Allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using Allergex preconceptionally and/or during pregnancy.

• Question A entails Question B.

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 type. Which is better, honey or sugar?

• Question A does not entail Question B.

RQE is the first step in our project; using it here helps us to find the answered questions that are similar to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a three-class classification problem: entailment, neutral, and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I have had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are healthy diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and get muscles. Do you advise that I use protein shakes for my weight loss goals?

• Answer: Hi there. There was a time when any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it is obtained, it is advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter, we introduce the works related to our purpose and methods. We also explain their architectures and techniques as well as their contributions.

In Sections 3.1 and 3.2, we introduce work that uses RQE and NLI on medical data sets to find the relations between sentences. In Section 3.3, we discuss work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, the applications can help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE, and question answering (QA), and the target is the medical domain. In this section and the next, we discuss some work applied to RQE and NLI among the competition teams.

The initial work for RQE is to use the lexical and semantic features as the in-put to the supervised machine learning approach Then using support vector ma-chine(SVM) logistic regression naiumlve Bayes and J48 respectively to find the re-sults [Abacha and Demner-Fushman 2016] In the competition the highlight is thatusing the deep networks reach better generalizations and abstractions of the ques-tions The commonly used approaches are to combine the ensemble method and themulti-task language models [Abacha et al 2019] More specifically the teams reliedon the recent state-of-the-art language models such as MT-DNN family ([Bhaskaret al 2019] [Zhu et al 2019]) and BERT family When trying the different tradi-tional deep learning methods Zhu et al [2019] find the result does not perform verywell compared with the pre-trained language model They think one reason is thetraditional deep learning models typically use a fixed pre-trained word embeddingto map the words into vectors and these embeddings do not consider the contextMeanwhile Zhu et al [2019] also find for the pre-trained language model MT-DNNbased model performs better than BERT based model since MT-DNN family is fine-tuned on GLUE benchmark multi-task learning mechanism based on BERT models


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data are created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), the InferSent model and the ESIM model to establish baselines. Among these, InferSent performs better than the other three methods. In the competition, as with the RQE task, the common approach for training is to use the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model of this approach consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results are sometimes too subjective, which affects the quality of the data [Romanov and Shivade 2018]. There are many existing techniques that can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data sets by using weak supervision methods, and the results show that it plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight and combine the labels, producing probabilistic labels. On the other hand, one novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve the examination QA system.




Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way of data collection. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical query data sets have some limitations. For example, the data are too simple to mimic real human behavior, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is more prone to overfitting, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it will violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. This is South Africa's leading consumer health website; it provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public, and they are continuously updated. Hence we use these questions and answers to construct our data set.

1 Health24: https://www.health24.com



4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy2, a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional, while mismatches or misleading content may exist between questions and other users' answers.

4.3 Analysis for Data set

A total of 25705 question-answer pairs are crawled from the Health24 website under 22 categories. Among them, 23755 questions have answers and 1950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (the age is blanked out), and we convert "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together: they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts answer step by step, corresponding to the patients' questions and telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.
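As a concrete illustration, here is a regular-expression sketch of this cleaning step; the thesis does not list its exact rules, so the patterns and the function name clean_text are our assumptions:

```python
import re

def clean_text(text: str) -> str:
    """Blank out hyperlinks and age numbers, mimicking the examples above
    (a sketch; the exact rules used in the thesis are not listed)."""
    text = re.sub(r'\s*\(?https?://\S+\)?', '', text)                 # strip hyperlinks
    text = re.sub(r'\b\d+\s+(?=(?:month|year)s?\s+old\b)', '', text)  # leave ages blank
    return text

print(clean_text("My 9 month old son was diagnosed with some allergies."))
# -> "My month old son was diagnosed with some allergies."
```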

2 Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions          95.40             5417
Unanswered Questions        85.02             1890
Answers                     92.58             6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained as well. We also present the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented by either abbreviations or multi-word expressions, but they all stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we apply word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
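A minimal sketch of this pipeline is shown below, assuming NLTK (the thesis does not name the tokenization library, and the function name preprocess is ours):

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Requires one-off downloads: nltk.download('punkt'); nltk.download('stopwords').
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(question):
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    text = question.lower().translate(str.maketrans('', '', string.punctuation))
    return [STEMMER.stem(t) for t in word_tokenize(text) if t not in STOP_WORDS]

print(preprocess("I am experiencing hot flushes and drink a lot of water every day."))
```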

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sentence. So when solving a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.



When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11232, 1395 and 1422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface1.
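For orientation, here is a compressed sketch of this setup written against the current huggingface transformers API rather than the older run_classifier.py script; the tiny train_pairs list is a stand-in of ours for the real CHQs question pairs:

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = 'allenai/scibert_scivocab_uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)  # 3 for NLI
optimizer = AdamW(model.parameters(), lr=2e-5)

# Stand-in data; labels: 1 = entails, 0 = does not entail.
train_pairs = [("Can HRT cause weight gain?", "Does HRT make you gain weight?", 1),
               ("What is BM25?", "How do I treat a bladder infection?", 0)]

accumulation_steps = 4  # the thesis pairs batch size 16 with 4 accumulation steps
model.train()
for step, (q1, q2, label) in enumerate(train_pairs):
    enc = tokenizer(q1, q2, truncation=True, return_tensors='pt')
    loss = model(**enc, labels=torch.tensor([label])).loss  # loss on the [CLS] head
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```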

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for the second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project we mimic this approach to test whether the generated data have high accuracy.

1 Huggingface GitHub: https://github.com/huggingface/transformers



The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the testing set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity, and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced, otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of assigning the label with the maximum probability outright, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us to handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the absence of experts, the rejected data are removed from the data set.
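A sketch of this rejection rule (the helper name label_with_rejection is ours):

```python
import numpy as np

def label_with_rejection(probs, threshold):
    """Return the argmax label, or None (rejected) if the model is not confident."""
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else None

print(label_with_rejection(np.array([0.55, 0.45]), 0.6))        # None: rejected
print(label_with_rejection(np.array([0.15, 0.25, 0.60]), 0.4))  # 2: accepted
```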

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document. The other


one is inverse document frequency, which calculates how much information the word provides. If the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = N / count(d ∈ D : t ∈ d)

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element of the vectors A and B represents one word's TF-IDF.
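The thesis's tfidf.py is adapted from a gensim-based implementation (see Appendix 2); as a rough equivalent, a scikit-learn sketch with toy sentences of our own looks like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["i have hot flushes and night sweats",
            "my son was diagnosed with allergies"]
unanswered = ["what can i take for hot flushes"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(answered)          # one row per answered question
query_vector = vectorizer.transform(unanswered)
scores = cosine_similarity(query_vector, doc_vectors)[0]  # one score per answered question
top5 = scores.argsort()[::-1][:5]                         # keep only the top-five pairs
print(list(zip(top5, scores[top5])))
```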

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σᵢ₌₁ⁿ IDF(qᵢ, D) · f(qᵢ, D) · (k₁ + 1) / ( f(qᵢ, D) + k₁ · (1 − b + b · |D| / d_avg) )

where

• D is the document and Q is the query; qᵢ is the i-th term in the query Q.

• f(qᵢ, D) is the number of times qᵢ occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k₁ and b are both hyper-parameters. In general, k₁ = 2 and b = 0.75.

5.4.3 Word2Vec

We introduced Word2Vec, a static word embedding method, in Section 2.3.1. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury


2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between sentences.
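A sketch of this step with gensim is shown below; the file name pubmed_w2v.bin is a placeholder of ours for wherever the Chiu et al. [2016] vectors are stored:

```python
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path; the thesis uses the 200-dimensional PubMed vectors of Chiu et al. [2016].
vectors = KeyedVectors.load_word2vec_format('pubmed_w2v.bin', binary=True)

def sentence_vector(tokens):
    """Sum the embeddings of in-vocabulary tokens; OOV tokens are simply skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.sum(known, axis=0) if known else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(sentence_vector(["hot", "flush"]), sentence_vector(["night", "sweat"])))
```

Note that skipping out-of-vocabulary tokens is exactly where the mismatch discussed in Chapter 6 comes from: colloquial words and typos in Health24 questions find no vector in the PubMed-trained model.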

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). For multi-category classification, the confusion matrix is more complex, but it follows the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)           TP                    FP
Predicted Negative (0)           FN                    TN

Table 5.1: The confusion matrix


Figure 5.1: The diagram shows how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: Step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model by using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except the testing input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This method helps us to determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

Unpaired t-test = (x̄₁ − x̄₂) / √( s² (1/n₁ + 1/n₂) )

where

s² = [ Σᵢ₌₁^n₁ (xᵢ − x̄₁)² + Σⱼ₌₁^n₂ (xⱼ − x̄₂)² ] / (n₁ + n₂ − 2)

From Table 6.2 we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the Word2Vec result is higher than 0.05. We can explain this from two perspectives:

• From the scoring methods perspective: the pre-trained embedding model we use for Word2Vec is created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors will lack important information, because the words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find that the majority label the RQE model generated is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected. That is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682       0.9709         0.9693          0.9710
Iteration 2  0.9655       0.9731         0.9708          0.9699
Iteration 3  0.9676       0.9742         0.9747          0.9710
Iteration 4  0.9682       0.9692         0.9709          0.9726
Iteration 5  0.9731       0.9775         0.9743          0.9715
Average      0.9685       0.9730         0.9720          0.9712

Table 6.1: The accuracy of the different methods for the RQE task

          Added data number   p-value
TF-IDF         8084            0.0380
BM25           7390            0.0499
Word2Vec       1830            0.0514

Table 6.2: The added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs were added. There are two conditions when we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced, otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting the added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods.

When combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So as we add the extra data into the original data, it amounts to adding novel data, or noise, for the model. The estimator's ability would be affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, this small sample size cannot yield an accurate p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527       0.7601            0.7562            0.7453
Iteration 2  0.7522       0.7426            0.7426            0.7515
Iteration 3  0.7504       0.7469            0.7554            0.7540
Iteration 4  0.7402       0.7593            0.7536            0.7430
Iteration 5  0.7640       0.7657            0.7586            0.7569
Average      0.7523       0.7550            0.7533            0.7501

Table 6.3: The accuracy of the different methods for the NLI task

          Added data number   p-value
TF-IDF         1287            0.3340
BM25           1149            0.4191
Word2Vec        459            0.6661

Table 6.4: The added data numbers and p-values for the NLI task


                    Standard deviation of loss
MedNLI                      0.01392
MedNLI + TF-IDF             0.02166
MedNLI + BM25               0.01726
MedNLI + Word2Vec           0.01351

Table 6.5: The standard deviation of loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas which future work can adopt to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them into the biomedical domain, such as patient question answering, raises many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23755 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback. More specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited, because of the small size of the added data; however, it still plays a role there.



7.2 Future Work

Because of the limited time of the project, there are many places that could be improved:

• For this project we only have 1950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes smaller. Besides, BERT is a complex model; it easily overfits on a small data set. So for the next step we can expand our data. Then it will be more persuasive to validate our idea.

• In the previous section we noted that the added data for the NLI task differ from the original NLI data. For the next step we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For the folders: data includes the original data files and the data files generated by using RQE

and NLI. health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py,

pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy1.

raw_nli_data and raw_rqe_data store the source data. result stores the labels generated by using the RQE and NLI test models.

For the .py files: bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf2.

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and the processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface3.

load_file.py converts the raw source data to a dataframe; pre_processed_data pre-processes the sentences.

For the .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab4.

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; the .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell. If you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


2.1.2 RNN and LSTM

Unlike the traditional feed-forward neural network, where information is passed forward and cycles are forbidden, the recurrent neural network (RNN) allows cycles, and the neurons' output at time-step t depends on the output at time-step t−1. In other words, at each step the output of the current time-step depends on the previous one. RNN has sequential inputs, and ideally it can learn not only the current state of a node but also the previous states. However, it has a tendency for the gradient to vanish or explode over many steps. The reason is that RNN is trained by backpropagation through time. When the gradients pass back, they are calculated by continuous multiplications of derivatives. If the values of the derivatives are too small or too large, the continuous multiplications may cause the gradient to explode or vanish. Besides, there is some evidence showing that it is difficult for the traditional RNN to store information for a long time in practice.

Long short-term memory (LSTM) is a special kind of RNN architecture. It shows considerable performance in NLP fields such as sentiment classification and text synthesis. Compared with the standard RNN, it introduces special hidden units to store information for a long time: the memory cells. They connect with each other at the next time-step with particular weights [Graves et al. 2013]. Figure 2.2 shows a memory cell. Here are the composite functions:

it = σ(Wxi xt + Whi ht−1 + Wci ct−1 + bi)

ft = σ(Wxf xt + Whf ht−1 + Wcf ct−1 + bf)

ct = ft ct−1 + it tanh(Wxc xt + Whc ht−1 + bc)

ot = σ(Wxo xt + Who ht−1 + Wco ct + bo)

ht = ot tanh(ct)

where σ is the sigmoid function. A single memory cell consists of an input gate, a forget gate, an output gate and cell activation vectors, represented by i, f, o and c respectively. The W weight matrices from the cell to the gate vectors are diagonal, so element m of the input can only be received by element m in each gate vector [Graves et al. 2013].

Within a memory cell, LSTM can be divided into four steps:

• Based on the current input and ht−1, ft determines what information should be forgotten. If the output of the sigmoid function is 0, the information will be forgotten; otherwise the information will be remembered.

• The next step is to determine what new information we store in the cell state. There are two parts: first, it determines what information we will update; then a new candidate cell state c̃t is created and added to the state.

• Update ct by using the value in ft, the old state, the value in it and the current state; namely, ct = ft ct−1 + it c̃t.

• As for the output gate, a sigmoid function determines what part of the cell state will be outputted. Then a tanh function is applied to output the decided part.

Figure 2.2: A single memory cell [Graves et al. 2013]
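To make the gate equations concrete, here is a numpy sketch of one time-step; for brevity the peephole terms Wci, Wcf and Wco from the equations above are dropped, and each gate's weights act on the concatenated [xt, ht−1]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time-step (peephole connections omitted).
    W and b hold one weight matrix / bias vector per gate: 'i', 'f', 'c', 'o'."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W['i'] @ z + b['i'])                        # input gate
    f_t = sigmoid(W['f'] @ z + b['f'])                        # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(W['c'] @ z + b['c'])   # cell state update
    o_t = sigmoid(W['o'] @ z + b['o'])                        # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```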

2.1.3 Transformer

RNN and LSTM are the traditional approaches for sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al. 2017].

The model architecture is shown in Figure 2.3. From the figure we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N=6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. So the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented by the sub-layer itself. As for the decoder, the structure is the same as the encoder with a stack of N=6 identical layers, but instead of two sub-layers each layer has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al. 2017].



Figure 2.3: The Transformer architecture consists of an encoder (left part) and a decoder (right part) [Vaswani et al. 2017]

As for self-attention, its aim is to calculate the relatedness between different tokens in one sequence and find out which ones should receive more attention. For each token, three vectors (query, key and value) are generated during training by multiplying weights with the input embedding. The attention function is:

Attention(Q, K, V) = softmax( QKᵀ / √dk ) V

where dk is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to dk, dk and dv dimensions respectively:

MultiHead(Q, K, V) = Concat(head1, ..., headh) W^O

where headi = Attention(Q Wi^Q, K Wi^K, V Wi^V)

where Wi^Q ∈ R^(dmodel×dk), Wi^K ∈ R^(dmodel×dk), Wi^V ∈ R^(dmodel×dv), and W^O ∈ R^(h·dv×dmodel). The

benefit is that it allows the model to do subspace representation learning and helps the model focus on different positions. The final embedding is not just the word


anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, since the model uses neither recurrence nor convolution.
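A direct numpy transcription of the scaled dot-product attention formula (single head, no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, d_k = d_v = 8
print(attention(Q, K, V).shape)  # (4, 8)
```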

2.2 Word Embedding

When dealing with natural language, it is always problematic that text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can affect model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or using their own way to calculate word embeddings for their own data sets. Broadly, word embeddings can be separated into two kinds: static word embeddings and dynamic word embeddings. In the remainder of this section, we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

Two architectures are introduced which can be used as part of Word2Vec to learn word embeddings: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, whose target is to predict the current word based on the context, Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected to the same position. Also, in the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model to predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The figure is in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013]
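With gensim, switching between the two architectures is a single flag; the toy corpus below is our own (real embeddings need corpora with billions of tokens):

```python
from gensim.models import Word2Vec

# Toy corpus, just to show the API shape.
sentences = [["patient", "has", "hot", "flushes"],
             ["protein", "shakes", "help", "with", "weight", "loss"]]

# sg=0 trains CBOW, sg=1 trains Skip-gram.
model = Word2Vec(sentences, vector_size=200, window=5, sg=1, min_count=1)
print(model.wv["patient"].shape)  # (200,)
```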

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers think that although Word2Vec does a good job on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, which may fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures the global corpus statistics directly with its novel method.

The main intuition of GloVe comes from the simple observation that the ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established: X represents the matrix of word-word co-occurrence counts, where the entry Xij is the number of times word j occurs in the context of word i within specific context windows; Xi is the number of times any word occurs in the context of word i, that is, Σk Xik. Besides, Pij = P(j|i) = Xij / Xi is the probability that word j appears in the context of word i.



Figure 2.5: The measurement of co-occurrence probabilities for the selected context words and the target words ice and steam [Pennington et al. 2014]

In Figure 2.5, suppose i = ice and j = steam. We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is:

J = Σᵢ,ⱼ₌₁^V f(Xij) ( wᵢᵀ w̃ⱼ + bᵢ + b̃ⱼ − log Xij )²

where V is the size of the vocabulary, f(Xij) is a weighting function, wᵢ ∈ Rᵈ is a word vector, w̃ⱼ ∈ Rᵈ is a context word vector, and bᵢ and b̃ⱼ are biases. The weighting function has the following properties. First, f(0) = 0: if one word does not co-occur with other words, namely Xij = 0, it contributes nothing. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is a piecewise function:

f(x) = (x/xmax)^α   if x < xmax
f(x) = 1            otherwise

where xmax is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
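The weighting function is easy to state in code; the values below just illustrate its shape:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): zero for no co-occurrence, sublinear growth, capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

for count in (0, 1, 10, 100, 1000):
    print(count, round(glove_weight(count), 4))  # 0.0, 0.0316, 0.1778, 1.0, 1.0
```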


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside the river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This means the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend going from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: one is ELMo and the other is BERT. Before introducing these two methods, pre-training is briefly discussed, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers, because complicated functions representing high-level abstractions can be learned with these deep architectures [Erhan et al., 2010]. Erhan et al. [2010] also suggest that training deep architectures with standard schemes (random initialization) tends to yield parameters with poor generalization, whereas the improvement from a pre-training strategy is impressive for deep models. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus of labeled or unlabeled data; the pre-trained model is then fine-tuned on a downstream task with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: the model must capture the complex characteristics of word use, and words have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. Their experiments show ELMo works extremely well in practice, with up to 20% relative error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning each word its embedding; in other words, each token is assigned a representation that is a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), hence "embeddings from language models". The two layers are stacked together, and each has two passes, forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it; the backward pass is similar, except its state contains information about the word itself and the context after it. The pair of forward and backward states produces an intermediate word vector, which represents the word's meaning but also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, the forward and backward passes form a multilayer LSTM, as shown in Figure 2.6. The intermediate word vectors produced by the lower layer are fed into the higher layer; basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each token t_k, there are 2L + 1 representations in an L-layer biLM:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \}

As for the final word representation used in downstream NLP tasks, ELMo combines all layers in R_k into a single vector with task-specific weights:

ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s^{task} are softmax-normalized weights, \gamma^{task} is a scalar parameter, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}].
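As an illustration of this weighted combination (a sketch of the idea, not ELMo's actual implementation), a small numpy version:

import numpy as np

def elmo_combine(layer_vectors, s_task, gamma_task):
    # layer_vectors: (L + 1, dim) array holding x_k plus the concatenated
    #                [forward; backward] vector of each biLM layer.
    # s_task:        unnormalized layer weights, shape (L + 1,).
    # gamma_task:    task-specific scalar.
    s = np.exp(s_task) / np.exp(s_task).sum()   # softmax normalization
    return gamma_task * (s[:, None] * layer_vectors).sum(axis=0)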

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream task [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they perform similarly, except that in the fine-tuning approach a general-purpose representation can be adapted to many different downstream tasks [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it simply concatenates the left-to-right and right-to-left information; the representation still cannot simultaneously take advantage of both left and right contexts. By contrast, the Transformer in BERT is deeply bidirectional, since it computes attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of its token embedding, segment embedding, and position embedding.
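As a hedged illustration of this input layout, the huggingface tokenizer (whose run_classifier.py the project's code builds on) shows where [CLS], [SEP], and the segment ids go; the checkpoint name here is just an example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
encoded = tokenizer.encode_plus(
    "is it safe to take allergex while pregnant",         # sentence A
    "what are the dangers of allergex during pregnancy",  # sentence B
    return_token_type_ids=True,
)
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
# tokens                     -> ['[CLS]', 'is', ..., '[SEP]', 'what', ..., '[SEP]']
# encoded["token_type_ids"]  -> segment ids: 0 for sentence A, 1 for sentence B
print(tokens, encoded["token_type_ids"])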

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For the masked LM, BERT masks some percentage (15%) of the input tokens at random and then predicts these masked tokens. Although this allows a bidirectional pre-trained model, a downside is the mismatch created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step. To mitigate this, BERT replaces a selected token with [MASK] 80% of the time, with a random token 10% of the time, and keeps the original token the remaining 10% of the time. In addition, BERT pre-trains on a binarized NSP task to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this relationship is hard to capture from language modeling directly without knowing the meaning of the two sentences; with NSP, the pre-trained model is very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project; how it is fine-tuned is described in the following chapters.
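The 80/10/10 corruption rule can be sketched as follows; this illustrates the strategy, not BERT's original pre-training code:

import random

def corrupt_token(token, vocab, select_prob=0.15):
    """Decide how one input token is treated by the masked LM."""
    if random.random() >= select_prob:
        return token, False                # not selected for prediction
    r = random.random()
    if r < 0.8:
        return "[MASK]", True              # 80%: replace with [MASK]
    elif r < 0.9:
        return random.choice(vocab), True  # 10%: replace with a random token
    else:
        return token, True                 # 10%: keep the original token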

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. Or, in plain English, it determines whether question A and question B are similar questions. The task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm  weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B (entailment)

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better, Honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just between questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. It is a multi-class classification problem with three labels: entailment, neutral, and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRT's. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon, it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once it's obtained it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task validates whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce the work related to our purpose and methods. We also explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that applies RQE and NLI to medical data sets in order to find relations between sentences. In Section 3.3 we describe work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques for inference and entailment in the medical domain; such applications could help improve domain-specific information retrieval and question answering systems. The competition includes three tasks targeting the medical domain: NLI, RQE, and question answering (QA). In this section and the next, we discuss some of the approaches the competing teams applied to RQE and NLI.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes, and J48 respectively [Abacha and Demner-Fushman, 2016]. In the competition, the highlight was that deep networks reach better generalizations and abstractions of the questions; the commonly used approaches combine ensemble methods with multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. Trying different traditional deep learning methods, Zhu et al. [2019] find they do not perform well compared with pre-trained language models; they suggest one reason is that traditional deep learning models typically use fixed pre-trained word embeddings to map words into vectors, and these embeddings do not consider context. Zhu et al. [2019] also find that, among pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism built on BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models across different tasks. A mismatch exists between the training and validation sets in the competition data: the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain is by Romanov and Shivade [2018], who introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They establish baselines with a feature-based system, a Bag-of-Words (BOW) model, an InferSent model, and an ESIM model; among them, InferSent performs best. In the competition, as with the RQE task, the common approach was to train MT-DNN and BERT family models and then use an ensemble as the final system [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI whose basic model consists of three encoders (syntax encoder, text encoder, and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations; for the syntax encoder they use a Tree-LSTM for better linguistic understanding; for the feature encoder they concatenate the outputs of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general one expert is not enough, since single-expert results can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many techniques exist for generating and labeling the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data using weak supervision, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea is that users first write labeling functions for the unlabeled data; several labeling functions are then passed to a generative model to estimate the accuracy of the different functions, the accuracies are used for re-weighting, and probabilistic labels are produced from the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018], who use an unsupervised key phrase detector and a generator to produce data in the medical domain. The results show the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we explain the motivation for choosing the data set and demonstrate how the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering, and the existing medical query data sets have limitations; for example, the data may be too simple to mimic real human behavior, or a lexical gap may exist between patients and experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we believe the data set should have the following features:

• The data source must be reliable, since it sets the standard in this domain and makes the work easier to convince others of. A reliable data set is also crucial for ideas and models, because it determines whether they succeed.

• The data must come from the real world. In other words, the questions should be asked by patients rather than found or extracted from professional materials, and the answers should come from certified doctors. In this way the constructed data have good quality and can be used in data-driven approaches.

• The data set should be large. As mentioned before, with a moderate or small data set the model is prone to overfitting, which affects accuracy.

• The data set should not contain sensitive information such as address, date of birth, email address, or phone number; otherwise it would violate patients' privacy.

Based on the features described above, we find Health24¹ is a good choice for construction. It is South Africa's leading consumer health website and provides

¹Health24: https://www.health24.com



world-class information for a healthy lifestyle. Besides, there is an interactive tool to help patients answer their questions. These questions and answers are both public and are continuously updated; hence we use them to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 top-level categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We crawl only the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional, whereas users' answers may mismatch or mislead the questions.

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under the 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file the first column is the answer and the second column is the question. For the crawled data we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words might affect the results. For example, "My  month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". Sentence lengths are given in Table 4.1. The average lengths of the three types are close together, all around 85 to 95 words per sentence; however, the maximum lengths differ hugely. Checking the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible, and the experts answer step by step, telling patients what they should and should not do. Thus we keep these informative sentences in the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy"; we treat their questions as unanswered, since the experts' information cannot resolve the patients' queries.

²Scrapy: https://scrapy.org


                        Average Length    Maximum Length
Answered Questions      95.40             5417
Unanswered Questions    85.02             1890
Answers                 92.58             6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. We also explain the details of fine-tuning and the three scoring methods used to filter similar questions, as well as the evaluation metrics.

5.1 Pre-processing

Sentences contain many medical concepts that can appear either as abbreviations or as multi-word expressions while standing for the same meaning. To expand abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25, and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase, then remove punctuation. Next, we tokenize the text to split the sentences into individual units, remove stop words, and apply stemming to transform the words into their bases or roots. Through these steps the raw questions are transformed into a form more suitable for the following tasks; a minimal sketch of the pipeline is shown below.
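A minimal sketch of this pipeline with NLTK (the Porter stemmer and the standard English stop-word list are common choices; the project's code may differ in details):

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-off downloads may be needed: nltk.download('punkt'), nltk.download('stopwords')
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question):
    """Lowercase, strip punctuation, tokenize, drop stop words, stem."""
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(t) for t in word_tokenize(text) if t not in stop_words]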

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model: we only need to plug the input and output into BERT and fine-tune all parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so for a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the result, instead of summing all the word representations. Moreover, RQE and NLI are both classification problems; the only difference is that one is binary and the other is three-label classification, so they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. Analyzing these data, we find a mismatch between the training and validation sets: the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. It includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As before, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning, we follow the suggestions of Devlin et al. [2018], setting the batch size to 16, gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. We set the number of epochs to 1 and 2 for the RQE and NLI tasks respectively. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface¹.
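A condensed sketch of the training loop with these hyperparameters; the checkpoint path and train_loader are placeholders, and the actual project code is the modified run_classifier.py:

import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "scibert_scivocab_uncased", num_labels=2)   # 2 labels for RQE, 3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
grad_accum_steps = 4                             # used with batch size 16

model.train()
for step, batch in enumerate(train_loader):      # train_loader is assumed given
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
    loss = outputs[0] / grad_accum_steps         # first output is the loss
    loss.backward()
    if (step + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()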

5.3 Method

To crowdsource high-quality data from Health24 we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to unanswered questions in Health24. Then, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain: under the assumption that the top-ranked documents contain many useful terms that help discriminate relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance

¹Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since relevant documents missed in the first round can be retrieved in the second [Cao et al., 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model, then evaluate the model on the same data using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set, feeding in the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with labels to the original CHQs data and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions, but with 10 unanswered and 100 answered questions this already yields 1,000 pairs, and not every pair is similar. To reduce the running time and time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions and take only the top five question pairs. This step reduces the number of pairs in the test set by filtering out obviously dissimilar pairs; details are given in Section 5.4. Note that the added data must be balanced, otherwise the model will be biased toward the majority class and the result will not be accurate. The NLI task follows similar steps, under the assumption that if an unanswered question and an answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with maximum probability, we set 0.6 and 0.4 as thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps resolve ambiguous cases and reduces the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise it is accepted. In general, rejected data would be classified by experts, but due to the limited time of the project and the lack of experts, rejected data are removed from the data set.
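The rejection rule itself is simple; a sketch:

import numpy as np

def predict_with_rejection(probs, threshold):
    # probs: softmax output of the fine-tuned model, e.g. [0.55, 0.45]
    # threshold: 0.6 for the binary RQE task, 0.4 for the three-label NLI task
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return None                 # rejected as too ambiguous
    return int(probs.argmax())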

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure of how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: term frequency measures the number of times the word occurs in a particular document, and inverse document frequency measures how much information the word provides (if the result is close to 0, the word is common, and vice versa). These two concepts together form TF-IDF:

\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)

\text{TF}(t, d) = \log(1 + freq(t, d))

\text{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where t is the word in question, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences we can use cosine similarity:

\text{similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}

where A and B are both vectors; in this case each element of A and B is one word's TF-IDF value.
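The project's tfidf.py builds on a gensim-based script; an equivalent sketch with scikit-learn (the question texts here are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["my question is with regards to allergex tablets",
            "i am diabetes 2 type which is better honey or sugar"]
unanswered = ["is it safe to take allergex while pregnant"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(answered)        # one row per answered question
query_vector = vectorizer.transform(unanswered)
scores = cosine_similarity(query_vector, doc_vectors)[0]
top5 = scores.argsort()[::-1][:5]   # indices of the five most similar questions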

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query:

\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot |D| / d_{avg})}

where

• D is the document and Q is the query; q_i is the i-th term of Q

• f(q_i, D) is the number of times q_i occurs in document D

• |D| is the length of the document

• d_avg is the average document length

• k_1 and b are hyper-parameters; in general k_1 = 2, b = 0.75 (a sketch implementing this scoring function follows the list)
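A sketch implementing the formula above; since the text leaves IDF(q_i, D) unspecified, the common Okapi IDF variant is assumed here:

import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, d_avg, k1=2.0, b=0.75):
    # doc_freq maps a term to the number of documents containing it.
    score = 0.0
    for q in query_terms:
        f = doc_terms.count(q)                  # f(q_i, D)
        if f == 0:
            continue
        n = doc_freq.get(q, 0)
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)   # Okapi IDF (assumed)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / d_avg))
    return score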

5.4.3 Word2Vec

Word2Vec, introduced in Section 2.3.1, is a static word embedding method. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations: about 2.2 million 200-dimensional word vectors trained on PubMed [Tilbury, 2018]. For each sentence we add the word embedding vectors together and then use cosine similarity to compare sentences.
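A sketch with gensim; the embedding file path is a placeholder for the Chiu et al. PubMed vectors:

import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary tokens; out-of-vocabulary words
    # (colloquialisms, typos) are simply skipped, as discussed in Chapter 6.
    vecs = [vectors[t] for t in tokens if t in vectors]
    return np.sum(vecs, axis=0) if vecs else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))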

5.5 Evaluation Metrics

For evaluation we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known; we show only the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, rows represent the predicted class and columns the actual class. TP counts instances correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite, where the true class is 1 and the predicted class is 0; TN counts instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct: accuracy = (TP + TN) / total. Recall measures the percentage of relevant results correctly classified: recall = TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same. (A short scikit-learn sketch computing these metrics follows Table 5.1.)

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)   TP                    FP
Predicted Negative (0)   FN                    TN

Table 5.1: The confusion matrix
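These metrics can be computed directly with scikit-learn; note that sklearn's confusion matrix is transposed relative to Table 5.1 (its rows are actual classes):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]] for labels {0, 1}
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(recall_score(y_true, y_pred))       # TP / (TP + FN)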


Figure 5.1: The process of the RQE and NLI tasks; the upper diagram is RQE and the lower one NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model. Step 2, evaluate the model on the CHQs data with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potentially similar question) pair. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we apply the three similarity score methods and compare their results. From the tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. To validate our approach we calculate p-values, which determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that they are related.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that it can. Before obtaining a p-value, a t-test must be computed. Here we use the unpaired t-test, since we use a different data set for testing in each iteration:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

where

s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
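For instance, with the accuracy columns of Table 6.1 this test can be run with scipy (a sketch; whether the thesis used a one- or two-sided p-value is not stated, so the halving below is an assumption):

from scipy import stats

baseline   = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # CHQs column of Table 6.1
with_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]   # CHQs + TF-IDF column

# Unpaired (independent) two-sample t-test with pooled variance.
t, p_two_sided = stats.ttest_ind(baseline, with_tfidf, equal_var=True)
p_one_sided = p_two_sided / 2    # directional hypothesis: added data improves accuracy
print(t, p_one_sided)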

From Table 6.2 we can see that only the results for TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. Word2Vec, however, is above 0.05; we can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed, whose sources comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are formal with few typos. The Health24 questions, however, are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions, and some sentence vectors lack important information because words in the real questions have no corresponding vectors in the pre-trained model.

• From the data perspective: owing to the inaccurate word embeddings, the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. To keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of relevant results correctly classified. The average recall on the original data (CHQs) is 0.9785; after adding the extra data extracted by the Word2Vec method, it barely changes, at 0.9786. This reflects that the model's ability does not decrease because of the added data; that is, the added data do not carry much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9730          0.9720        0.9712

Table 6.1: Accuracy of the different methods on the RQE task

           Added data   p-value
TF-IDF     8084         0.0380
BM25       7390         0.0499
Word2Vec   1830         0.0514

Table 6.2: The amount of added data and the p-values for the RQE task

Table 6.4 shows the amount of added data and the p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest amount is only 1,287, whereas in the RQE task more than 8,000 pairs were added. Two conditions apply when we select added data for the NLI task:

• First, we assume that only the answers of answered questions in the true question pairs from the RQE task have some probability of being inferable from their relevant unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced, otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base when selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number per class for the NLI added data. Moreover, NLI is a three-label classification problem, so compared with binary classification each class has less data.

These are the reasons why the added data for the NLI task are small.

Table 6.5 gives the standard deviation of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability: the added data are question-answer pairs, but the original MedNLI data are sentence pairs, so the two kinds of data have different formats. Also, the amount of extra data is not large (at most about 1,287), so adding it to the original data effectively adds novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware we split the data into only 5 parts for k-fold cross-validation, so we obtain only five accuracy values per method. Such a small sample size cannot yield an accurate p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7519   0.7549            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task

           Added data   p-value
TF-IDF     1287         0.3340
BM25       1149         0.4191
Word2Vec   459          0.6661

Table 6.4: The amount of added data and the p-values for the NLI task


                      Standard deviation of loss
MedNLI                0.01392
MedNLI + TF-IDF       0.02166
MedNLI + BM25         0.01726
MedNLI + Word2Vec     0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, faces many challenges, including the limited amount of annotated data and the quality and content of the existing data. To address some of these issues and construct a data set usable by data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set is crawled from Health24; within it, 23,755 questions have answers and 1,950 questions do not. Our goal is to find answers for these 1,950 questions among the existing answers. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to decide whether two questions are similar we could pair each unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model on the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we have no ground truth against which to evaluate the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find the proposed method can construct high-quality question pairs with BM25 and TF-IDF in the RQE task. For the NLI task the method is more limited because of the small amount of added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are several places to improve:

• For this project we have only 1,950 questions without answers, and after the two conditions applied in the NLI task, the amount of added data becomes smaller still. Besides, BERT is a complex model, and a small amount of data easily causes overfitting. The next step is therefore to expand our data, which would make validating our idea more persuasive.

• In the previous section we noted that the added data for the NLI task differ in kind from the original NLI data. The next step is to find more suitable data whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have limitations. The next step is to optimize the way Word2Vec is used.

• Fine-tuning and label generation were done in Colab. Due to limited hardware and time, we did not search for the best hyperparameters for these models; the next step is hyperparameter optimization. We can also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix 1: Project Description and Contract

From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: folders that store the source data and the generated data, .py files, and .ipynb files.

For folders:

data: includes the original data files and the data files generated using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data. result: stores the labels generated by the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py, and word_embedding.py are the three files that filter similar question pairs; their results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified from ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model; this code is from run_classifier.py in huggingface³.

load_file.py converts the raw source data to dataframes; pre_processed_data pre-processes the sentences.

For .ipynb files:

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and performs k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified from run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab⁴.

⁴Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files, used to process the data, and .ipynb files, used for fine-tuning, evaluating, and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert generated similar question pairs to data frames (this may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell (here we run both). The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.

Contents

• Acknowledgments
• Abstract
• Contents
• Introduction
  • Motivation
  • Main Contribution
  • Outline
• Background
  • Background on NLP
    • Neural Network Language Model
    • RNN and LSTM
    • Transformer
  • Word Embedding
  • Static Word Embedding
    • Word2Vec
    • GloVe
  • Dynamic Word Embedding
    • Pre-training
    • ELMo
    • BERT
  • Tasks
    • Recognizing question entailment (RQE)
    • Natural Language Inference (NLI)
• Related Work
  • RQE
  • NLI
  • Augment Data Size
• Construction of Health24 Data set
  • Motivation
  • Collecting Data set
  • Analysis for Data set
• Methodology
  • Pre-processing
  • Fine-Tuning
  • Method
  • Similarity Scores Methods
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Best Matching 25 (BM25)
    • Word2Vec
  • Evaluation Metrics
• Discussion on Results
• Conclusion and Future Work
  • Conclusion
  • Future Work
• Bibliography
• Appendices
• Appendix 1: Project Description and Contract
• Appendix 2: Artefact Description
• Appendix 3: Readme File


Figure 2.2: A single memory cell [Graves et al., 2013]

state. Namely, updating $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$.

• As for the output gate, use a sigmoid function to determine what part of the cell state will be outputted. Then apply a tanh function to output the decided part.

2.1.3 Transformer

RNN and LSTM are the traditional approaches to sequence modeling and transduction problems such as language translation. The Transformer is another sequence transduction model, one that allows parallelization. It achieves high performance with significantly less training time [Vaswani et al., 2017].

The model architecture is shown in Figure 2.3. From the figure, we can see the model is composed of two parts: an encoder and a decoder. The encoder consists of a stack of N = 6 identical layers. Each layer has two sub-layers: one is multi-head attention and the other is a position-wise fully connected feed-forward network. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the output of the function implemented by the sub-layer itself. The decoder has the same structure as the encoder, with a stack of N = 6 identical


layers. Instead of the two sub-layers in each encoder layer, each decoder layer has three. The extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer architecture, which consists of an encoder (left part) and a decoder (right part) [Vaswani et al., 2017]

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and find out which ones should receive more attention. For each token, the three vectors (query, key and value) are generated during training by multiplying the input embedding with learned weights. The attention function is

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$$

$$\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$. The benefit is that this allows the model to learn subspace representations and helps it focus on different positions, so the final embedding is no longer just the word


itself. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model uses neither recurrence nor convolution.
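
To make this concrete, here is a minimal NumPy sketch of (single-head) scaled dot-product attention; the array shapes and the single-head simplification are illustrative assumptions, not the original Transformer code:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # relatedness between every pair of tokens
    return softmax(scores) @ V       # attention-weighted sum of value vectors

# toy usage: 4 tokens with d_k = d_v = 8
Q = K = V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)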

2.2 Word Embedding

When dealing with natural language, it is always problematic that text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation can impact model performance. Word embedding is the dominating approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or calculating word embeddings for their own data sets. Word embeddings can be separated into two groups: static word embeddings and dynamic word embeddings. In the remaining part, we introduce several different word embedding approaches ordered by their complexity.

2.3 Static Word Embedding

A static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. The proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two introduced architectures, which can be used as part of Word2Vec to learn word embeddings, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. Both models are similar to the feed-forward neural net language model (NNLM) introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared by all words; thus all words are projected into the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target, which helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two architectures are shown in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013]

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers argue that although Word2Vec does a good job on analogy tasks, it only trains the word embeddings on separate local context windows instead of on global co-occurrence counts, and may therefore fail to exploit the huge amount of repetition in the data. GloVe captures global corpus statistics directly.

The main intuition of GloVe comes from the simple observation that the ratios of word-word


co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. $X$ represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word j occurs in the context of word i within the specific context windows. $X_i$ is the number of times any word occurs in the context of word i, i.e. $X_i = \sum_k X_{ik}$. Besides, $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can

Figure 2.5: The measurement of co-occurrence probabilities from the selected context words and the target words ice and steam [Pennington et al., 2014]

study the ratio of their co-occurrence probabilities with different context words k to examine the relationship between these words. We can see solid co-occurs more frequently with ice than it does with steam. Similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, and the cost function is

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where V is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i \in \mathbb{R}^d$ is the word vector, $\tilde{w}_j \in \mathbb{R}^d$ is the context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function should have the following properties. First, $f(0) = 0$, so that a word that never co-occurs with another word (i.e., $X_{ij} = 0$) contributes nothing. Second, the function should be non-decreasing, in order to guarantee that frequent co-occurrences receive higher weight than rare co-occurrences. Third, it should not overweight very frequent co-occurrences. A weighting function that satisfies these properties and works well in practice is the piecewise function

$$f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if } x < x_{max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{max} = 100$ and $\alpha = 3/4$. More detailed information can be found in Pennington et al. [2014].
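
Written as code, the weighting function is a one-liner (a sketch under our reading of the formula above):

def glove_weight(x, x_max=100.0, alpha=0.75):
    # caps the influence of very frequent co-occurrences; returns 0.0 when x == 0
    return (x / x_max) ** alpha if x < x_max else 1.0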


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This means the word vectors cannot be adjusted based on context. To address these issues, there is a new trend to go from static word embedding to dynamic word embedding, which considers word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section we introduce two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly described, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions through these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that training deep architectures with standard training schemes (random initialization) tends to make the parameters generalize poorly, whereas the improvement from a pre-training strategy is impressive for deep models. Nowadays, in many types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that language models learn contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream task with a small labeled data set.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model needs to consider not only the complex characteristics of word use but also that words have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show ELMo works extremely well in practice, with up to 20% error reductions compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning an embedding to each word. In other words, each token is assigned a representation that is a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about the word itself and the context before it. The backward pass is similar, except that the state contains information about the word itself and the context after it. The pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to the intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with some particular weights [Josht]

In addition, these forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token $t_k$, there are $2L + 1$ representations in an $L$-layer biLM:

$$R_k = \left\{ x_k^{LM},\; \overrightarrow{h}_{k,j}^{LM},\; \overleftarrow{h}_{k,j}^{LM} \;\middle|\; j = 1, \dots, L \right\}$$

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in $R_k$ into a single vector based on some particular weights:

$$\text{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = \left[ \overrightarrow{h}_{k,j}^{LM};\; \overleftarrow{h}_{k,j}^{LM} \right]$.

2.4.3 BERT

Currently, there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the Transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the Transformer is given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT prepends [CLS] to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, segment embedding and position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to obtain a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the


fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this relationship is hard to capture from language modeling directly without knowing the meaning of the two sentences. With NSP, the pre-trained model is very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project. How we fine-tune it is introduced in the following chapters.

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. Or, in plain English, it determines whether question A and question B are similar questions. This task is binary classification: it includes positive (two questions are similar) and negative (two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and/or during pregnancy.

• Question A → Question B (entailment)

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.

• (Answered) Question B: I am diabetes 2 Type. Which is better, Honey or sugar?

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multi-class classification problem with three labels: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes.

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that I use protein shakes in my weight loss goals?

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it's obtained, it's advisable to keep monitoring it with the doctor at all times.

• Label: neutral

In the context of our project, this task is used to validate whether the answers of similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter, we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2, we introduce work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3, we demonstrate work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; such applications could help improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), targeting the medical domain. In this section and the next, we discuss some of the work on RQE and NLI among the competition teams.

The initial work for RQE used lexical and semantic features as the input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48, respectively, to obtain results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that using deep networks achieves better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider context. Meanwhile, Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism based on BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap, and the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data are created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), the InferSent model and the ESIM model to establish baselines; among them, InferSent performs best. In the competition, as with the RQE task, the common approach is to train MT-DNN and BERT family models and then use an ensemble as the final system to generate results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder, they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder, they use Tree-LSTM to obtain a better linguistic understanding. For the feature encoder, they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision methods, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific domain


is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to produce data in the medical domain. The results show that the generated data significantly improve the examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations; for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes the work easier to convince others. Also, a reliable data set is crucial for ideas and models because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality, and they can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model can easily overfit, which affects accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24 1 is a good choice for construction. This is South Africa's leading consumer health website, and it provides

1 Health24: https://www.health24.com


world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients that helps them answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy,2 a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; mismatched or misleading content may exist between questions and other users' answers.
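
A minimal spider along these lines might look as follows. The spider name matches the scrapy crawl health command in the readme, but the start URL and CSS selectors are hypothetical placeholders that have to be adapted to the actual page structure:

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"                              # invoked via: scrapy crawl health
    start_urls = ["https://www.health24.com/"]   # placeholder; real category pages differ

    def parse(self, response):
        # hypothetical selectors: inspect the real pages to find the right ones
        for qa in response.css("div.qa-item"):
            yield {
                "question": qa.css(".question::text").get(),
                "answer": qa.css(".answer::text").get(),
            }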

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My  month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see the average lengths of the three different types are close together; they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                     | Average Length | Maximum Length
Answered Questions   | 95.40          | 5417
Unanswered Questions | 85.02          | 1890
Answers              | 92.58          | 6441

Table 4.1: The length (in words) of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between questions, we also pre-process the original questions. We first convert all words into lowercase. Then we remove any punctuation. Following that, we tokenize the text in order to split the sentences into individual units. We also remove stop words. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
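
A sketch of this pre-processing pipeline with NLTK (the function name is ours):

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# one-time downloads: nltk.download("punkt"); nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question):
    text = question.lower()                                           # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    tokens = word_tokenize(text)                                      # tokenize
    tokens = [t for t in tokens if t not in stop_words]               # remove stop words
    return [stemmer.stem(t) for t in tokens]                          # stem to roots

print(preprocess("Is it safe to take allergex while pregnant?"))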

5.2 Fine-Tuning

For downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both


classification problems. The only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 for the RQE and NLI tasks, respectively. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface.1
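
As a rough illustration of this setup, the sketch below fine-tunes a sequence-pair classifier with the hyper-parameters above using the huggingface transformers library. The checkpoint path and the toy train_pairs list are placeholders; the real pipeline uses the modified run_classifier.py:

import torch
from transformers import BertForSequenceClassification, BertTokenizer

model_dir = "scibert_scivocab_uncased"  # unzipped SciBERT checkpoint (placeholder path)
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2)  # 3 for NLI

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate 2e-5
accumulation_steps = 4                                      # gradient accumulation steps

train_pairs = [("is it safe to take allergex while pregnant",
                "dangers of using allergex during pregnancy", 1)]  # toy placeholder data

model.train()
for step, (q1, q2, label) in enumerate(train_pairs):
    enc = tokenizer(q1, q2, truncation=True, return_tensors="pt")  # [CLS] q1 [SEP] q2 [SEP]
    loss = model(**enc, labels=torch.tensor([label])).loss
    (loss / accumulation_steps).backward()  # accumulate gradients over 4 steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()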

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms that can help discriminate the relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance

1 Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time and time complexity, and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the details are introduced in Section 5.4. Besides, one thing to note is that the added data must be balanced; otherwise the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar, under one assumption: we assume that if an unanswered question and an answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications, respectively. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
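
A minimal sketch of this rejection rule (the function name is ours; logits stand for the fine-tuned model's raw output for one pair):

import torch

def label_with_rejection(logits, threshold):
    probs = torch.softmax(logits, dim=-1)
    max_prob, label = probs.max(dim=-1)
    # reject ambiguous pairs instead of forcing a label
    return label.item() if max_prob.item() >= threshold else None

logits = torch.tensor([0.3, 0.4])          # toy model output for one pair
print(label_with_rejection(logits, 0.6))   # 0.6 for RQE; 0.4 for three-label NLI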

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document. The other


one is inverse document frequency, which measures how much information the word provides. If the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF; here are the formulas:

$$\text{TF-IDF}(t, d, D) = TF(t, d) \times IDF(t, D)$$

$$TF(t, d) = \log(1 + freq(t, d))$$

$$IDF(t, D) = \frac{N}{count(d \in D : t \in d)}$$

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only measure a single word; to calculate the similarity between two sentences, we can use cosine similarity:

$$\text{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}$$

where A and B are both vectors. In this case, each element of A and B represents one word's TF-IDF value.
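
A small sketch of this ranking step with scikit-learn; the example questions are paraphrased placeholders, and the actual implementation lives in tfidf.py:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["my question is with regards to allergex tablets during pregnancy",
            "i am diabetes 2 type which is better honey or sugar"]
unanswered = "is it safe to take allergex while pregnant"

vec = TfidfVectorizer()
X = vec.fit_transform(answered + [unanswered])  # rows: answered questions + the query
sims = cosine_similarity(X[-1], X[:-1])[0]      # query vs. every answered question
top5 = sims.argsort()[::-1][:5]                 # indices of the five best candidates
print(top5, sims[top5])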

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

$$BM25(D, Q) = \sum_{i=1}^{n} IDF(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where

• D is the document, Q is the query, and $q_i$ is a term in the query Q;

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document D;

• $|D|$ is the length of the document;

• $d_{avg}$ is the average length of a document;

• $k_1$ and b are both hyper-parameters; in general, $k_1 = 2$ and $b = 0.75$.
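
The formula translates almost directly into code. In the sketch below, the smoothed IDF variant is our assumption, since the IDF term is not spelled out above:

import math
from collections import Counter

def bm25(query, doc, corpus, k1=2.0, b=0.75):
    # query, doc: lists of tokens; corpus: list of token lists (all documents)
    N = len(corpus)
    d_avg = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in corpus if q in d)             # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # a common smoothed IDF variant
        f = tf[q]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score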

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1 as a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed [Tilbury,


2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between sentences.
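
A sketch of this sentence-similarity computation with gensim; the vector file name is a placeholder for the Chiu et al. [2016] embeddings:

import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)  # placeholder file

def sentence_vector(tokens):
    vecs = [kv[w] for w in tokens if w in kv]  # skip out-of-vocabulary words
    return np.sum(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))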

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances belong to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN means the instances belong to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN)/total. Recall measures the percentage of the total relevant results correctly classified; its formula is TP/(TP + FN). As for multi-class classification, the confusion matrix is more complex, but it has the same idea.

                       | Actual Positive (1) | Actual Negative (0)
Predicted Positive (1) | TP                  | FP
Predicted Negative (0) | FN                  | TN

Table 5.1: The confusion matrix
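
These metrics can be computed directly from the confusion matrix, for example with scikit-learn (toy labels for illustration):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]  # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)
print(accuracy_score(y_true, y_pred))  # (TP + TN) / total
print(recall_score(y_true, y_pred))    # TP / (TP + FN)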


Figure 5.1: The diagram shows the process of the RQE and NLI tasks. The upper one is RQE and the bottom one is NLI. For the RQE task: Step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before obtaining a p-value, a t-test needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
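
In practice, such a test can be computed with SciPy; the sketch below uses the CHQs and CHQs + TF-IDF columns of Table 6.1 (SciPy reports a two-sided p-value by default, so whether it reproduces the exact values in Table 6.2 depends on the chosen sidedness):

from scipy import stats

chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # CHQs accuracies (Table 6.1)
tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF accuracies
t, p = stats.ttest_ind(tfidf, chqs)               # unpaired (independent) t-test
print(t, p)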

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec is trained on PubMed. The sources in PubMed comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all


formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. In order to keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of the total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

            | CHQs   | CHQs + TF-IDF | CHQs + BM25 | CHQs + Word2Vec
Iteration 1 | 0.9682 | 0.9709        | 0.9693      | 0.9710
Iteration 2 | 0.9655 | 0.9731        | 0.9708      | 0.9699
Iteration 3 | 0.9676 | 0.9742        | 0.9747      | 0.9710
Iteration 4 | 0.9682 | 0.9692        | 0.9709      | 0.9726
Iteration 5 | 0.9731 | 0.9775        | 0.9743      | 0.9715
Average     | 0.9685 | 0.9730        | 0.9720      | 0.9712

Table 6.1: The accuracy of the different methods on the RQE task

         | added data number | p-value
TF-IDF   | 8084              | 0.0380
BM25     | 7390              | 0.0499
Word2Vec | 1830              | 0.0514

Table 6.2: The number of added pairs and the p-values for the RQE task

Table 6.4 shows the number of added pairs and the p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data exceed 8,000 pairs. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their


relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task. On the other hand, NLI is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287 pairs). So as we add the extra data into the original data, we are effectively adding novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, such a small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

            | MedNLI | MedNLI + TF-IDF | MedNLI + BM25 | MedNLI + Word2Vec
Iteration 1 | 0.7527 | 0.7601          | 0.7562        | 0.7453
Iteration 2 | 0.7522 | 0.7426          | 0.7426        | 0.7515
Iteration 3 | 0.7504 | 0.7469          | 0.7554        | 0.7540
Iteration 4 | 0.7402 | 0.7593          | 0.7536        | 0.7430
Iteration 5 | 0.7640 | 0.7657          | 0.7586        | 0.7569
Average     | 0.7519 | 0.7549          | 0.7533        | 0.7501

Table 6.3: The accuracy of the different methods on the NLI task

         | added data number | p-value
TF-IDF   | 1287              | 0.3340
BM25     | 1149              | 0.4191
Word2Vec | 459               | 0.6661

Table 6.4: The number of added pairs and the p-values for the NLI task


                   | Standard deviation of loss
MedNLI             | 0.01392
MedNLI + TF-IDF    | 0.02166
MedNLI + BM25      | 0.01726
MedNLI + Word2Vec  | 0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set that can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and


TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model, so it is easy to overfit with a small data set. For the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task differ in kind from the original NLI data. For the next step, we can find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization; we can also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19 and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20 and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137-1155. (cited on pages ix, 5, 6 and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27 and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7 and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices



Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au


Appendix 2: Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the .py files; and the .ipynb files.

For folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files used to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy¹.
raw_nli_data and raw_rqe_data: store the source data.
result: stores the labels generated by using the RQE and NLI test models.

For .py files:
bm25.py, tfidf.py and word_embedding.py are the three files used to filter the similar question pairs; their results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².
rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes for fine-tuning and testing the BERT model.
convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model; these codes are from run_classifier.py in huggingface³.
load_file.py converts the raw source data to a dataframe; pre_processed_data.py pre-processes the sentences.

For .ipynb files:
train_bert.ipynb and test.ipynb: the former is used to fine-tune the BERT model and do k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab⁴.

Figure 1: The tree structure of the health24 folder.

1. Scrapy: https://scrapy.org
2. ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3. run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
4. Colab: https://colab.research.google.com


Appendix 3: Readme File

The project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used to fine-tune, evaluate and generate labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run the Health24 crawler:
cd health24
scrapy crawl health
Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py
Use TF-IDF to find similar question pairs:
python tfidf.py
Use BM25 to find similar question pairs:
python bm25.py
Use Word2Vec to find similar question pairs:
python word_embedding.py
Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.



layers. Instead of two sub-layers as in the encoder, the decoder has three: the extra one is masked multi-head attention, which prevents the model from seeing the content after the word being predicted [Vaswani et al., 2017].

Figure 2.3: The Transformer architecture, consisting of the encoder (left part) and the decoder (right part) [Vaswani et al., 2017].

As for self-attention, its aim is to calculate the relatedness between two different tokens in one sequence and find out which ones should receive more attention. For each token, the three parameters (query, key and value) are generated during training by multiplying the weights with the input embedding. The attention function is

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

where d_k is the dimension of the query and key. Multi-head attention is similar to self-attention; the only difference is that it linearly projects the queries, keys and values with learned linear projections to d_k, d_k and d_v dimensions respectively:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

where W_i^Q ∈ R^(d_model × d_k), W_i^K ∈ R^(d_model × d_k), W_i^V ∈ R^(d_model × d_v) and W^O ∈ R^(h·d_v × d_model). The benefit is that it allows the model to learn subspace representations and helps the model focus on different positions. The final embedding is not just the word

10 Background

anymore. Positional encoding is the way to inject some information about the relative or absolute position of each token, because the model uses no recurrence or convolution.
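To make the attention function concrete, here is a minimal NumPy sketch of scaled dot-product attention; the toy dimensions (3 tokens, d_k = d_v = 4) and random inputs are assumptions for illustration only.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 4): one attended vector per token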

2.2 Word Embedding

When dealing with natural language, one problem is that text is composed of words and characters. Computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether in sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by computing word embeddings for their own data sets in their own way. Word embedding methods can be separated into two groups: static word embedding and dynamic word embedding. In the remaining part, we introduce several different word embedding approaches, ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and with millions of words in the vocabulary. Meanwhile, the proposed models achieve not only that the vectors of similar words are close to each other, but also that the words can have multiple degrees of similarity.

The two introduced architectures, which can be used as part of Word2Vec to learn the word embedding, are the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM), which was introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are both fully connected [Mikolov et al., 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus, all the words are projected to the same position. Also, for the input layer, CBOW adds words from the future rather than only the words before the target, which helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model inverts the input and target. The two architectures are shown in Figure 2.4.

Figure 2.4: The two architectures that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al., 2013].
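As an illustration of the two architectures, the following sketch uses the gensim library (the gensim 4.x API is assumed here); the toy corpus is invented, and the sg flag switches between CBOW and Skip-gram.

from gensim.models import Word2Vec

# A tiny tokenised corpus; in this project the corpus would be the questions.
sentences = [
    ["patient", "has", "allergies"],
    ["allergies", "cause", "a", "swollen", "face"],
    ["flu", "shot", "and", "egg", "allergies"],
]

# sg=0 trains CBOW (predict the current word from its context);
# sg=1 trains Skip-gram (predict the surrounding words from the current word).
cbow = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=0)
skip = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)

print(cbow.wv["allergies"].shape)          # (200,)
print(skip.wv.most_similar("allergies"))   # nearest words in vector space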

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers think that although Word2Vec does well on analogy tasks, it only trains the word embedding on separate local context windows instead of on the global co-occurrence counts; this may fail to exploit the huge amount of repetition in the data. GloVe, in contrast, captures the global corpus statistics directly through its novel approach.

The main intuition of GloVe comes from the simple observation that, for encoding


some form of meaning, the ratios of word-word co-occurrence probabilities have potential. Before analyzing Figure 2.5, some notation is established. X represents the matrix of word-word co-occurrence counts, where the entry X_ij is the number of times word j occurs in the context of word i within the specific context window. X_i is the number of times any word occurs in the context of word i, i.e. X_i = Σ_k X_ik. Besides, P_ij = P(j|i) = X_ij / X_i is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam; we can

Figure 2.5: The co-occurrence probabilities of the target words ice and steam with selected context words [Pennington et al., 2014].

study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see that solid co-occurs more frequently with ice than with steam; similarly, gas co-occurs more frequently with steam than with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, whose cost function is

J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)^2

where V is the size of the vocabulary, f(X_ij) is a weighting function, w_i is a word vector, w̃_j is a context word vector, and b_i and b̃_j are biases. The weighting function has the following properties. First, f(0) = 0, for when one word does not co-occur with another word, namely X_ij = 0. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight frequent co-occurrences. To satisfy these properties and work well, the weighting function is the piecewise function

f(x) = (x / x_max)^α if x < x_max, and 1 otherwise

where x_max is 100 and α is 3/4. More detailed information can be found in Pennington et al. [2014].
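To make the weighting function concrete, here is a small sketch using the values from the paper (x_max = 100, α = 3/4); note that f(0) = 0 and that the weight is capped at 1 for frequent co-occurrences.

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Piecewise weighting: down-weights rare co-occurrences and
    # caps the influence of very frequent ones at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0

for x in [0, 1, 10, 100, 1000]:
    print(x, round(glove_weight(x), 4))   # 0.0, 0.0316, 0.1778, 1.0, 1.0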


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of the context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. This means the word vectors cannot be adjusted based on the context. To address these issues, there is a new trend from static word embedding to dynamic word embedding, which considers the word semantics in different contexts by using deep neural language models [Wang and Kuo, 2020]. In this section, we cover two state-of-the-art dynamic word embedding methods: one is ELMo and the other is BERT. Before introducing these two methods, pre-training will be briefly mentioned, since both methods use pre-training to generalize the parameters.

2.4.1 Pre-training

In order to make the word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer to use deep architectures, such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions by applying these deep architectures [Erhan et al., 2010]. Also, Erhan et al. [2010] suggest that using standard training schemes (random initialization) to train deep architectures tends to make the parameters generalize poorly, while the improvement from using a pre-training strategy is impressive for deep models. Nowadays, in different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al., 2019]. In the case of NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus with labeled or unlabeled data. Then the pre-trained model is fine-tuned on the downstream tasks with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging, since the model not only needs to consider the complex characteristics of word use, but the words also have different meanings across different linguistic contexts [Peters et al., 2018]. In 2018, Peters et al. [2018] proposed a novel method called Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show that ELMo works extremely well in practice, with up to 20% error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning each word its embedding; in other words, each token is assigned a function of the entire input sentence [Peters et al., 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word; the backward pass is similar, except it contains information about the word itself and the context after it. Each pair of forward and backward states produces an intermediate word vector, which represents the word's meaning but also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and the backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with some particular weights [Josht].

In addition, these forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers. For each


token t_k, there are 2L + 1 representations in an L-layer biLM:

R_k = { x_k^LM, h→_{k,j}^LM, h←_{k,j}^LM | j = 1, ..., L }

As for the final word representation used for downstream NLP tasks, ELMo combines all layers in R into a single vector based on some particular weights:

ELMo_k^task = E(R_k; Θ^task) = γ^task Σ_{j=0}^{L} s_j^task h_{k,j}^LM

where s^task are softmax-normalized weights, γ^task is a scalar parameter, and h_{k,j}^LM = [h→_{k,j}^LM; h←_{k,j}^LM].

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: one is the feature-based approach and the other is the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al., 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al., 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al., 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning-based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, and the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the transformer in BERT is deeply bidirectional, since it can compute the attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the transformer is given in Section 2.1.3; the only difference is that BERT only takes the encoder part. BERT adds [CLS] at the start of every sequence (pair of sentences) and uses [SEP] to separate the sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: one is the masked language model (masked LM) and the other is next sentence prediction (NSP). For masked LM, BERT masks some percentage (15%) of the input tokens randomly and then predicts these masked tokens. Although this allows us to get a bidirectional pre-trained model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the


fine-tuning step. To solve this, BERT replaces the selected token with [MASK] 80% of the time, replaces it with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another, and this is hard to capture from language modeling directly without knowing the relation between the two sentences. Thanks to NSP, the pre-trained model is very beneficial for these tasks. The pre-trained BERT model introduced here is applied to the RQE and NLI tasks in this project; how we fine-tune it is introduced in the following chapters.
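The 80/10/10 replacement scheme can be sketched as follows; this is an illustrative reimplementation rather than BERT's actual code, and the whitespace tokenization and toy vocabulary are assumptions.

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    # Pick ~15% of positions; of those, 80% become [MASK],
    # 10% become a random token, and 10% keep the original token.
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok            # the model must predict this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: leave the original token in place
    return corrupted, targets

tokens = "my urine is burning and my bladder feels full".split()
print(mask_tokens(tokens, vocab=tokens))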

2.5 Tasks

2.5.1 Recognizing Question Entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman, 2016]. Or, in plain English, it is to determine whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1
• (Unanswered) Question A: Hi, is it safe to take Allergex? I'm weeks pregnant and allergies are killing me; my face is swollen, pls.
• (Answered) Question B: Hi, my question is with regards to Allergex tablets. I have allergies and have managed to control them using Allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using Allergex preconceptionally and/or during pregnancy.
• Question A entails Question B

• Example 2
• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey.
• (Answered) Question B: I am diabetes type 2. Which is better, honey or sugar?
• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the similar (answered) questions related to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade, 2018]. This is a multilabel classification (three-class) problem: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1
• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time, and when I go to the toilet there is nothing coming out.
• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes.
• Label: entailment

• Example 2
• Question: I am experiencing hot flushes. I don't want to take HRTs. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight, or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins.
• Answer: There are healthy diets you can follow; DietDoc will be able to help you. Best wishes.
• Label: contradiction

• Example 3
• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and gain muscle. Do you advise that I use protein shakes in my weight loss goals?
• Answer: Hi there. There was a time when any exercise during pregnancy was frowned upon; it simply was not done, as it was considered harmful to the mother and the baby. This is crucial, and once it is obtained, it is advisable to keep monitoring it with the doctor at all times.
• Label: neutral

In the context of our project, this task is used to validate whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter, we introduce the work related to our purpose and methods. We also explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2, we introduce some work that uses RQE and NLI on medical data sets in order to find the relation between sentences. In Section 3.3, we demonstrate some work on how to expand and generate data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment; meanwhile, the resulting applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks (NLI, RQE and question answering), all targeting the medical domain. In this section and the next, we discuss some of the work on RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as the input to supervised machine learning approaches, then applied support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to obtain the results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods and multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Meanwhile, Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN-based models perform better than BERT-based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.



So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: the entailing examples have high lexical overlap and the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, some synthetic data similar to the original data were created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI occurred when Romanov and Shivade [2018] introduced MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish the baselines; among them, InferSent performs best compared with the other three methods. In the competition, as with the RQE task, the common approach for training is to use the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use a Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general, one expert is not enough, since the results are sometimes so subjective that they affect the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets by using weak supervision methods; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data for a specific domain


is proposed by Shen et al. [2018]. They use an unsupervised key-phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations; for example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality, and they can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model overfits more easily, affecting the accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it will violate patients' privacy.

Based on the features described above, we find Health24¹ is a good choice for construction. This is South Africa's leading consumer health website, which provides world-class information for a healthy lifestyle.

1. Health24: https://www.health24.com



Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting the Data Set

For collecting the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and other users' answers.

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. From the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since other words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (the age is left blank), and we convert "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see the average length of the three different types is close: they are all around 85 to 95 words per sentence. However, the maximum length differs hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy"; for these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2. Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1: The length of questions and answers.


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained as well. We also mention the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g., expand BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions, as sketched below. We first convert all words to lowercase. Then we remove any punctuation. Following that, we tokenize the words to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
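A minimal sketch of this pipeline, assuming NLTK (one of the packages listed in Appendix 3) with its punkt and stopwords resources downloaded; the example question is invented.

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question):
    # 1) lowercase, 2) strip punctuation, 3) tokenize,
    # 4) remove stop words, 5) stem each remaining token.
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    return [stemmer.stem(t) for t in word_tokenize(text)
            if t not in stop_words]

print(preprocess("Is it safe to take Allergex while pregnant?"))
# e.g. ['safe', 'take', 'allergex', 'pregnant']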

5.2 Fine-Tuning

For downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence. So, when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both



classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do the fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.
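The actual fine-tuning code is the modified run_classifier.py mentioned above; purely as a hedged sketch of the same configuration with the current transformers API (the model identifier allenai/scibert_scivocab_uncased and the training-loop details are assumptions), one training step could look like this:

import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

BATCH_SIZE, GRAD_ACCUM_STEPS, LR = 16, 4, 2e-5   # values reported above

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 2 for RQE, 3 for NLI
optimizer = AdamW(model.parameters(), lr=LR)

def training_step(step, questions_a, questions_b, labels):
    # BERT-style sentence-pair input: [CLS] A [SEP] B [SEP]
    enc = tokenizer(questions_a, questions_b, padding=True,
                    truncation=True, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor(labels)).loss
    (loss / GRAD_ACCUM_STEPS).backward()        # accumulate gradients
    if (step + 1) % GRAD_ACCUM_STEPS == 0:      # effective batch = 16 x 4
        optimizer.step()
        optimizer.zero_grad()
    return loss.item()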

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from the corresponding unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for a second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance

1. Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQ data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQ data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the testing set is to pair one unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not all of them are similar. To reduce the running time as well as the time complexity and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if an unanswered question and an answered question are similar, then the answer of the answered question has some probability of being inferred from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned models to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively, as sketched below. This method helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the absence of experts, the rejected data are removed from the data set.
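A small sketch of this rejection rule; the probability vectors are invented examples.

import numpy as np

def assign_label(probs, threshold):
    # Keep the arg-max label only if its probability clears the
    # threshold; otherwise reject the pair to keep noise out.
    probs = np.asarray(probs)
    return int(probs.argmax()) if probs.max() >= threshold else None

print(assign_label([0.72, 0.28], threshold=0.6))         # RQE: accept label 0
print(assign_label([0.35, 0.33, 0.32], threshold=0.4))   # NLI: reject -> None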

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document. The other


one is inverse document frequency, which calculates how much information the word provides; if the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count(d ∈ D : t ∈ d) )

where t is the word we want to score, d is its document, D is the set of documents and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.
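As a sketch of this filtering step (the project's own tfidf.py is based on a gensim implementation, see Appendix 2; scikit-learn is used here only for brevity, and the questions are toy examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["allergex tablets and the dangers of use during pregnancy",
            "is honey better than sugar for type 2 diabetes"]
unanswered = ["is it safe to take allergex while pregnant"]

vectorizer = TfidfVectorizer()
A = vectorizer.fit_transform(answered)    # one TF-IDF vector per question
B = vectorizer.transform(unanswered)

scores = cosine_similarity(B, A)[0]       # similarity to every answered question
top5 = scores.argsort()[::-1][:5]         # keep only the top-five candidates
print(list(top5), scores[top5])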

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular document or query. Here is the formula (a sketch in code follows the notation list):

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i, D) · ( f(q_i, D) · (k1 + 1) ) / ( f(q_i, D) + k1 · (1 − b + b · |D| / d_avg) )

where

• D is the document and Q is the query; q_i is a term in the query Q;
• f(q_i, D) is the number of times q_i occurs in the document D;
• |D| is the length of the document;
• d_avg is the average length of a document;
• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75.
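A self-contained sketch of this scoring function; the smoothed IDF used here is the common Okapi variant and is an assumption, since the formula above does not pin down IDF. The toy documents are invented.

import math
from collections import Counter

def bm25(query, doc, docs, k1=2.0, b=0.75):
    # query/doc: lists of tokens; docs: the tokenised collection,
    # used for document frequencies and the average length d_avg.
    N = len(docs)
    d_avg = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in docs if q in d)            # document frequency
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = tf[q]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

docs = [["honey", "sugar", "diabetes"], ["allergex", "pregnancy", "allergies"]]
print(bm25(["allergex", "pregnant"], docs[1], docs))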

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury,


2018]. For each sentence, we add all the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences, as sketched below.
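A sketch of this step; the file name pubmed_w2v_200d.bin is a hypothetical placeholder for the Chiu et al. [2016] vectors in word2vec binary format. Note that out-of-vocabulary words are simply skipped, which is exactly the mismatch discussed in Chapter 6.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained 200-dimensional PubMed vectors.
wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the embeddings of in-vocabulary words into one sentence vector.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

s1 = sentence_vector(["bladder", "infection", "pain"])
s2 = sentence_vector(["burning", "urine", "bladder"])
print(cosine(s1, s2))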

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall and the confusion matrix to assess the performance of the models. A confusion matrix is a table that describes the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances belong to the positive class; FP means the true class is 0 and the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN means the instances belong to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). For multi-class classification, the confusion matrix is more complex, but it has the same idea.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix.
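These metrics can be computed with scikit-learn, as in the sketch below (toy labels); note that scikit-learn's confusion_matrix puts the actual class on the rows and the predicted class on the columns, i.e. transposed relative to Table 5.1.

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # invented ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1]   # invented predictions

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total = 5/6
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 3/4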


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper part is RQE and the bottom part is NLI. For the RQE task: step 1, use the CHQ data to fine-tune the BERT model; step 2, use the CHQ data to evaluate the model by using k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate labels for pairs of unanswered questions and potentially similar questions; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to the CHQ data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the input for testing is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy, and the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = (x̄_1 − x̄_2) / sqrt( s² (1/n_1 + 1/n_2) )

where

s² = ( Σ_{i=1}^{n_1} (x_i − x̄_1)² + Σ_{j=1}^{n_2} (x_j − x̄_2)² ) / (n_1 + n_2 − 2)
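As a sketch, the p-value for the TF-IDF column can be recomputed with SciPy from the accuracies in Table 6.1; SciPy's ttest_ind reports a two-tailed p-value, so small differences from the values in Table 6.2 are to be expected.

from scipy import stats

# Five-fold accuracies from Table 6.1.
chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

t, p = stats.ttest_ind(chqs_tfidf, chqs)   # unpaired (independent) t-test
print(round(t, 4), round(p, 4))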

From Table 6.2, we can see only the results of TF-IDF and BM25 are less than 0.05. In this case, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic the way of pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model which we use in Word2Vec was created by training on PubMed. The source in PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all



formal, with fewer typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority label the RQE model generates is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec are the smallest. Furthermore, the small amount of added data might not help the model fine-tune as well as a larger amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not have much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9730          0.9720        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

           Added data   p-value
TF-IDF     8084         0.0380
BM25       7390         0.0499
Word2Vec   1830         0.0514

Table 6.2: The number of added pairs and the p-values for the RQE task.

Table 6.4 shows the number of added pairs and the p-values for the NLI task. We find that all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task more than 8,000 pairs were added. Two conditions apply when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. Here we use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of pairs per class when generating added data for the NLI task. Moreover, NLI is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods.

Combining Table 6.4 with Table 6.5, we find that the more data are added, the larger the standard deviation becomes. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest set contains only about 1,287 pairs). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing higher variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we obtain five accuracy values for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods for the NLI task.

           added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: Added data numbers and p-values for the NLI task.


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them to the biomedical domain, such as patient question answering, poses many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data to train the model again; if the accuracy increases, it reflects that the constructed data have high accuracy, and otherwise they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find the proposed method can construct high-quality question pairs with the BM25 and


TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, the added data still play a role.

7.2 Future Work

Because of the limited time of the project, there are many aspects that could be improved:

• For this project we only have 1,950 questions that do not have answers. After the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit on a small data set. So as a next step we can expand our data; it will then be more persuasive to validate our idea.

• In the previous section we argued that the data added for the NLI task differ from the original NLI data. As a next step we can find a more suitable data set with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step we can optimize the way Word2Vec is used.

• Fine-tuning and label generation are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step we can do hyperparameter optimization. We can also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For the folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 is the tree structure of the health24 folder. items.py, middlewares.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data store the source data.
result stores the labels generated by the RQE and NLI test models.

For the .py files:
bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface [3].

load_file.py converts the raw source data to a dataframe.
pre_processed_data.py pre-processes the sentences.

For the .ipynb files:

1. Scrapy: https://scrapy.org
2. ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3. run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is used to fine-tune the BERT model and do k-fold cross-validation; the latter is used to generate labels for the test sets with the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab [4].

4. Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


anymore. Positional encoding is the way to inject information about the relative or absolute position of each token, because the model uses neither recurrence nor convolution.

2.2 Word Embedding

Dealing with natural language is always problematic, since text is composed of words and characters. However, computers and machine learning models cannot read and understand these symbols in the human sense; the only thing they can accept is numerical values.

Ideally, when we convert these texts into numerical representations, we want them to be semantically meaningful. In other words, the numerical values should capture the linguistic meaning of the words as much as possible, since an informative and well-chosen numerical representation method can impact model performance. Word embedding is the dominant approach to this problem. It is so pervasive that most NLP projects use it, whether for sentiment analysis or text classification. In general, when starting an NLP project, people usually begin by downloading a pre-trained embedding or by computing their own word embeddings for their own data sets. Word embedding methods can be separated into two groups: static word embedding and dynamic word embedding. In the remaining part we introduce several word embedding approaches, ordered by their complexity.

2.3 Static Word Embedding

Static word embedding has the feature that it generates the same embedding vector for the same word, no matter what the context is.

2.3.1 Word2Vec

Mikolov et al. [2013] introduce two novel architectures to represent words as continuous vectors. They learn high-quality word vectors from huge data sets with billions of words and millions of words in the vocabulary. With the proposed models, not only are the vectors of similar words close to each other, but the words can also exhibit multiple degrees of similarity.

Two architectures are introduced which can be used as part of Word2Vec to learn the word embedding: the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram). The former learns the embedding by predicting the current word based on its context, and the latter learns by predicting the surrounding words. These two models are both similar to the feed-forward neural net language model (NNLM), which is introduced by Bengio et al. [2003].


However, one of the biggest problems of the NNLM is its high computational complexity, especially between the projection and hidden layers, since they are fully connected [Mikolov et al. 2013]. To address this issue, Word2Vec makes some improvements. As for CBOW, its target is to predict the current word based on the context. Mikolov et al. [2013] remove the hidden layer, and the projection layer is shared for all the words; thus all the words are projected to the same position. Also, for the input layer, CBOW adds words from the future rather than only words before the target. This helps the model predict based on the full context. As for Skip-gram, its mechanism is the opposite of CBOW: it predicts the surrounding words given the current word, so the model swaps the input and target. The figure is in Figure 2.4.

Figure 2.4: Two different models that can be used as part of Word2Vec. The left-hand side is CBOW and the right-hand side is Skip-gram [Mikolov et al. 2013].
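Both architectures are available off the shelf; below is a toy sketch with the gensim library (version 4 API), where the corpus and hyperparameters are illustrative and real training needs millions of sentences. The sg flag switches between CBOW and Skip-gram:

from gensim.models import Word2Vec

corpus = [["patient", "asks", "about", "allergy", "medication"],
          ["doctor", "answers", "question", "about", "pregnancy"]]

cbow = Word2Vec(corpus, vector_size=200, window=5, sg=0, min_count=1)      # CBOW
skipgram = Word2Vec(corpus, vector_size=200, window=5, sg=1, min_count=1)  # Skip-gram
print(cbow.wv.most_similar("allergy", topn=3))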

2.3.2 GloVe

Another word embedding method is Global Vectors (GloVe), introduced by Pennington et al. [2014]. The researchers observe that although Word2Vec does well on analogy tasks, it only trains the word embedding on separate local context windows instead of on global co-occurrence counts, which can fail to exploit the huge amount of repetition in the data. GloVe, by contrast, captures global corpus statistics directly.

The main intuition behind GloVe is the simple observation that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Before analyzing Figure 2.5, some notation is established. $X$ represents the matrix of word-word co-occurrence counts, where the entry $X_{ij}$ is the number of times word j occurs in the context of word i within the specified context windows. $X_i$ is the number of times any word occurs in the context of word i, i.e., $X_i = \sum_k X_{ik}$. Besides, $P_{ij} = P(j|i) = X_{ij}/X_i$ is the probability that word j appears in the context of word i. In Figure 2.5, suppose i = ice and j = steam.

Figure 2.5: The co-occurrence probabilities for the target words ice and steam with selected context words [Pennington et al. 2014].

We can study the ratio of their co-occurrence probabilities with different context words k to examine the relationship of these words. We can see solid co-occurs more frequently with ice than it does with steam; similarly, gas co-occurs more frequently with steam than it does with ice. Based on these observations, GloVe proposes a new weighted least squares regression model, with the cost function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where $V$ is the size of the vocabulary, $f(X_{ij})$ is a weighting function, $w_i \in \mathbb{R}^d$ is a word vector, $\tilde{w}_j \in \mathbb{R}^d$ is a context word vector, and $b_i$ and $\tilde{b}_j$ are biases. The weighting function has the following properties. First, $f(0) = 0$: if one word does not co-occur with another word, namely $X_{ij} = 0$, the pair contributes nothing. Second, the function should be non-decreasing, in order to guarantee that the weight of frequent co-occurrences is higher than the weight of rare co-occurrences. Third, it should not overweight very frequent co-occurrences. A piecewise function satisfying these properties and working well in practice is

$$f(x) = \begin{cases} (x/x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

where $x_{\max}$ is 100 and $\alpha$ is 3/4. More detailed information can be found in Pennington et al. [2014].


2.4 Dynamic Word Embedding

Although static word embedding methods are widely used in NLP, they fail to capture polysemy (for example, does the word 'bank' mean a financial institution or the land alongside a river?), since they generate the same word vectors regardless of context. Besides, these traditional word embedding methods usually use shallow models and only leverage the output of unsupervised models for downstream tasks, whereas the models themselves are usually discarded after training. As a result, the word vectors cannot be adjusted based on context. To address these issues, there is a new trend of going from static word embedding to dynamic word embedding. Dynamic word embedding considers word semantics in different contexts by using deep neural language models [Wang and Kuo 2020]. In this section we cover two state-of-the-art dynamic word embedding methods: ELMo and BERT. Before introducing them, pre-training is briefly discussed, since both methods use pre-training to generalize their parameters.

2.4.1 Pre-training

In order to make word vectors meaningful and capture as much high-level information as possible, more and more researchers prefer deep architectures such as neural networks with many hidden layers. The reason is that complicated functions can represent high-level abstractions through these deep architectures [Erhan et al. 2010]. Erhan et al. [2010] also suggest that training deep architectures with standard schemes (random initialization) tends to give the parameters poor generalization, while the improvement from a pre-training strategy is impressive. Nowadays, across different types of natural language processing tasks, language model pre-training has advanced the state of the art substantially [Dong et al. 2019]. In NLP, the basic idea of pre-training is that a language model learns contextualized text representations from a large corpus with labeled or unlabeled data. The pre-trained model is then fine-tuned on the downstream task with a small amount of labeled data.

2.4.2 ELMo

Learning high-quality word representations is challenging: the model needs to consider the complex characteristics of word use, and words have different meanings across different linguistic contexts [Peters et al. 2018]. In 2018, Peters et al. [2018] proposed Embeddings from Language Models (ELMo) to address these challenges and generate high-quality word representations. The experiments show that ELMo works extremely well in practice, with up to 20% error reduction compared with other state-of-the-art methods.


Instead of using a fixed embedding vector for each word, ELMo considers the entire sentence before assigning each word its embedding. In other words, each token is assigned a function of the entire input sentence [Peters et al. 2018]. ELMo consists of a two-layer bidirectional language model (biLM), so it is a way of embedding from language models. The two layers are stacked together, and each of them has two passes: forward and backward. In the forward pass, the internal state of a certain word contains information about itself and the context before the word. The backward pass is similar, except the state contains information about the word itself and the context after it. The pair of forward and backward states produces an intermediate word vector. This intermediate word vector represents the word's meaning, but it also 'knows' what happens in the rest of the sentence.

Figure 2.6: ELMo is formed by a two-layer bidirectional language model. The forward and backward passes form a multilayer LSTM. Each LSTM passes its result to an intermediate word vector. The word representation is obtained by summing all the intermediate word vectors with particular weights [Josht].

In addition, the forward and backward passes form a multilayer LSTM, as we can see in Figure 2.6. The intermediate word vectors produced by the lower layers are fed into the higher layers. Basic syntactic information is captured in the lower layers, while semantic information is captured in the higher layers.


For each token $t_k$, an $L$-layer biLM computes $2L + 1$ representations:

$$R_k = \left\{ x_k^{LM},\; \overrightarrow{h}_{k,j}^{LM},\; \overleftarrow{h}_{k,j}^{LM} \;\middle|\; j = 1, \dots, L \right\}$$

For the final word representation used in downstream NLP tasks, ELMo collapses all layers in $R_k$ into a single vector with particular weights:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}$$

where $s^{task}$ are softmax-normalized weights, $\gamma^{task}$ is a scalar parameter, and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$.
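The weighted combination above is easy to picture with a small NumPy sketch; the layer vectors here are random stand-ins for the biLM outputs, and the dimension is illustrative:

import numpy as np

L, dim = 2, 1024                                   # two biLM layers, as in ELMo
h = [np.random.rand(dim) for _ in range(L + 1)]    # x_k^LM plus the L layer outputs
s = np.exp(np.random.rand(L + 1))
s /= s.sum()                                       # softmax-normalized weights s_j
gamma = 1.0                                        # task-specific scalar
elmo_k = gamma * sum(s[j] * h[j] for j in range(L + 1))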

2.4.3 BERT

Currently there are two approaches to applying pre-trained language representations to downstream tasks: the feature-based approach and the fine-tuning approach. In the feature-based approach, the pre-trained representations are included as additional features [Peters et al. 2018]. In the fine-tuning approach, the pre-trained model's parameters are unfrozen and fine-tuned on the downstream tasks [Peters et al. 2019]. Both approaches share the same objective functions during pre-training, and they have similar performance, except that a general-purpose representation can be adapted to many different downstream tasks in the fine-tuning approach [Peters et al. 2019]. Instead of using unidirectional language models, Devlin et al. [2018] propose BERT, which stands for Bidirectional Encoder Representations from Transformers, to improve the fine-tuning based approach. Although ELMo and BERT are both bidirectional, the LSTM in ELMo is not deeply bidirectional, since it only concatenates the left-to-right and right-to-left information, so the representation still cannot simultaneously take advantage of both left and right contexts. On the contrary, the transformer in BERT is deeply bidirectional, since it can compute attention over the entire sequence for every word (a word's meaning is defined by its context in the sentence).

BERT's architecture is a multi-layer bidirectional Transformer encoder. Detailed information about the transformer is given in Section 2.1.3; the only difference is that BERT takes only the encoder part. BERT prepends a [CLS] token to every sequence (pair of sentences) and uses [SEP] to separate sentences. For each given token, its input representation is the sum of the token embedding, the segment embedding and the position embedding.

Devlin et al. [2018] propose two unsupervised tasks for pre-training: the masked language model (masked LM) and next sentence prediction (NSP). For the masked LM, BERT masks some percentage (15%) of the input tokens at random and then predicts these masked tokens. Although this allows training a bidirectional model, a downside exists: a mismatch is created between pre-training and fine-tuning, because the [MASK] token does not appear in the fine-tuning step.


To solve this, BERT replaces the selected token with [MASK] 80% of the time, with a random token 10% of the time, and keeps the original token the other 10% of the time. On the other hand, BERT pre-trains on a binarized NSP task in order to help the model understand the relationship between two sentences. The reason is that many downstream tasks, such as natural language inference, aim to determine whether one sentence can be inferred from another; this is hard to capture from language modeling directly without understanding the relation between the two sentences. With NSP, the pre-trained model is very beneficial for these tasks. The introduced pre-trained BERT model is applied to the RQE and NLI tasks in this project. How to fine-tune it is introduced in the following chapters.
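As a small illustration of the 80/10/10 rule described above (a sketch of the masking strategy, not BERT's actual implementation; vocab is a placeholder list of vocabulary tokens):

import random

def mask_selected_token(token, vocab):
    # Applied only to the 15% of input tokens selected for prediction
    r = random.random()
    if r < 0.8:
        return "[MASK]"              # 80%: replace with the [MASK] token
    elif r < 0.9:
        return random.choice(vocab)  # 10%: replace with a random token
    else:
        return token                 # 10%: keep the original token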

2.5 Tasks

2.5.1 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) captures a relation between two questions, expressing the fact that "a question A entails a question B if every answer to B is also a complete or partial answer to A" [Abacha and Demner-Fushman 2016]. In plain English, the task is to determine whether question A and question B are similar questions. This task is binary classification: positive (the two questions are similar) and negative (the two questions are not similar). Here are two examples from our data set:

• Example 1

• (Unanswered) Question A: Hi, is it safe to take allergex? I'm weeks pregnant and allergies are killing me, my face is swollen pls

• (Answered) Question B: Hi, my question is with regards to allergex tablets. I have allergies and have managed to control them using allergex tablets purchased over the counter. I am wanting to fall pregnant soon and want to know the dangers of using allergex preconceptionally and or during pregnancy

• Question A entails Question B

• Example 2

• (Unanswered) Question A: Should I take my child for a flu shot? She is allergic to eggs, fish and honey

• (Answered) Question B: I am diabetes 2 Type. Which is better Honey or sugar

• Question A does not entail Question B

RQE is the first step in our project; using it here helps us find the answered questions that are similar to the unanswered questions. This step also provides useful information for the subsequent steps.


2.5.2 Natural Language Inference (NLI)

Natural language inference (NLI) determines the relationship between two sentences, not just questions. The task is to predict whether a hypothesis (H) can be inferred (entailment), not inferred (contradiction), or neither (neutral) from a given premise (P) [Romanov and Shivade 2018]. This is a multilabel classification (three classes) problem: entailment, neutral and contradiction. Here are three examples from our data set:

• Example 1

• Question: Ever since this morning I had pain in my bladder and my urine is burning. It feels like my bladder is full the whole time and when I go to the toilet there is nothing coming out

• Answer: It sounds like you may need something more. See your doctor to check for a bladder infection. Best wishes

• Label: entailment

• Example 2

• Question: I am experiencing hot flushes. I don't want to take HRT's. I enjoy going to the gym, try to eat healthy and drink a lot of water every day. I am very concerned that by taking Menograine or the like I will gain weight or it will prevent me from losing or maintaining my weight. I do not take any other medication other than vitamins

• Answer: There are health diets you can follow. DietDoc will be able to help you. Best wishes

• Label: contradiction

• Example 3

• Question: I have noticed that my friends are using protein shakes and other supplements to lose weight and getting muscles. Do you advise that i use protein shakes in my weight loss goals

• Answer: Hi there. There was a time where any exercise during pregnancy was frowned upon, it simply was not done, it was considered harmful to the mother and the baby. This is crucial and once its obtained its advisable to keep monitoring it with the doctor at all times

• Label: neutral

In the context of our project, this task is used to validate whether the answers of the similar answered questions can be inferred from the unanswered questions, in order to help construct high-accuracy question-answer pairs for data-driven models in the domain of patient question answering.


Chapter 3

Related Work

In this chapter we introduce work related to our purpose and methods. We also explain the architectures and techniques involved, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that uses RQE and NLI on medical data sets in order to find relations between sentences. In Section 3.3 we describe work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al. 2019] was organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques for inference and entailment in the medical domain. Such methods can help improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), targeted at the medical domain. In this section and the next, we discuss work on RQE and NLI among the competition teams.

The initial work on RQE used lexical and semantic features as input to supervised machine learning approaches, applying support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to obtain results [Abacha and Demner-Fushman 2016]. In the competition, the highlight was that deep networks reach better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al. 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al. 2019]; [Zhu et al. 2019]) and the BERT family. Trying different traditional deep learning methods, Zhu et al. [2019] find that the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map words into vectors, and these embeddings do not consider the context. Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN based models perform better than BERT based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism on top of BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models. A mismatch exists between the training and validation sets in the competition data: for example, the entailing examples have high lexical overlap while the non-entailing examples are completely unrelated, so they cannot serve as strong negative examples in the training set. To address this issue, synthetic data similar to the original data were created [Kumar et al. 2019]. In addition, Bhaskar et al. [2019] also expanded the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this domain came when Romanov and Shivade [2018] introduced MedNLI, a new public data set, to reduce the gap for specialized and knowledge-intensive domains. They used a feature-based system, a Bag-of-Words model (BOW), an InferSent model and an ESIM model to establish baselines; among them, InferSent performed best. In the competition, as in the RQE task, the common approach was to train models from the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al. 2019]. On the other hand, Wu et al. [2019] proposed a hybrid approach for NLI. Their basic model consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use a Tree-LSTM to obtain a better linguistic understanding. For the feature encoder they concatenate the results of a domain feature extractor and a generic feature extractor. This model achieved significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al. 2019] is a novel way to label data using weak supervision; the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of the approach is that users first write labeling functions for the unlabeled data. Several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. The accuracies are then used for re-weighting, and probabilistic labels are produced from the combined labels.


On the other hand, a novel way to generate high-quality data in a specific domain is proposed by Shen et al. [2018]. They use an unsupervised key phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we explain the motivation for choosing the data set and demonstrate how the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is little research in the area of patient question answering, and the existing medical query data sets have some limitations. For example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it sets the standard in this domain and makes the work easier to trust. A reliable data set is also crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients, instead of being found or extracted from professional materials; meanwhile, the answers should come from certified doctors. In this way the constructed data have good quality, and they can be used in data-driven approaches.

• The data set should be large. As mentioned before, if the data set is moderate or small, the model can easily overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses or phone numbers; otherwise it would violate patients' privacy.

Based on the features described above, we find Health24 [1] is a good choice for construction. This is South Africa's leading consumer health website.

1. Health24: https://www.health24.com


It provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients which helps answer their questions. These questions and answers are both public, and they are continuously updated. Hence we use these questions and answers to construct our data set.

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy [2], a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; some mismatch or misleading content may exist between questions and other users' answers.
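A hypothetical minimal spider in the spirit of the project's health.py crawler is sketched below; the start URL and CSS selectors are placeholders, not Health24's actual page structure:

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/Experts"]   # illustrative entry point

    def parse(self, response):
        for qa in response.css("div.question"):         # placeholder selectors
            yield {
                "question": qa.css("p.body::text").get(),
                "answer": qa.css("div.expert-reply::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:                                    # follow pagination
            yield response.follow(next_page, callback=self.parse)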

4.3 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file the first column is the answer and the second column is the question. From the crawled data we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words might affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (age blanked), and converting "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together, all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy"; for these answers we treat their questions as unanswered, since the experts' replies cannot solve the patients' queries.

2. Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1: The length of questions and answers.


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. Detailed information about fine-tuning and the three scoring methods for filtering similar questions is given as well. We also describe the evaluation metrics.

5.1 Pre-processing

The sentences contain many medical concepts, which may appear either as abbreviations or as multi-word expressions that stand for the same meaning. To expand abbreviations and give the sentences as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al. 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase; then we remove punctuation. Following that, we tokenize the words in order to split the sentences into individual units. We also remove stop words, and we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks. A minimal sketch of this pipeline is given below.
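The sketch assumes NLTK with the punkt and stopwords data packages installed; the abbreviation-expansion step with ScispaCy is omitted:

import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(question):
    text = question.lower()                                           # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    tokens = word_tokenize(text)                                      # tokenize
    tokens = [t for t in tokens if t not in stop_words]               # remove stop words
    return [stemmer.stem(t) for t in tokens]                          # stem

print(preprocess("Is it safe to take Allergex during pregnancy?"))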

5.2 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al. 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing classification, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the word representations together.


Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al. 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al. 2019]. When fine-tuning the pre-trained model, we follow the suggestions of Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface [1].
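An illustrative fine-tuning setup with these hyperparameters is sketched below, using the HuggingFace transformers library and the public SciBERT checkpoint; this is not the exact project code, and the training loop around it is omitted:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)   # 2 labels for RQE, 3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(questions_a, questions_b, labels, accum_steps=4):
    # Sentence pairs are joined with [SEP] internally by the tokenizer
    enc = tokenizer(questions_a, questions_b, padding=True,
                    truncation=True, return_tensors="pt")
    out = model(**enc, labels=torch.tensor(labels))
    (out.loss / accum_steps).backward()   # gradient accumulation over 4 steps
    return out.loss.item()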

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether an answered question is similar to an unanswered question in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar question's answer can be inferred from the unanswered question. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in information retrieval. Under the assumption that the top-ranked documents contain many useful terms which help discriminate relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round.

1. Huggingface GitHub: https://github.com/huggingface/transformers


Through this step the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al. 2008]. In our project we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set, whose inputs are the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the testing set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, that gives 1,000 question pairs, yet not all of them are similar. To reduce the running time and time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs per unanswered question (see the sketch after this paragraph). This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. The NLI task follows similar steps, under one assumption: if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferred from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.
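A sketch of this test-set construction step, with score_fn standing in for any of the three scorers of Section 5.4:

def build_test_pairs(unanswered, answered, score_fn, k=5):
    # Keep only the k highest-scoring answered questions per unanswered one,
    # instead of pairing each unanswered question with all answered questions.
    pairs = []
    for uq in unanswered:
        ranked = sorted(answered, key=lambda aq: score_fn(uq, aq), reverse=True)
        pairs.extend((uq, aq) for aq in ranked[:k])
    return pairs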

When the fine-tuned model generates labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of simply taking the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise it is accepted. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
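The rejection rule can be sketched as follows, where logits stands for the model's raw output scores for one pair (0.6 applies to RQE and 0.4 to NLI):

import numpy as np

def label_or_reject(logits, threshold):
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the classes
    if probs.max() < threshold:
        return None                                 # rejected: removed from the data
    return int(probs.argmax())                      # accepted: keep the label

print(label_or_reject(np.array([2.1, 0.3]), threshold=0.6))       # RQE (binary)
print(label_or_reject(np.array([1.0, 0.9, 1.1]), threshold=0.4))  # NLI (three-label)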

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document.

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count({d ∈ D : t ∈ d}) )

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF measures only one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF value.
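As a sketch, scikit-learn bundles both steps; note that its TfidfVectorizer uses a smoothed variant of the IDF above, and the two example questions are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = ["is it safe to take allergex while pregnant",
             "what are the dangers of using allergex during pregnancy"]
# Fit one vocabulary over both questions so the vectors are comparable.
vectors = TfidfVectorizer().fit_transform(questions)
# Cosine similarity between the two sentence-level TF-IDF vectors.
score = cosine_similarity(vectors[0], vectors[1])[0, 0]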

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al. 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(qi) · [ f(qi, D) · (k1 + 1) ] / [ f(qi, D) + k1 · (1 − b + b · |D| / davg) ]

where:

• D is the document and Q is the query; qi is a term in the query Q.

• f(qi, D) is the number of times qi occurs in the document D.

• |D| is the length of the document.

• davg is the average length of a document in the collection.

• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75.
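The formula translates almost directly into code. Below is a minimal sketch over pre-tokenized documents; the IDF term uses the common Robertson-Spärck Jones form, which is one standard choice and an assumption here:

import math

def bm25_score(query, doc, docs, k1=2.0, b=0.75):
    # query and doc are token lists; docs is the whole tokenized collection.
    d_avg = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for q in set(query):
        f = doc.count(q)                       # f(q_i, D)
        n_q = sum(1 for d in docs if q in d)   # number of documents containing q_i
        idf = math.log(1 + (len(docs) - n_q + 0.5) / (n_q + 0.5))
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score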

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the embedding vectors of all its words together and then use cosine similarity to calculate the similarity between sentences.
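A minimal sketch of this sentence-level comparison with gensim is shown below; the embedding file name is a placeholder for the Chiu et al. [2016] vectors, and out-of-vocabulary words (the mismatch discussed in Chapter 6) are simply skipped:

import numpy as np
from gensim.models import KeyedVectors

# Placeholder path for the 200-dimensional PubMed vectors in word2vec format.
wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of the in-vocabulary words to get one sentence vector.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))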

5.5 Evaluation Metrics

For evaluation we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instance is correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN means the instance is correctly assigned to the negative class. Accuracy measures how often the classifier is correct: accuracy = (TP + TN) / total. Recall measures the percentage of relevant results correctly classified: recall = TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but it follows the same idea.

                          Actual Positive (1)    Actual Negative (0)
Predicted Positive (1)    TP                     FP
Predicted Negative (0)    FN                     TN

Table 5.1: The binary confusion matrix.
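For instance, with scikit-learn these quantities can be computed directly (a small sketch with hypothetical labels):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]   # hypothetical gold labels
y_pred = [1, 0, 0, 1, 0, 1]   # hypothetical model predictions

# For binary labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)   # (TP + TN) / total
recall = recall_score(y_true, y_pred)       # TP / (TP + FN)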


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar answered question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, a t-test needs to be calculated. Here we use the unpaired t-test, since we test on a different data set in each iteration. Below is the formula of the unpaired t-test:

t = (x̄1 − x̄2) / sqrt( s² · (1/n1 + 1/n2) )

where

s² = [ Σ_{i=1}^{n1} (xi − x̄1)² + Σ_{j=1}^{n2} (xj − x̄2)² ] / (n1 + n2 − 2)
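A minimal sketch of this test with SciPy, using the CHQs and CHQs + TF-IDF accuracies from Table 6.1, is shown below; halving SciPy's two-sided p-value gives the one-sided (directional) value, which appears to be what Table 6.2 reports:

from scipy import stats

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # Table 6.1, CHQs
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]   # Table 6.1, CHQs + TF-IDF

# Unpaired two-sample t-test with pooled variance, as in the formula above.
t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
p_one_sided = p_two_sided / 2   # directional alternative: accuracy improved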

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model which we use for Word2Vec was trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a larger amount would.

On the other hand, recall measures the percentage of relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9730          0.9720        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

           Number of added data   p-value
TF-IDF     8084                   0.0380
BM25       7390                   0.0499
Word2Vec   1830                   0.0514

Table 6.2: The number of added data and the p-values for the RQE task.

Table 6.4 shows the number of added data and the p-values for the NLI task. We find all the p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs were added. There are two conditions on how we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true (entailing) question pairs from the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting the added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating the added data for the NLI task; a sketch of this balancing step follows below. On the other hand, the NLI task is a three-label classification problem, so compared with a binary classification problem, each class has fewer data.
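A minimal sketch of this balancing step (a hypothetical helper, with labels 0/1/2 as in the example above):

import random
from collections import defaultdict

def balance_by_smallest_class(labelled_pairs):
    # labelled_pairs: list of (question, answer, label) tuples.
    by_label = defaultdict(list)
    for item in labelled_pairs:
        by_label[item[2]].append(item)
    n = min(len(items) for items in by_label.values())   # smallest class size
    balanced = []
    for items in by_label.values():
        balanced.extend(random.sample(items, n))          # take n per class
    return balanced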

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger this value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs, so the two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287), so adding the extra data to the original data effectively adds novel data, or noise, for the model. The estimator's ability is affected, which causes high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. Such a small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task.

           Number of added data   p-value
TF-IDF     1287                   0.3340
BM25       1149                   0.4191
Word2Vec   459                    0.6661

Table 6.4: The number of added data and the p-values for the NLI task.


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; within it, there are 23755 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. The method is more limited in the NLI task because of the small size of the added data; however, it still plays a role there.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project we only have 1950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on a small data set. So, as a next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous section we argued that the added data for the NLI task differ in kind from the original NLI data. As a next step, we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we can optimize the way Word2Vec is used.

• Fine-tuning and label generation are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we can do hyperparameter optimization; we can also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for these projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy 1.
raw_nli_data and raw_rqe_data: store the source data.
result: stores the labels generated by using the RQE and NLI test models.

For .py files:
bm25.py, tfidf.py and word_embedding.py are three files that filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf 2.
rqe_data_generator.py and nli_data_generator.py generate suitable data frames for fine-tuning and testing the BERT model.
convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model; these codes are from run_classifier.py in huggingface 3.
load_file.py converts the raw source data to a data frame; pre_processed_data pre-processes the sentences.

For .ipynb files:
train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab 4.

Figure 1: The tree structure of the health24 folder.

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip52236464/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

For this project, there are two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating, and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.


sect23 Static Word Embedding 11

However one of the biggest problems of NNLM is that it has high computationalcomplexity especially between the projection and hidden layers since they are bothfully connected [Mikolov et al 2013] To address this issue Word2Vec makes someimprovements As for CBOW its target is to predict the current word based on thecontext Mikolov et al [2013] remove the hidden layer and the projection layer isshared for all the words Thus all the words are projected in the same positionAlso for the input layer CBOW adds the words from the future rather than onlyadding the words before the target This could help the model to predict basedon the context As for Skip-gram its mechanism is opposite of CBOW which is topredict the surrounding words by given the current word So the model inverts theinput and target The figure is in Figure 24

Figure 24 Two different models that could be used as the part of Word2Vec Theleft hand side is CBOW and the right hand side is Skip-gram [Mikolov et al 2013]

232 GloVe

Another word embedding method is Global Vectors (GloVe) which is introduced byPennington et al [2014] The researchers think although Word2Vec does the well jobon analogy tasks it only trains the word embedding on the separate local contextwindows instead of on the global co-occurrence counts This way may fail to use ahuge amount of repetition in the data As for GloVe the global corpus statistics canbe captured directly by using its novel way

The main intuition of GloVe is through the simple observation that for encoding

12 Background

some form of the meaning the ratios of word-word co-occurrence probabilities havethe potential Before analyzing Figure 25 some notations are established X repre-sents the matrix of word-word co-occurrence counts where the entity Xij means thenumber of times word j occurs in the context of word i within the specific contextwindows Xi means the number of times any words occur in the context of wordi equally sumk Xik Besides Pij = P(j|i) = XijXi it is the probability that word jappears in the context of word i In Figure 25 suppose i = ice and j = steam we can

Figure 25 The measurement of co-occurrence probabilities from the selected contextwords and the target words ice and steam [Pennington et al 2014]

study the ratio of their co-occurrence probabilities with different context words kto exam the relationship of these words We can see solid co-occurs more frequentlywith ice than it does with steam Similarly gas co-occurs more frequently with steamthan it does with ice Based on the observations GloVe proposes a new weightedleast squares regression model and the cost function is

J =V

sumij=1

f (Xij)(wTi wj + bj + bj minus log Xij)

2

where V is the size of the vocabulary f (Xij) is a weighting function wi isin R is wordvector wj isin R is context word vector and bi and bj are biases The weighting functionhas the following properties First f (0) = 0 if one word doesnrsquot co-occur with otherwords Namely Xij = 0 Second this function should be non-decreasing in orderto guarantee the weight of frequent co-occurrences is higher than the weight of rareco-occurrences Third it should not overweight frequent co-occurrences To satisfythese properties and work well the weighting function is a piecewise function

f (x) =

(xxmax)α if x lt xmax

0 otherwise

where xmax is 100 and α is 34 The more detailed information can be found inPennington et al [2014]

sect24 Dynamic Word Embedding 13

24 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area they failto capture the polysemy (such as the word rsquobankrsquo does it mean a financial institutionor the land alongside the river) since they generate the same word vectors regardlessof the contexts Besides these traditional word embedding methods usually usethe shallow models and only leverage off the output from unsupervised models fordownstream tasks whereas the models are usually discarded after training Thiscauses the situation that the word vectors cannot be adjusted based on the contextTo address these issues there is a new trend to go from static word embeddingto dynamic word embedding As for dynamic word embedding it considers theword semantics in different contexts by using deep neural language models [Wangand Kuo 2020] For this section we will involve two start-of-the-art dynamic wordembedding methods one is ELMO and the other one is BERT Before introducingthese two methods pre-training will be briefly mentioned since these two methodsboth use pre-training to generalize the parameters

241 Pre-training

In order to make the word vectors become meaningful and capture the high-levelinformation as much as possible more and more researchers prefer to use the deeparchitectures such as the neural networks with many hidden layers The reason isthat complicated functions could represent high-level abstractions by applying thesedeep architectures [Erhan et al 2010] Also Erhan et al [2010] suggest that usingthe standard training schemes(random initialization) to train the deep architecturestends to make the parameters to have a poor generalization but the improvementis impressive for the deep models by using pre-training strategy Nowadays in thedifferent types of natural language processing tasks language model pre-traininghas advanced the state-of-the-art substantially [Dong et al 2019] In the case of NLPthe basic idea of pre-training is that the language models learn the contextualizedtext representations through a large amount of corpus with labeled or non-labeleddata sets Then the pre-trained model is fine tuned by the downstream tasks with asmall number of labeled data set

242 ELMO

Learning the high-quality word representations is a challenging thing since not onlythe model needs to consider the complex characteristics of the word use but alsothe words have different meanings across different linguistic contexts [Peters et al2018] In 2018 Peters et al [2018] propose a novel way which called Embeddingsfrom Language Models (ELMo) to address the mentioned challenges and generatehigh-quality word representations The experiments show ELMo works extremelywell in practice which can have up to 20 error reductions compared with otherstate-of-the-art

14 Background

Instead of using the fixed embedding vector for each word ELMo considers theentire sentence before assigning the embedding for each word In other words eachtoken is assigned a function of the entire input sentence [Peters et al 2018] ELMois consisted of a two-layer bidirectional language model (biLM) so it is a way toembedding from language models These two layers stacked together and eachof them has two passes ndash forward and backward As for the forward pass theinternal state of the certain word contains the information about itself and the contextbefore this words The backward pass is similar to forward except it contains theinformation about itself and the context after it The pair of forward and backwardinformation produce an intermediate word vector This intermediate word vectorrepresents the wordrsquos mean but it also lsquoknowsrsquo what happens in the rest of thesentence

Figure 26 EMLo is formed by a two-layer bidirectional language model The for-ward and the backward passes form a multilayer LSTM For each LSTM it passesthe result to the intermediate word vector The word representation can be get by

summing all the intermediate word vectors with some particular weights [Josht]

In addition those forward passes and backward passes form a multilayer LSTMas we can see in Figure 26 The intermediate word vectors are produced by the lowerlayers are fed into the higher layer The basic syntactic information is captured in thelower layers while semantic information is captured in the higher layers For each

sect24 Dynamic Word Embedding 15

token tk there are 2L+1 representations in a L-layer biML

Rk = xLMk minusrarrh LM

kj larrminush LM

kj |j = 1 L

As for the final word representation used for downstream NLP tasks ELMo com-bines all layers in R to a single vector based on some particular weight

ELMotaskk = E(Rk Θtask) = γtask

L

sumj=0

staskj hLM

kj

where stask is softmax-normalized weights γtask is scalar parameters and hLMkj =

[minusrarrh LM

kj larrminush LM

kj ]

243 BERT

Currently there are two approaches to apply the pre-trained language representa-tions to the downstream tasks one is feature-based approach and the other oneis fine-tuning approachAs for the feature-based approach the pre-trained repre-sentations are included as the additional features [Peters et al 2018] As for thefine-tuning approach the pre-trained modelrsquos parameters are unfrozen and do fine-tuning on the downstream tasks [Peters et al 2019] Both approaches share the sameobjective functions during pre-training and they have a similar performance ex-cept a general-purpose representation can be adapted to many different downstreamtasks in the fine-tuning approach [Peters et al 2019] Instead of using the unidirec-tional language models Devlin et al [2018] propose BERT stands for BidirectionalEncoder Representations from Transformers to improve the current the fine-tuningbased approach Although ELMo and BERT are both bidirectional LSTM in ELMo isnot deeply bidirectional since it only simply concatenates the left-to-right and right-to-left information and the representation still cannot simultaneously take advantageof both left and right contexts On the contrary transformers in BERT is deeply bidi-rectional since it can compute the attention over the entire sequence for every word(a wordrsquos meaning is defined by its context in the sentence)

BERTrsquos architecture is a multi-layer bidirectional Transformer encoder The de-tailed information about transformer is introduced in Section 213 The only differ-ence is that BERT only takes the encoder part BERT adds [CLS] for every sequence(pair of sentences) and uses [SEP] to separate sentences For each given token itsinput representation is the sum of token embedding segment embedding and theposition embedding

Devlin et al [2018] purpose two unsupervised tasks to do pre-training one ismasked language model (masked LM) and the other one is next sentence prediction(NSP) For masked LM BERT masks some percentage (15) of the input tokensrandomly and then predict these masked tokens Although this could allow us toget the bidirectional pre-trained model a downside exists that a mismatch is createdbetween pre-training and fine-tuning because [MASK] token does not appear in the

16 Background

fine-tuning step To solve this BERT replaces the selected token with [MASK] in80 of time replaces with a random token in 10 of the time and keeps the originaltoken in the other 10 of the time On the other hand BERT pre-trains for a binarizedNSP task in order to help the model to understand the relationship between twosentences The reason is that many downstream tasks such as natural languageinference aim to determine if one sentence can be inferred from others The result ishard to capture from the language modeling directly without knowing the meaningof these two sentences By using NSP the pre-trained model is very beneficial forthese tasks The introduced pre-trained BERT model will be applied for RQE andNLI tasks in this project As for how to fine-tuning we will introduce in the nextfollowing chapters

25 Tasks

251 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questionsexpressing the fact that ldquoa question A entails a question B if every answer to B is alsoa complete or partial answer to Ardquo [Abacha and Demner-Fushman 2016] Or in theplain English it is to determine if question A and question B are similar questionsThis task is binary classification it includes positive (two questions are similar) andnegative (two questions are not similar) Here are two examples from our data set

bull Example 1

bull (No answered) Question A Hi is it safe to take allergex Irsquom weeks pregnantand allergies are killing me my face is swollen pls

bull (Answered) Question B Hi my question is with regards to allergex tablets Ihave allergies and have managed to control them using allergex tablet pur-chased over the counter I am wanting to fall pregnant soon and want to knowthe dangers of using allergex preconceptionally and or during pregnancy

bull Question 1rarr Question 2

bull Example 2

bull (No answered) Question A Should I take my child for a flu shot she is allergicto eggs fish and honey

bull (Answered) Question B I am diabetes 2 Type Which is better Honey or sugar

bull Question 1 does not entail Question 2

RQE is the first step in our project and it uses here can help us to find the similarquestions which related to the unanswered questions Also this step can providesome useful information for the subsequent steps

sect25 Tasks 17

252 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sen-tences not just for questions In the task it is to predict if a hypothesis (H) can beinferred (entailment) not inferred (contradiction) or neither (neutral) from the givenpremise (P) [Romanov and Shivade 2018] This is a multilabel classification (threeclasses) problem one is entailment one is neutral and the other one contradictionHere are three examples from our data set

bull Example 1

bull Question Ever since this morning I had pain in my bladder and my urine isburningIt feels like my bladder is full the whole time and when I go to thetoilet there is nothing coming out

bull Answer It sounds like you may need something more See your doctor tocheck for a bladder infection Best wishes

bull Label entailment

bull Example 2

bull Question I am experiencing hot flushes I donrsquot want to take HRTrsquos I enjoygoing to the gym try to eat healthy and drink a lot of water every day I amvery concerned that by taking Menograine or the like I will gain weight or itwill prevent me from losing or maintaining my weight I do not take any othermedication other than vitamins

bull Answer There are health diets you can follow DietDoc will be able to helpyou Best wishes

bull Label contradiction

bull Example 3

bull Question I have noticed that my friends are using protein shakes and othersupplements to lose weight and getting muscles Do you advise that i useprotein shakes in my weight loss goals

bull Answer Hi there There was a time where any exercise during pregnancywas frowned upon it simply was not done it was considered harmful to themother and the baby This is crucial and once its obtained its advisable tokeep monitoring it with the doctor at all times

bull Label natural

In the context of our project this task used here is to validate if the similar answeredquestionsrsquo answers can be inferred from the unanswered questions in order to help toconstruct high accuracy question-answer pairs for data-driven models in the domainof patient question answering

18 Background

Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and themethods We also explain the architectures and techniques as well as the contribu-tions of them

In Section 31 and 32 we introduce some work that they use RQE and NLIon the medical data set in order to find the relation between the sentences As forSection 33 we demonstrate some work about how to expand and generate data set

31 RQE

MEDIQA 2019 shared task [Abacha et al 2019] is organized at the ACL-BioNLPworkshop Its motivation is to develop relevant methods and techniques in the med-ical domain for inference and entailment Meanwhile the application could help toimprove the domain-specific information retrieval and question answering systemsIn this competition it includes three tasks NLI RQE and question answering (QA)and the target is in the medical domain In this section and the next section wediscuss some work applied in RQE and NLI among the competition teams

The initial work for RQE is to use the lexical and semantic features as the in-put to the supervised machine learning approach Then using support vector ma-chine(SVM) logistic regression naiumlve Bayes and J48 respectively to find the re-sults [Abacha and Demner-Fushman 2016] In the competition the highlight is thatusing the deep networks reach better generalizations and abstractions of the ques-tions The commonly used approaches are to combine the ensemble method and themulti-task language models [Abacha et al 2019] More specifically the teams reliedon the recent state-of-the-art language models such as MT-DNN family ([Bhaskaret al 2019] [Zhu et al 2019]) and BERT family When trying the different tradi-tional deep learning methods Zhu et al [2019] find the result does not perform verywell compared with the pre-trained language model They think one reason is thetraditional deep learning models typically use a fixed pre-trained word embeddingto map the words into vectors and these embeddings do not consider the contextMeanwhile Zhu et al [2019] also find for the pre-trained language model MT-DNNbased model performs better than BERT based model since MT-DNN family is fine-tuned on GLUE benchmark multi-task learning mechanism based on BERT models

19

20 Related Work

So MT-DNN has a better ability to model the pairwise text classification tasks com-pared with the original BERT models on different tasks A mismatch exists in thetraining and validation set in the competition data sets such as the entailing ex-amples have high lexical overlap and the non-entailing examples are completelyunrelated so these cannot be the strong negative examples in the training set Toaddress this issue some synthetic data which are similar to the original data are cre-ated [Kumar et al 2019] In addition Bhaskar et al [2019] also expand the trainingset in order to capture salient biomedical features

32 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLIa new and public data set to reduce the gap for specialized and knowledge-intensivedomains They use feature-based system Bag-of-Words model (BOW) InferSentmodel and ESIM model to introduce the baseline Among them InferSent performsbetter compared with the other three methods In the competition same as RQEtask the common approaches for training are to use MT-DNN and BERT family andthen using an ensemble as the final system to generate the results [Abacha et al2019] On the other hand Wu et al [2019] purpose a hybrid approach for NLI Thebasic model of this approach is consisted of three encoders (syntax encoder text en-coder and feature encoder) with one softmax classifier As for text encoder they useMT-DNN and this allows more powerful and general word representations As forsyntax encoder they use Tree-LSTM to have a better linguistic understanding Forfeature encoder they use domain feature extractor and generic feature extractor toconcatenate their results This model gets significant performance in the testing dataset

33 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medicaldomain since it not only takes money but also spends time To label and generatehigh-quality data relevant domain experts need to be hired In general one expertis not enough since the results sometimes are too subjective that affect the qualityof the data [Romanov and Shivade 2018] There are many existing techniques thatcan be used in order to generate and label the required data sets Snorkel Dry-Bell [Bach et al 2019] is a novel way to label the data set by using weak supervisionmethods The result shows that this method plays a significant role in the industrialdevelopment of machine learning applications The basic idea of this approach isthat users write labeling functions for the unlabeled data set firstly Then severallabeling functions will pass to a generative model in order to estimate the accuracyfor different functions Following that the accuracy will be used to re-weight andthe probabilistic labels will be produced based on the combined labelsOn the otherhand one novel way to generate the data with high-quality in the specific knowledge

sect33 Augment Data Size 21

is proposed by Shen et al [2018] They use an unsupervised key phrase detector anda generator to generate data in the medical domain The results show that the gen-erated data is significantly improved for the examination QA system

22 Related Work

Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing data set Furthermore wedemonstrate the way of data collection We also analyze the collected data set fromdifferent perspectives such as the length and the content

41 Motivation

There are less researches in the area of patient question answering Also the existingmedical queries data sets have some limitations For example the data are too simplethat cannot mimic human behavior in reality or the lexical gap exists between thepatients and the experts To address these gaps and construct high-quality data setwhich can reflect the real-world biomedical patient question answering we think thedata set should have the following features

bull The data source must be reliable since it can set the standard in this domainand it is easier to convince others Also a reliable data set is crucial for ideasand models because it determines whether they are successful

bull The data source must come from the real-world In other words the questionsshould be asked by patients instead of finding or extracting them from relevantprofessional materials Meanwhile the answers should come from certifieddoctors In this way the constructed data can have good quality and they canbe used in the data-driven approaches

bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25705 question-answer pairs are crawled from Health24 website under22 categories Among them there are 23755 questions have answers and 1950questions have not been answered yet The raw data set is stored in resultcsv Inthis file the first column is answer and the second column is question For thecrawled data we remove the sensitive information such as age and date of birththe hyperlinks etc Instead of replacing other words for this sensitive informa-tion we leave the area blank since other words may affect the results Here aresome examples My month old son was diagnosed with some allergies a few weeks agoand convert ldquoThere are no cures but you can manage it with antihistamines and or desen-tisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)to ldquoThere are no cures but you can manage it with antihistamines andor desentisizationrdquoThe sentence length can be found in Table 41 From the table we can see the av-erage length of three different types is close together they are all around 85 to 95words every sentence However the maximum length has a huge difference Whenwe check the original website we find some usersrsquo questions are very detailed sincethey provide the support information for their queries as much as possible Alsothe experts provide answers step by step corresponding to patientsrsquo questions andtelling them what they should do and not to do Thus we keep these informativesentences when we do the experiments We also find some answers are incompletesuch as Good morning or Hi Lucy For these answers we treat their questions asno-answered questions since the expertsrsquo information cannot solve patientsrsquo queries

2Scrapy httpsscrapyorg

sect43 Analysis for Data set 25

Average Length Maximum LengthAnswered Questions 9540 5417Unanswered Questions 8502 1890Answers 9258 6441

Table 41 The length of questions and answers

26 Construction of Health24 Data set

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the pro-posed method for this project In addition the detailed information about fine-tuning and three scoring methods for filtering the similar questions are explained aswell We also mentioned the evaluation metrics

51 Pre-processing

There are many medical concepts in the sentences the forms they can representby either abbreviations or multi-word expressions but they all stand for the samemeaning To expand the abbreviations and make the sentences to have informa-tion as much as possible(eg expand from BP to blood pressure) we use ScispaCy[Neumann et al 2019] an industrial-strength natural language processing approachfor processing biomedical scientific or clinical text The model here we use isen_ner_bionlp13cg_md since it is trained on the BIONLP13CG corpus

When we use TF-IDF BM25 and Word2Vec to calculate the similarity scores be-tween the questions we also do pre-processing for the original questions We firstlyconvert all the words into lowercase Then we remove any punctuation Follow-ing that we use the word tokenization to tokenize the words in order to split thesentences into individual units We also remove the stop words in the sentencesBesides we apply stemming for the words in order to transform the words into theirbases or roots Through these steps the raw questions can be transformed into amore suitable one for the following tasks

52 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model since we onlyneed to plug in the input and output into BERT and fine-tune all the parametersend-to-end [Devlin et al 2018] BERT prepends a [CLS] stands for classification tothe start of each sentence So when doing the classification problem we only needto feed the [CLS] representation into any arbitrary classifier to get the results insteadof summing all the wordsrsquo representations together Besides RQE and NLI are both

27

28 Methodology

classification problems The only difference is one is the binary classification andthe other one is the three-label classification Thus they can share one pre-trainedmodel

When fine-tuning the model for RQE task the data set comprises consumer healthquestions (CHQs) received by the US National Library of medicine (NLM) and Fre-quently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019] There arerespectively 8890 302 and 230 medical question pairs in training validation andtest sets created by Abacha and Demner-Fushman [2016] and the labels are true andfalse After analyzing these data we find there is a mismatch between training andvalidation sets for example the entailing examples in the training set are simpleand the non-entailing examples are completely unrelated However the sentences inthe validation set are complex To reduce the gap and help the model with bettergeneralization we combine all the data sets to do fine-tuning

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade 2018]. This set includes 11232, 1395 and 1422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. Same as the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface.1
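The following is a minimal sketch of this setup, not the modified run_classifier.py itself. It assumes the allenai/scibert_scivocab_uncased checkpoint from the Hugging Face hub, uses a tiny made-up sample in place of the combined CHQs set, and feeds one pair per step for brevity where the real runs batched 16 examples:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
# num_labels=2 for the binary RQE task; 3 would be used for NLI.
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

train_pairs = [  # tiny made-up sample; the real combined set has 9422 pairs
    ("Is allergex safe during pregnancy?",
     "Can I take allergex tablets while pregnant?", 1),
    ("Should my child get a flu shot?",
     "Is honey better than sugar for diabetics?", 0),
]

accumulation_steps = 4  # gradient accumulation, as described in the text
model.train()
for step, (q1, q2, label) in enumerate(train_pairs):
    # Encode the sentence pair; the tokenizer inserts [CLS] and [SEP].
    enc = tokenizer(q1, q2, truncation=True, max_length=128,
                    return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([label])).loss
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()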

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for a second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

1Huggingface GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: we assume that if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferred from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
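A sketch of this rejection rule, assuming the model's raw logits for a batch of pairs are already available:

import torch
import torch.nn.functional as F

def predict_with_rejection(logits, threshold):
    # Turn logits into class probabilities, then reject every example
    # whose highest probability falls below the threshold
    # (0.6 for binary RQE, 0.4 for three-label NLI).
    probs = F.softmax(logits, dim=-1)
    confidence, labels = probs.max(dim=-1)
    accepted = confidence >= threshold
    return labels[accepted], accepted

# Two RQE pairs: the first is confidently labeled, the second is rejected.
logits = torch.tensor([[0.2, 1.9], [0.4, 0.5]])
labels, mask = predict_with_rejection(logits, threshold=0.6)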

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection of documents [Rajaraman and Ullman 2011]. It consists of two parts: one is the term frequency, which measures the number of times the word occurs in a particular document; the other is the inverse document frequency, which measures how much information the word provides. If the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count(d ∈ D : t ∈ d) )

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.
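As a sketch of the filtering step described in Section 5.3 (this uses scikit-learn rather than the project's own tfidf.py, and the question lists are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["allergex tablets pregnancy safe",
            "honey sugar diabetes"]
unanswered = ["safe take allergex pregnant"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(answered + unanswered)
answered_vecs = matrix[:len(answered)]
unanswered_vecs = matrix[len(answered):]

# One row per unanswered question, one column per answered question.
sims = cosine_similarity(unanswered_vecs, answered_vecs)

# Indices of the five most similar answered questions for each unanswered one.
top5 = sims.argsort(axis=1)[:, -5:][:, ::-1]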

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1..n} IDF(q_i, D) · [ f(q_i, D) · (k1 + 1) ] / [ f(q_i, D) + k1 · (1 − b + b · |D| / d_avg) ]

where

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75.
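The project's own bm25.py is not reproduced here; as a sketch, the rank_bm25 package implements the same scoring with the hyper-parameters above, over tokenized answered questions (the corpus and query below are made up):

from rank_bm25 import BM25Okapi

tokenized_answered = [
    ["allergex", "tablets", "pregnancy", "safe"],
    ["honey", "sugar", "diabetes"],
]
query = ["safe", "take", "allergex", "pregnant"]

bm25 = BM25Okapi(tokenized_answered, k1=2.0, b=0.75)

# Score the unanswered question against every answered question
# and keep the indices of the five best matches.
scores = bm25.get_scores(query)
top5_idx = sorted(range(len(scores)), key=lambda i: scores[i],
                  reverse=True)[:5]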

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury 2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between the sentences.
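A sketch of this additive sentence representation with Gensim; the embedding file name and the token lists are assumptions standing in for the Chiu et al. [2016] vectors and real pre-processed questions:

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name for the 200-dimensional PubMed vectors.
kv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the embeddings of in-vocabulary words; out-of-vocabulary
    # words (e.g. colloquialisms, typos) are silently skipped.
    vecs = [kv[t] for t in tokens if t in kv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

tokens_a = ["bladder", "pain", "urine", "burning"]
tokens_b = ["burning", "urine", "infection"]
similarity = cosine(sentence_vector(tokens_a), sentence_vector(tokens_b))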

5.5 Evaluation Metrics

For the evaluation measures, we use accuracy, recall and the confusion matrix to measure the performance of the models. The confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances aligned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN means the instances aligned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is recall = TP / (TP + FN). As for multi-category classification, the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix
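These quantities can be reproduced with scikit-learn (a sketch on made-up labels); note that its confusion matrix is transposed relative to Table 5.1, with rows as actual classes and columns as predicted classes:

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0]  # made-up gold labels
y_pred = [1, 0, 0, 1, 0]  # made-up predictions

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 0.8
print(recall_score(y_true, y_pred))      # TP / (TP + FN) = 2/3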


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work; the upper part is RQE and the bottom part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model by k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potential similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test inputs are question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding extra data, compared with the result of the original data. In order to validate our approach, we calculate the p-value. This helps us determine the significance of the results in relation to the null hypothesis: if the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = (x̄1 − x̄2) / sqrt( s² · (1/n1 + 1/n2) )

where

s² = [ Σ_{i=1..n1} (x_i − x̄1)² + Σ_{j=1..n2} (x_j − x̄2)² ] / (n1 + n2 − 2)
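As a sketch, SciPy's independent two-sample t-test implements exactly this pooled-variance formula; the values below are the cross-validation accuracies from Table 6.1, and the call returns a two-sided p-value (the p-values reported in Table 6.2 may have been computed differently, e.g. one-sided):

from scipy import stats

# Accuracies over the five cross-validation iterations (Table 6.1).
chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# equal_var=True selects the pooled-variance (classic unpaired) t-test.
t_stat, p_value = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)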

From Table 6.2 we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model which we use for Word2Vec was trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors will lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, such a small amount of added data might not help fine-tune the model as well as a larger amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not introduce much noise.

              CHQs      CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682    0.9709          0.9693        0.9710
Iteration 2   0.9655    0.9731          0.9708        0.9699
Iteration 3   0.9676    0.9742          0.9747        0.9710
Iteration 4   0.9682    0.9692          0.9709        0.9726
Iteration 5   0.9731    0.9775          0.9743        0.9715
Average       0.9685    0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added data and the p-values for the RQE task

Table 6.4 shows the number of added data and the p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs were added. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task (a sketch of this downsampling follows this list). On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has fewer data.
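A minimal sketch of the balancing step just described, using a hypothetical helper rather than the project's exact code:

from collections import Counter

def balance(pairs, labels):
    # Downsample every class to the size of the smallest class.
    counts = Counter(labels)
    smallest = min(counts.values())
    kept, seen = [], Counter()
    for pair, label in zip(pairs, labels):
        if seen[label] < smallest:
            kept.append((pair, label))
            seen[label] += 1
    return kept

pairs = ["p0", "p1", "p2", "p3", "p4", "p5"]
labels = [0, 0, 0, 1, 2, 2]
print(balance(pairs, labels))  # keeps one pair per class here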

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with it, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, such a small sample size cannot support an accurate p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI    MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527    0.7601            0.7562          0.7453
Iteration 2   0.7522    0.7426            0.7426          0.7515
Iteration 3   0.7504    0.7469            0.7554          0.7540
Iteration 4   0.7402    0.7593            0.7536          0.7430
Iteration 5   0.7640    0.7657            0.7586          0.7569
Average       0.7523    0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added data and the p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, brings many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set there are 23755 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step helps us to roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that could be improved:

• For this project we only have 1950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, so it easily overfits on a small data set. As a next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task differ from the original NLI data. As a next step, we could find more suitable data of a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could do hyperparameter optimization. Also, we could increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Descriptionand Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the py files; and the ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy.1

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For py files:

bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.2

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and the processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.3

load_file.py converts the raw source data to a dataframe. pre_processed_data pre-processes the sentences.

For ipynb files:

1Scrapy: https://scrapy.org
2ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The py files run in PyCharm; the ipynb files run in Google Colab.4

4Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: py files and ipynb files. The py files are used to process the data. The ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

Page 23: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

12 Background

some form of the meaning the ratios of word-word co-occurrence probabilities havethe potential Before analyzing Figure 25 some notations are established X repre-sents the matrix of word-word co-occurrence counts where the entity Xij means thenumber of times word j occurs in the context of word i within the specific contextwindows Xi means the number of times any words occur in the context of wordi equally sumk Xik Besides Pij = P(j|i) = XijXi it is the probability that word jappears in the context of word i In Figure 25 suppose i = ice and j = steam we can

Figure 25 The measurement of co-occurrence probabilities from the selected contextwords and the target words ice and steam [Pennington et al 2014]

study the ratio of their co-occurrence probabilities with different context words kto exam the relationship of these words We can see solid co-occurs more frequentlywith ice than it does with steam Similarly gas co-occurs more frequently with steamthan it does with ice Based on the observations GloVe proposes a new weightedleast squares regression model and the cost function is

J =V

sumij=1

f (Xij)(wTi wj + bj + bj minus log Xij)

2

where V is the size of the vocabulary f (Xij) is a weighting function wi isin R is wordvector wj isin R is context word vector and bi and bj are biases The weighting functionhas the following properties First f (0) = 0 if one word doesnrsquot co-occur with otherwords Namely Xij = 0 Second this function should be non-decreasing in orderto guarantee the weight of frequent co-occurrences is higher than the weight of rareco-occurrences Third it should not overweight frequent co-occurrences To satisfythese properties and work well the weighting function is a piecewise function

f (x) =

(xxmax)α if x lt xmax

0 otherwise

where xmax is 100 and α is 34 The more detailed information can be found inPennington et al [2014]

sect24 Dynamic Word Embedding 13

24 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area they failto capture the polysemy (such as the word rsquobankrsquo does it mean a financial institutionor the land alongside the river) since they generate the same word vectors regardlessof the contexts Besides these traditional word embedding methods usually usethe shallow models and only leverage off the output from unsupervised models fordownstream tasks whereas the models are usually discarded after training Thiscauses the situation that the word vectors cannot be adjusted based on the contextTo address these issues there is a new trend to go from static word embeddingto dynamic word embedding As for dynamic word embedding it considers theword semantics in different contexts by using deep neural language models [Wangand Kuo 2020] For this section we will involve two start-of-the-art dynamic wordembedding methods one is ELMO and the other one is BERT Before introducingthese two methods pre-training will be briefly mentioned since these two methodsboth use pre-training to generalize the parameters

241 Pre-training

In order to make the word vectors become meaningful and capture the high-levelinformation as much as possible more and more researchers prefer to use the deeparchitectures such as the neural networks with many hidden layers The reason isthat complicated functions could represent high-level abstractions by applying thesedeep architectures [Erhan et al 2010] Also Erhan et al [2010] suggest that usingthe standard training schemes(random initialization) to train the deep architecturestends to make the parameters to have a poor generalization but the improvementis impressive for the deep models by using pre-training strategy Nowadays in thedifferent types of natural language processing tasks language model pre-traininghas advanced the state-of-the-art substantially [Dong et al 2019] In the case of NLPthe basic idea of pre-training is that the language models learn the contextualizedtext representations through a large amount of corpus with labeled or non-labeleddata sets Then the pre-trained model is fine tuned by the downstream tasks with asmall number of labeled data set

242 ELMO

Learning the high-quality word representations is a challenging thing since not onlythe model needs to consider the complex characteristics of the word use but alsothe words have different meanings across different linguistic contexts [Peters et al2018] In 2018 Peters et al [2018] propose a novel way which called Embeddingsfrom Language Models (ELMo) to address the mentioned challenges and generatehigh-quality word representations The experiments show ELMo works extremelywell in practice which can have up to 20 error reductions compared with otherstate-of-the-art

14 Background

Instead of using the fixed embedding vector for each word ELMo considers theentire sentence before assigning the embedding for each word In other words eachtoken is assigned a function of the entire input sentence [Peters et al 2018] ELMois consisted of a two-layer bidirectional language model (biLM) so it is a way toembedding from language models These two layers stacked together and eachof them has two passes ndash forward and backward As for the forward pass theinternal state of the certain word contains the information about itself and the contextbefore this words The backward pass is similar to forward except it contains theinformation about itself and the context after it The pair of forward and backwardinformation produce an intermediate word vector This intermediate word vectorrepresents the wordrsquos mean but it also lsquoknowsrsquo what happens in the rest of thesentence

Figure 26 EMLo is formed by a two-layer bidirectional language model The for-ward and the backward passes form a multilayer LSTM For each LSTM it passesthe result to the intermediate word vector The word representation can be get by

summing all the intermediate word vectors with some particular weights [Josht]

In addition those forward passes and backward passes form a multilayer LSTMas we can see in Figure 26 The intermediate word vectors are produced by the lowerlayers are fed into the higher layer The basic syntactic information is captured in thelower layers while semantic information is captured in the higher layers For each

sect24 Dynamic Word Embedding 15

token tk there are 2L+1 representations in a L-layer biML

Rk = xLMk minusrarrh LM

kj larrminush LM

kj |j = 1 L

As for the final word representation used for downstream NLP tasks ELMo com-bines all layers in R to a single vector based on some particular weight

ELMotaskk = E(Rk Θtask) = γtask

L

sumj=0

staskj hLM

kj

where stask is softmax-normalized weights γtask is scalar parameters and hLMkj =

[minusrarrh LM

kj larrminush LM

kj ]

243 BERT

Currently there are two approaches to apply the pre-trained language representa-tions to the downstream tasks one is feature-based approach and the other oneis fine-tuning approachAs for the feature-based approach the pre-trained repre-sentations are included as the additional features [Peters et al 2018] As for thefine-tuning approach the pre-trained modelrsquos parameters are unfrozen and do fine-tuning on the downstream tasks [Peters et al 2019] Both approaches share the sameobjective functions during pre-training and they have a similar performance ex-cept a general-purpose representation can be adapted to many different downstreamtasks in the fine-tuning approach [Peters et al 2019] Instead of using the unidirec-tional language models Devlin et al [2018] propose BERT stands for BidirectionalEncoder Representations from Transformers to improve the current the fine-tuningbased approach Although ELMo and BERT are both bidirectional LSTM in ELMo isnot deeply bidirectional since it only simply concatenates the left-to-right and right-to-left information and the representation still cannot simultaneously take advantageof both left and right contexts On the contrary transformers in BERT is deeply bidi-rectional since it can compute the attention over the entire sequence for every word(a wordrsquos meaning is defined by its context in the sentence)

BERTrsquos architecture is a multi-layer bidirectional Transformer encoder The de-tailed information about transformer is introduced in Section 213 The only differ-ence is that BERT only takes the encoder part BERT adds [CLS] for every sequence(pair of sentences) and uses [SEP] to separate sentences For each given token itsinput representation is the sum of token embedding segment embedding and theposition embedding

Devlin et al [2018] purpose two unsupervised tasks to do pre-training one ismasked language model (masked LM) and the other one is next sentence prediction(NSP) For masked LM BERT masks some percentage (15) of the input tokensrandomly and then predict these masked tokens Although this could allow us toget the bidirectional pre-trained model a downside exists that a mismatch is createdbetween pre-training and fine-tuning because [MASK] token does not appear in the

16 Background

fine-tuning step To solve this BERT replaces the selected token with [MASK] in80 of time replaces with a random token in 10 of the time and keeps the originaltoken in the other 10 of the time On the other hand BERT pre-trains for a binarizedNSP task in order to help the model to understand the relationship between twosentences The reason is that many downstream tasks such as natural languageinference aim to determine if one sentence can be inferred from others The result ishard to capture from the language modeling directly without knowing the meaningof these two sentences By using NSP the pre-trained model is very beneficial forthese tasks The introduced pre-trained BERT model will be applied for RQE andNLI tasks in this project As for how to fine-tuning we will introduce in the nextfollowing chapters

25 Tasks

251 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questionsexpressing the fact that ldquoa question A entails a question B if every answer to B is alsoa complete or partial answer to Ardquo [Abacha and Demner-Fushman 2016] Or in theplain English it is to determine if question A and question B are similar questionsThis task is binary classification it includes positive (two questions are similar) andnegative (two questions are not similar) Here are two examples from our data set

bull Example 1

bull (No answered) Question A Hi is it safe to take allergex Irsquom weeks pregnantand allergies are killing me my face is swollen pls

bull (Answered) Question B Hi my question is with regards to allergex tablets Ihave allergies and have managed to control them using allergex tablet pur-chased over the counter I am wanting to fall pregnant soon and want to knowthe dangers of using allergex preconceptionally and or during pregnancy

bull Question 1rarr Question 2

bull Example 2

bull (No answered) Question A Should I take my child for a flu shot she is allergicto eggs fish and honey

bull (Answered) Question B I am diabetes 2 Type Which is better Honey or sugar

bull Question 1 does not entail Question 2

RQE is the first step in our project and it uses here can help us to find the similarquestions which related to the unanswered questions Also this step can providesome useful information for the subsequent steps

sect25 Tasks 17

252 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sen-tences not just for questions In the task it is to predict if a hypothesis (H) can beinferred (entailment) not inferred (contradiction) or neither (neutral) from the givenpremise (P) [Romanov and Shivade 2018] This is a multilabel classification (threeclasses) problem one is entailment one is neutral and the other one contradictionHere are three examples from our data set

bull Example 1

bull Question Ever since this morning I had pain in my bladder and my urine isburningIt feels like my bladder is full the whole time and when I go to thetoilet there is nothing coming out

bull Answer It sounds like you may need something more See your doctor tocheck for a bladder infection Best wishes

bull Label entailment

bull Example 2

bull Question I am experiencing hot flushes I donrsquot want to take HRTrsquos I enjoygoing to the gym try to eat healthy and drink a lot of water every day I amvery concerned that by taking Menograine or the like I will gain weight or itwill prevent me from losing or maintaining my weight I do not take any othermedication other than vitamins

bull Answer There are health diets you can follow DietDoc will be able to helpyou Best wishes

bull Label contradiction

bull Example 3

bull Question I have noticed that my friends are using protein shakes and othersupplements to lose weight and getting muscles Do you advise that i useprotein shakes in my weight loss goals

bull Answer Hi there There was a time where any exercise during pregnancywas frowned upon it simply was not done it was considered harmful to themother and the baby This is crucial and once its obtained its advisable tokeep monitoring it with the doctor at all times

bull Label natural

In the context of our project this task used here is to validate if the similar answeredquestionsrsquo answers can be inferred from the unanswered questions in order to help toconstruct high accuracy question-answer pairs for data-driven models in the domainof patient question answering

18 Background

Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and themethods We also explain the architectures and techniques as well as the contribu-tions of them

In Section 31 and 32 we introduce some work that they use RQE and NLIon the medical data set in order to find the relation between the sentences As forSection 33 we demonstrate some work about how to expand and generate data set

31 RQE

MEDIQA 2019 shared task [Abacha et al 2019] is organized at the ACL-BioNLPworkshop Its motivation is to develop relevant methods and techniques in the med-ical domain for inference and entailment Meanwhile the application could help toimprove the domain-specific information retrieval and question answering systemsIn this competition it includes three tasks NLI RQE and question answering (QA)and the target is in the medical domain In this section and the next section wediscuss some work applied in RQE and NLI among the competition teams

The initial work for RQE is to use the lexical and semantic features as the in-put to the supervised machine learning approach Then using support vector ma-chine(SVM) logistic regression naiumlve Bayes and J48 respectively to find the re-sults [Abacha and Demner-Fushman 2016] In the competition the highlight is thatusing the deep networks reach better generalizations and abstractions of the ques-tions The commonly used approaches are to combine the ensemble method and themulti-task language models [Abacha et al 2019] More specifically the teams reliedon the recent state-of-the-art language models such as MT-DNN family ([Bhaskaret al 2019] [Zhu et al 2019]) and BERT family When trying the different tradi-tional deep learning methods Zhu et al [2019] find the result does not perform verywell compared with the pre-trained language model They think one reason is thetraditional deep learning models typically use a fixed pre-trained word embeddingto map the words into vectors and these embeddings do not consider the contextMeanwhile Zhu et al [2019] also find for the pre-trained language model MT-DNNbased model performs better than BERT based model since MT-DNN family is fine-tuned on GLUE benchmark multi-task learning mechanism based on BERT models

19

20 Related Work

So MT-DNN has a better ability to model the pairwise text classification tasks com-pared with the original BERT models on different tasks A mismatch exists in thetraining and validation set in the competition data sets such as the entailing ex-amples have high lexical overlap and the non-entailing examples are completelyunrelated so these cannot be the strong negative examples in the training set Toaddress this issue some synthetic data which are similar to the original data are cre-ated [Kumar et al 2019] In addition Bhaskar et al [2019] also expand the trainingset in order to capture salient biomedical features

32 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLIa new and public data set to reduce the gap for specialized and knowledge-intensivedomains They use feature-based system Bag-of-Words model (BOW) InferSentmodel and ESIM model to introduce the baseline Among them InferSent performsbetter compared with the other three methods In the competition same as RQEtask the common approaches for training are to use MT-DNN and BERT family andthen using an ensemble as the final system to generate the results [Abacha et al2019] On the other hand Wu et al [2019] purpose a hybrid approach for NLI Thebasic model of this approach is consisted of three encoders (syntax encoder text en-coder and feature encoder) with one softmax classifier As for text encoder they useMT-DNN and this allows more powerful and general word representations As forsyntax encoder they use Tree-LSTM to have a better linguistic understanding Forfeature encoder they use domain feature extractor and generic feature extractor toconcatenate their results This model gets significant performance in the testing dataset

33 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medicaldomain since it not only takes money but also spends time To label and generatehigh-quality data relevant domain experts need to be hired In general one expertis not enough since the results sometimes are too subjective that affect the qualityof the data [Romanov and Shivade 2018] There are many existing techniques thatcan be used in order to generate and label the required data sets Snorkel Dry-Bell [Bach et al 2019] is a novel way to label the data set by using weak supervisionmethods The result shows that this method plays a significant role in the industrialdevelopment of machine learning applications The basic idea of this approach isthat users write labeling functions for the unlabeled data set firstly Then severallabeling functions will pass to a generative model in order to estimate the accuracyfor different functions Following that the accuracy will be used to re-weight andthe probabilistic labels will be produced based on the combined labelsOn the otherhand one novel way to generate the data with high-quality in the specific knowledge

sect33 Augment Data Size 21

is proposed by Shen et al [2018] They use an unsupervised key phrase detector anda generator to generate data in the medical domain The results show that the gen-erated data is significantly improved for the examination QA system

22 Related Work

Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing data set Furthermore wedemonstrate the way of data collection We also analyze the collected data set fromdifferent perspectives such as the length and the content

41 Motivation

There are less researches in the area of patient question answering Also the existingmedical queries data sets have some limitations For example the data are too simplethat cannot mimic human behavior in reality or the lexical gap exists between thepatients and the experts To address these gaps and construct high-quality data setwhich can reflect the real-world biomedical patient question answering we think thedata set should have the following features

bull The data source must be reliable since it can set the standard in this domainand it is easier to convince others Also a reliable data set is crucial for ideasand models because it determines whether they are successful

bull The data source must come from the real-world In other words the questionsshould be asked by patients instead of finding or extracting them from relevantprofessional materials Meanwhile the answers should come from certifieddoctors In this way the constructed data can have good quality and they canbe used in the data-driven approaches

bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25705 question-answer pairs are crawled from Health24 website under22 categories Among them there are 23755 questions have answers and 1950questions have not been answered yet The raw data set is stored in resultcsv Inthis file the first column is answer and the second column is question For thecrawled data we remove the sensitive information such as age and date of birththe hyperlinks etc Instead of replacing other words for this sensitive informa-tion we leave the area blank since other words may affect the results Here aresome examples My month old son was diagnosed with some allergies a few weeks agoand convert ldquoThere are no cures but you can manage it with antihistamines and or desen-tisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)to ldquoThere are no cures but you can manage it with antihistamines andor desentisizationrdquoThe sentence length can be found in Table 41 From the table we can see the av-erage length of three different types is close together they are all around 85 to 95words every sentence However the maximum length has a huge difference Whenwe check the original website we find some usersrsquo questions are very detailed sincethey provide the support information for their queries as much as possible Alsothe experts provide answers step by step corresponding to patientsrsquo questions andtelling them what they should do and not to do Thus we keep these informativesentences when we do the experiments We also find some answers are incompletesuch as Good morning or Hi Lucy For these answers we treat their questions asno-answered questions since the expertsrsquo information cannot solve patientsrsquo queries

² Scrapy: https://scrapy.org


                        Average Length    Maximum Length
Answered Questions          95.40              5417
Unanswered Questions        85.02              1890
Answers                     92.58              6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, the details of fine-tuning and of the three scoring methods for filtering similar questions are explained as well. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences, and they can be represented either by abbreviations or by multi-word expressions that stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
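The snippet below sketches how the model can be loaded and applied. The abbreviation map is a hypothetical stand-in for the expansion step, which is not part of the NER model itself.

import spacy

# Load the ScispaCy model trained on the BIONLP13CG corpus.
nlp = spacy.load("en_ner_bionlp13cg_md")

# Hypothetical abbreviation dictionary; the real expansion list is larger.
ABBREVIATIONS = {"BP": "blood pressure", "UTI": "urinary tract infection"}

def expand_abbreviations(text):
    # Replace known abbreviations token by token before further processing.
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

doc = nlp(expand_abbreviations("My BP has been high since last week."))
print([(ent.text, ent.label_) for ent in doc.ents])  # recognized biomedical entities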

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase, then remove any punctuation. Following that, we apply word tokenization to split the sentences into individual units, and we remove the stop words. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
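As a sketch, the steps above can be implemented with NLTK (one of the packages listed in the readme); the exact tokenizer and stemmer choices here are assumptions.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(question):
    # Lowercase, strip punctuation, tokenize, drop stop words, then stem.
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return [STEMMER.stem(token) for token in tokens if token not in STOP_WORDS]

print(preprocess("Is it safe to take Allergex while I am pregnant?"))
# e.g. ['safe', 'take', 'allergex', 'pregnant']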

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is a binary classification and the other is a three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, while the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.
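A minimal sketch of this fine-tuning setup with the Hugging Face transformers library is shown below. The toy train_pairs list stands in for the combined CHQs (or MedNLI) data, and the loop omits the cross-validation wrapper.

import torch
from transformers import BertForSequenceClassification, BertTokenizer

MODEL = "allenai/scibert_scivocab_uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL)
model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)  # 3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

BATCH_SIZE = 16
ACCUM_STEPS = 4  # gradient accumulation steps
EPOCHS = 1       # 1 for RQE, 2 for NLI

# Toy stand-in for the real (sentence_a, sentence_b, label) training pairs.
train_pairs = [("Is Allergex safe in pregnancy?", "Can I take Allergex while pregnant?", 1)]

model.train()
for _ in range(EPOCHS):
    for step in range(0, len(train_pairs), BATCH_SIZE):
        batch = train_pairs[step:step + BATCH_SIZE]
        texts_a, texts_b, labels = zip(*batch)
        # BERT receives both sentences at once, separated by [SEP],
        # with [CLS] prepended for classification.
        enc = tokenizer(list(texts_a), list(texts_b), padding=True,
                        truncation=True, return_tensors="pt")
        loss = model(**enc, labels=torch.tensor(labels)).loss
        (loss / ACCUM_STEPS).backward()  # accumulate gradients over small batches
        if (step // BATCH_SIZE + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()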

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second round of retrieval by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

¹ Hugging Face GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs for each unanswered question (see the sketch below). This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the details are introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: if the unanswered question and the answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.
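The filtering step can be sketched as follows; score_fn stands for any of the three similarity methods of Section 5.4.

def top_candidates(unanswered, answered, score_fn, k=5):
    # For each unanswered question, keep only the k highest-scoring
    # answered questions as candidate pairs for the RQE test set.
    pairs = []
    for uq in unanswered:
        ranked = sorted(answered, key=lambda aq: score_fn(uq, aq), reverse=True)
        pairs.extend((uq, aq) for aq in ranked[:k])
    return pairs

# 10 unanswered x 100 answered questions yields 50 pairs instead of 1,000.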

When using the fine-tuned model to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications, respectively. This method helps us to handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
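A sketch of the rejection rule, assuming the model outputs one logit vector per pair:

import torch

def predict_or_reject(logits, threshold):
    # Turn logits into class probabilities and reject ambiguous pairs
    # whose maximum probability is below the task threshold.
    probs = torch.softmax(logits, dim=-1)
    confidence, label = probs.max(dim=-1)
    # None marks a rejected pair (removed instead of sent to experts).
    return [label_i.item() if conf_i >= threshold else None
            for conf_i, label_i in zip(confidence, label)]

rqe_labels = predict_or_reject(torch.tensor([[0.2, 1.1]]), threshold=0.6)  # binary RQE
nli_labels = predict_or_reject(torch.tensor([[0.5, 0.1, 0.2]]), threshold=0.4)  # three-label NLI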

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is the term frequency, which measures the number of times the word occurs in a particular document. The other is the inverse document frequency, which measures how much information the word provides: if the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF, and here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count({d ∈ D : t ∈ d}) )

where t is the word we want to weight, d is its document, D is the set of documents and N is the total number of documents. TF-IDF can only measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in the vectors A and B represents one word's TF-IDF weight.
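In practice this similarity can be computed with scikit-learn, as sketched below; note that scikit-learn's TfidfVectorizer uses a slightly smoothed variant of the IDF formula above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["my question is with regards to allergex tablets",
            "i am diabetes 2 type which is better honey or sugar"]
unanswered = ["is it safe to take allergex while pregnant"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(answered)   # one TF-IDF vector per question
query_vec = vectorizer.transform(unanswered)

# Cosine similarity between the unanswered question and every answered one;
# the highest-scoring indices form the candidate pairs.
scores = cosine_similarity(query_vec, doc_matrix)[0]
print(scores.argsort()[::-1][:5])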

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i) · [ f(q_i, D) · (k1 + 1) ] / [ f(q_i, D) + k1 · (1 − b + b · |D| / d_avg) ]

where

• D is the document and Q is the query; q_i is a term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75.
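Below is a direct sketch of this scoring function over pre-tokenized questions. The smoothed IDF used here is the common Okapi variant, an assumption since the exact IDF form is not spelled out above.

import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=2.0, b=0.75):
    # Score every tokenized document against the query with the formula above.
    n_docs = len(docs_tokens)
    d_avg = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter(tok for d in docs_tokens for tok in set(d))  # document frequency
    scores = []
    for d in docs_tokens:
        freq = Counter(d)
        score = 0.0
        for q in query_tokens:
            if q not in df:
                continue
            idf = math.log(1 + (n_docs - df[q] + 0.5) / (df[q] + 0.5))
            tf = freq[q]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / d_avg))
        scores.append(score)
    return scores

docs = [["honey", "sugar", "diabetes"], ["allergex", "pregnant", "safe"]]
print(bm25_scores(["allergex", "pregnant"], docs))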

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add all the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
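A sketch of this sentence-level similarity with gensim is given below; the embedding file name is a placeholder for the Chiu et al. [2016] vectors.

import numpy as np
from gensim.models import KeyedVectors

# Placeholder file name for the pre-trained 200-dimensional PubMed vectors.
vectors = KeyedVectors.load_word2vec_format("pubmed_word2vec_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary words; out-of-vocabulary words are
    # skipped, which is the mismatch problem discussed in Chapter 6.
    in_vocab = [vectors[t] for t in tokens if t in vectors]
    return np.sum(in_vocab, axis=0) if in_vocab else np.zeros(vectors.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

print(cosine(sentence_vector(["bladder", "pain"]),
             sentence_vector(["burning", "urine"])))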

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall and the confusion matrix to measure the performance of the models. The confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances aligned to the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances aligned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). As for multi-category classification, the confusion matrix is more complex, but it follows the same idea.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)           TP                    FP
Predicted Negative (0)           FN                    TN

Table 5.1: The binary confusion matrix
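These metrics can be computed with scikit-learn, as sketched below; note that scikit-learn prints the confusion matrix with actual classes as rows, i.e., the transpose of Table 5.1.

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)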


Figure 5.1: The diagram shows the process of the RQE and NLI tasks. The upper part is RQE and the bottom part is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model by k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potential similar question) pair; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula for the unpaired t-test:

t = (x̄1 − x̄2) / sqrt( s² · (1/n1 + 1/n2) )

where

s² = [ Σ_{i=1}^{n1} (x_i − x̄1)² + Σ_{j=1}^{n2} (x_j − x̄2)² ] / (n1 + n2 − 2)

Here x̄1 and x̄2 are the sample means, and n1 and n2 are the sample sizes.
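This pooled-variance test is what scipy.stats.ttest_ind computes with equal_var=True. The sketch below applies it to the accuracies of Table 6.1; halving the two-sided p-value for a one-sided hypothesis is our assumption about how the reported p-values were obtained.

from scipy import stats

# Per-iteration accuracies from Table 6.1 (CHQs alone vs. CHQs + TF-IDF).
baseline = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
with_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

t_stat, p_two_sided = stats.ttest_ind(with_tfidf, baseline, equal_var=True)
print(t_stat, p_two_sided / 2)  # one-sided p-value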

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis, i.e., the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model which we use for Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are labeled not similar. In order to keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs      CHQs + TF-IDF    CHQs + BM25    CHQs + Word2Vec
Iteration 1   0.9682    0.9709           0.9693         0.9710
Iteration 2   0.9655    0.9731           0.9708         0.9699
Iteration 3   0.9676    0.9742           0.9747         0.9710
Iteration 4   0.9682    0.9692           0.9709         0.9726
Iteration 5   0.9731    0.9775           0.9743         0.9715
Average       0.9685    0.9720           0.9746         0.9712

Table 6.1: The accuracy of the different methods for the RQE task

            Added data number    p-value
TF-IDF      8084                 0.0380
BM25        7390                 0.0499
Word2Vec    1830                 0.0514

Table 6.2: The number of added data and the p-values for the RQE task

Table 6.4 shows the number of added data and the p-values for the NLI task. We find all p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number for each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class does not have much data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 demonstrates the standard deviations of the loss over the four methods.

Combining Table 6.4 with Table 6.5, we find that the more added data there are, the larger the standard deviation is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data into the original data, it is equivalent to adding novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, such a small sample size cannot yield an accurate p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI    MedNLI + TF-IDF    MedNLI + BM25    MedNLI + Word2Vec
Iteration 1   0.7527    0.7601             0.7562           0.7453
Iteration 2   0.7522    0.7426             0.7426           0.7515
Iteration 3   0.7504    0.7469             0.7554           0.7540
Iteration 4   0.7402    0.7593             0.7536           0.7430
Iteration 5   0.7640    0.7657             0.7586           0.7569
Average       0.7523    0.7550             0.7533           0.7501

Table 6.3: The accuracy of the different methods for the NLI task

            Added data number    p-value
TF-IDF      1287                 0.3340
BM25        1149                 0.4191
Word2Vec    459                  0.6661

Table 6.4: The number of added data and the p-values for the NLI task


                      Standard deviation of loss
MedNLI                0.01392
MedNLI + TF-IDF       0.02166
MedNLI + BM25         0.01726
MedNLI + Word2Vec     0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas which future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited amount of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set which can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; within it, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions that are similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth with which to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project we only have 1,950 questions that do not have answers, and based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on small data. So, as a next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous chapter, we noted that the added data for the NLI task differ from the original NLI data. As a next step, we can find a more suitable data set with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. As a next step, we can do hyperparameter optimization; we can also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Descriptionand Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. items.py, middlewares.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature functions and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface³.

load_file.py converts the raw source data to a dataframe; pre_processed_data.py pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

Figure 1: The tree structure of the health24 folder

¹ Scrapy: https://scrapy.org
² ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³ run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py

.py files run in PyCharm; .ipynb files run in Google Colab⁴.

⁴ Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.

Page 24: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

sect24 Dynamic Word Embedding 13

24 Dynamic Word Embedding

Although static word embedding methods are widely used in the NLP area they failto capture the polysemy (such as the word rsquobankrsquo does it mean a financial institutionor the land alongside the river) since they generate the same word vectors regardlessof the contexts Besides these traditional word embedding methods usually usethe shallow models and only leverage off the output from unsupervised models fordownstream tasks whereas the models are usually discarded after training Thiscauses the situation that the word vectors cannot be adjusted based on the contextTo address these issues there is a new trend to go from static word embeddingto dynamic word embedding As for dynamic word embedding it considers theword semantics in different contexts by using deep neural language models [Wangand Kuo 2020] For this section we will involve two start-of-the-art dynamic wordembedding methods one is ELMO and the other one is BERT Before introducingthese two methods pre-training will be briefly mentioned since these two methodsboth use pre-training to generalize the parameters

241 Pre-training

In order to make the word vectors become meaningful and capture the high-levelinformation as much as possible more and more researchers prefer to use the deeparchitectures such as the neural networks with many hidden layers The reason isthat complicated functions could represent high-level abstractions by applying thesedeep architectures [Erhan et al 2010] Also Erhan et al [2010] suggest that usingthe standard training schemes(random initialization) to train the deep architecturestends to make the parameters to have a poor generalization but the improvementis impressive for the deep models by using pre-training strategy Nowadays in thedifferent types of natural language processing tasks language model pre-traininghas advanced the state-of-the-art substantially [Dong et al 2019] In the case of NLPthe basic idea of pre-training is that the language models learn the contextualizedtext representations through a large amount of corpus with labeled or non-labeleddata sets Then the pre-trained model is fine tuned by the downstream tasks with asmall number of labeled data set

242 ELMO

Learning the high-quality word representations is a challenging thing since not onlythe model needs to consider the complex characteristics of the word use but alsothe words have different meanings across different linguistic contexts [Peters et al2018] In 2018 Peters et al [2018] propose a novel way which called Embeddingsfrom Language Models (ELMo) to address the mentioned challenges and generatehigh-quality word representations The experiments show ELMo works extremelywell in practice which can have up to 20 error reductions compared with otherstate-of-the-art

14 Background

Instead of using the fixed embedding vector for each word ELMo considers theentire sentence before assigning the embedding for each word In other words eachtoken is assigned a function of the entire input sentence [Peters et al 2018] ELMois consisted of a two-layer bidirectional language model (biLM) so it is a way toembedding from language models These two layers stacked together and eachof them has two passes ndash forward and backward As for the forward pass theinternal state of the certain word contains the information about itself and the contextbefore this words The backward pass is similar to forward except it contains theinformation about itself and the context after it The pair of forward and backwardinformation produce an intermediate word vector This intermediate word vectorrepresents the wordrsquos mean but it also lsquoknowsrsquo what happens in the rest of thesentence

Figure 26 EMLo is formed by a two-layer bidirectional language model The for-ward and the backward passes form a multilayer LSTM For each LSTM it passesthe result to the intermediate word vector The word representation can be get by

summing all the intermediate word vectors with some particular weights [Josht]

In addition those forward passes and backward passes form a multilayer LSTMas we can see in Figure 26 The intermediate word vectors are produced by the lowerlayers are fed into the higher layer The basic syntactic information is captured in thelower layers while semantic information is captured in the higher layers For each

sect24 Dynamic Word Embedding 15

token tk there are 2L+1 representations in a L-layer biML

Rk = xLMk minusrarrh LM

kj larrminush LM

kj |j = 1 L

As for the final word representation used for downstream NLP tasks ELMo com-bines all layers in R to a single vector based on some particular weight

ELMotaskk = E(Rk Θtask) = γtask

L

sumj=0

staskj hLM

kj

where stask is softmax-normalized weights γtask is scalar parameters and hLMkj =

[minusrarrh LM

kj larrminush LM

kj ]

243 BERT

Currently there are two approaches to apply the pre-trained language representa-tions to the downstream tasks one is feature-based approach and the other oneis fine-tuning approachAs for the feature-based approach the pre-trained repre-sentations are included as the additional features [Peters et al 2018] As for thefine-tuning approach the pre-trained modelrsquos parameters are unfrozen and do fine-tuning on the downstream tasks [Peters et al 2019] Both approaches share the sameobjective functions during pre-training and they have a similar performance ex-cept a general-purpose representation can be adapted to many different downstreamtasks in the fine-tuning approach [Peters et al 2019] Instead of using the unidirec-tional language models Devlin et al [2018] propose BERT stands for BidirectionalEncoder Representations from Transformers to improve the current the fine-tuningbased approach Although ELMo and BERT are both bidirectional LSTM in ELMo isnot deeply bidirectional since it only simply concatenates the left-to-right and right-to-left information and the representation still cannot simultaneously take advantageof both left and right contexts On the contrary transformers in BERT is deeply bidi-rectional since it can compute the attention over the entire sequence for every word(a wordrsquos meaning is defined by its context in the sentence)

BERTrsquos architecture is a multi-layer bidirectional Transformer encoder The de-tailed information about transformer is introduced in Section 213 The only differ-ence is that BERT only takes the encoder part BERT adds [CLS] for every sequence(pair of sentences) and uses [SEP] to separate sentences For each given token itsinput representation is the sum of token embedding segment embedding and theposition embedding

Devlin et al [2018] purpose two unsupervised tasks to do pre-training one ismasked language model (masked LM) and the other one is next sentence prediction(NSP) For masked LM BERT masks some percentage (15) of the input tokensrandomly and then predict these masked tokens Although this could allow us toget the bidirectional pre-trained model a downside exists that a mismatch is createdbetween pre-training and fine-tuning because [MASK] token does not appear in the

16 Background

fine-tuning step To solve this BERT replaces the selected token with [MASK] in80 of time replaces with a random token in 10 of the time and keeps the originaltoken in the other 10 of the time On the other hand BERT pre-trains for a binarizedNSP task in order to help the model to understand the relationship between twosentences The reason is that many downstream tasks such as natural languageinference aim to determine if one sentence can be inferred from others The result ishard to capture from the language modeling directly without knowing the meaningof these two sentences By using NSP the pre-trained model is very beneficial forthese tasks The introduced pre-trained BERT model will be applied for RQE andNLI tasks in this project As for how to fine-tuning we will introduce in the nextfollowing chapters

25 Tasks

251 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questionsexpressing the fact that ldquoa question A entails a question B if every answer to B is alsoa complete or partial answer to Ardquo [Abacha and Demner-Fushman 2016] Or in theplain English it is to determine if question A and question B are similar questionsThis task is binary classification it includes positive (two questions are similar) andnegative (two questions are not similar) Here are two examples from our data set

bull Example 1

bull (No answered) Question A Hi is it safe to take allergex Irsquom weeks pregnantand allergies are killing me my face is swollen pls

bull (Answered) Question B Hi my question is with regards to allergex tablets Ihave allergies and have managed to control them using allergex tablet pur-chased over the counter I am wanting to fall pregnant soon and want to knowthe dangers of using allergex preconceptionally and or during pregnancy

bull Question 1rarr Question 2

bull Example 2

bull (No answered) Question A Should I take my child for a flu shot she is allergicto eggs fish and honey

bull (Answered) Question B I am diabetes 2 Type Which is better Honey or sugar

bull Question 1 does not entail Question 2

RQE is the first step in our project and it uses here can help us to find the similarquestions which related to the unanswered questions Also this step can providesome useful information for the subsequent steps

sect25 Tasks 17

252 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sen-tences not just for questions In the task it is to predict if a hypothesis (H) can beinferred (entailment) not inferred (contradiction) or neither (neutral) from the givenpremise (P) [Romanov and Shivade 2018] This is a multilabel classification (threeclasses) problem one is entailment one is neutral and the other one contradictionHere are three examples from our data set

bull Example 1

bull Question Ever since this morning I had pain in my bladder and my urine isburningIt feels like my bladder is full the whole time and when I go to thetoilet there is nothing coming out

bull Answer It sounds like you may need something more See your doctor tocheck for a bladder infection Best wishes

bull Label entailment

bull Example 2

bull Question I am experiencing hot flushes I donrsquot want to take HRTrsquos I enjoygoing to the gym try to eat healthy and drink a lot of water every day I amvery concerned that by taking Menograine or the like I will gain weight or itwill prevent me from losing or maintaining my weight I do not take any othermedication other than vitamins

bull Answer There are health diets you can follow DietDoc will be able to helpyou Best wishes

bull Label contradiction

bull Example 3

bull Question I have noticed that my friends are using protein shakes and othersupplements to lose weight and getting muscles Do you advise that i useprotein shakes in my weight loss goals

bull Answer Hi there There was a time where any exercise during pregnancywas frowned upon it simply was not done it was considered harmful to themother and the baby This is crucial and once its obtained its advisable tokeep monitoring it with the doctor at all times

bull Label natural

In the context of our project this task used here is to validate if the similar answeredquestionsrsquo answers can be inferred from the unanswered questions in order to help toconstruct high accuracy question-answer pairs for data-driven models in the domainof patient question answering

18 Background

Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and themethods We also explain the architectures and techniques as well as the contribu-tions of them

In Section 31 and 32 we introduce some work that they use RQE and NLIon the medical data set in order to find the relation between the sentences As forSection 33 we demonstrate some work about how to expand and generate data set

31 RQE

MEDIQA 2019 shared task [Abacha et al 2019] is organized at the ACL-BioNLPworkshop Its motivation is to develop relevant methods and techniques in the med-ical domain for inference and entailment Meanwhile the application could help toimprove the domain-specific information retrieval and question answering systemsIn this competition it includes three tasks NLI RQE and question answering (QA)and the target is in the medical domain In this section and the next section wediscuss some work applied in RQE and NLI among the competition teams

The initial work for RQE is to use the lexical and semantic features as the in-put to the supervised machine learning approach Then using support vector ma-chine(SVM) logistic regression naiumlve Bayes and J48 respectively to find the re-sults [Abacha and Demner-Fushman 2016] In the competition the highlight is thatusing the deep networks reach better generalizations and abstractions of the ques-tions The commonly used approaches are to combine the ensemble method and themulti-task language models [Abacha et al 2019] More specifically the teams reliedon the recent state-of-the-art language models such as MT-DNN family ([Bhaskaret al 2019] [Zhu et al 2019]) and BERT family When trying the different tradi-tional deep learning methods Zhu et al [2019] find the result does not perform verywell compared with the pre-trained language model They think one reason is thetraditional deep learning models typically use a fixed pre-trained word embeddingto map the words into vectors and these embeddings do not consider the contextMeanwhile Zhu et al [2019] also find for the pre-trained language model MT-DNNbased model performs better than BERT based model since MT-DNN family is fine-tuned on GLUE benchmark multi-task learning mechanism based on BERT models

19

20 Related Work

So MT-DNN has a better ability to model the pairwise text classification tasks com-pared with the original BERT models on different tasks A mismatch exists in thetraining and validation set in the competition data sets such as the entailing ex-amples have high lexical overlap and the non-entailing examples are completelyunrelated so these cannot be the strong negative examples in the training set Toaddress this issue some synthetic data which are similar to the original data are cre-ated [Kumar et al 2019] In addition Bhaskar et al [2019] also expand the trainingset in order to capture salient biomedical features

32 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLIa new and public data set to reduce the gap for specialized and knowledge-intensivedomains They use feature-based system Bag-of-Words model (BOW) InferSentmodel and ESIM model to introduce the baseline Among them InferSent performsbetter compared with the other three methods In the competition same as RQEtask the common approaches for training are to use MT-DNN and BERT family andthen using an ensemble as the final system to generate the results [Abacha et al2019] On the other hand Wu et al [2019] purpose a hybrid approach for NLI Thebasic model of this approach is consisted of three encoders (syntax encoder text en-coder and feature encoder) with one softmax classifier As for text encoder they useMT-DNN and this allows more powerful and general word representations As forsyntax encoder they use Tree-LSTM to have a better linguistic understanding Forfeature encoder they use domain feature extractor and generic feature extractor toconcatenate their results This model gets significant performance in the testing dataset

33 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medicaldomain since it not only takes money but also spends time To label and generatehigh-quality data relevant domain experts need to be hired In general one expertis not enough since the results sometimes are too subjective that affect the qualityof the data [Romanov and Shivade 2018] There are many existing techniques thatcan be used in order to generate and label the required data sets Snorkel Dry-Bell [Bach et al 2019] is a novel way to label the data set by using weak supervisionmethods The result shows that this method plays a significant role in the industrialdevelopment of machine learning applications The basic idea of this approach isthat users write labeling functions for the unlabeled data set firstly Then severallabeling functions will pass to a generative model in order to estimate the accuracyfor different functions Following that the accuracy will be used to re-weight andthe probabilistic labels will be produced based on the combined labelsOn the otherhand one novel way to generate the data with high-quality in the specific knowledge

sect33 Augment Data Size 21

is proposed by Shen et al [2018] They use an unsupervised key phrase detector anda generator to generate data in the medical domain The results show that the gen-erated data is significantly improved for the examination QA system

22 Related Work

Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing data set Furthermore wedemonstrate the way of data collection We also analyze the collected data set fromdifferent perspectives such as the length and the content

41 Motivation

There are less researches in the area of patient question answering Also the existingmedical queries data sets have some limitations For example the data are too simplethat cannot mimic human behavior in reality or the lexical gap exists between thepatients and the experts To address these gaps and construct high-quality data setwhich can reflect the real-world biomedical patient question answering we think thedata set should have the following features

bull The data source must be reliable since it can set the standard in this domainand it is easier to convince others Also a reliable data set is crucial for ideasand models because it determines whether they are successful

bull The data source must come from the real-world In other words the questionsshould be asked by patients instead of finding or extracting them from relevantprofessional materials Meanwhile the answers should come from certifieddoctors In this way the constructed data can have good quality and they canbe used in the data-driven approaches

bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

4.2 Collecting Data set

To collect the data from the Health24 website, we use Scrapy,2 a fast and powerful collaborative framework for extracting data from websites. There are 22 top-level categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional, whereas some mismatch or misleading content may exist between questions and other users' answers.

4.3 Analysis for Data set

A total of 25,705 question-answer pairs are crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substituted words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (the age is blanked), and we convert "There are no cures but you can manage it with antihistamines and or desentisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together; they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered questions, since the experts' information cannot solve the patients' queries.

2 Scrapy: https://scrapy.org


                          Average Length    Maximum Length
Answered Questions        95.40             541.7
Unanswered Questions      85.02             189.0
Answers                   92.58             644.1

Table 4.1: The length (in words) of questions and answers.


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained as well. We also describe the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use here is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
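
To make this step concrete, here is a minimal sketch of abbreviation handling with ScispaCy's AbbreviationDetector, assuming scispacy v0.2.4 as pinned in Appendix 3 (the project's actual pipeline may differ):

    import spacy
    from scispacy.abbreviation import AbbreviationDetector

    # Load the biomedical NER model named above.
    nlp = spacy.load("en_ner_bionlp13cg_md")

    # The detector links a short form (e.g., "BP") to its long form when
    # both occur in the text, which can then be used for expansion.
    abbreviation_pipe = AbbreviationDetector(nlp)
    nlp.add_pipe(abbreviation_pipe)

    doc = nlp("Blood pressure (BP) was high, and BP medication was started.")
    for abrv in doc._.abbreviations:
        print(abrv, "->", abrv._.long_form)  # BP -> Blood pressure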

When we use TF-IDF, BM25, and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we apply word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
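
As a sketch, the same pipeline could be written with NLTK (one common choice; the project's implementation lives in pre_processed_data, see Appendix 2):

    import string
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # nltk.download("punkt"); nltk.download("stopwords")  # first run only
    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(question):
        text = question.lower()                                           # lowercase
        text = text.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
        tokens = word_tokenize(text)                                      # tokenize
        return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]   # stop words + stemming

    print(preprocess("Is it safe to take Allergex while pregnant?"))
    # ['safe', 'take', 'allergex', 'pregnant']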

5.2 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sequence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems.



The only difference is that one is binary classification and the other is three-label classification; thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, while the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface.1
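
The following is a minimal sketch of this setup with the Hugging Face transformers library, assuming the SciBERT weights sit in a local scibert_scivocab_uncased folder (as in Appendix 3); the actual project code is a modified run_classifier.py:

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    model = BertForSequenceClassification.from_pretrained(
        "scibert_scivocab_uncased", num_labels=2)  # num_labels=3 for NLI
    tokenizer = BertTokenizer.from_pretrained("scibert_scivocab_uncased")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    accum_steps = 4  # with batch size 16, an effective batch of 64

    # One training step on a toy RQE pair; a real run loops over all pairs.
    enc = tokenizer(["Is Allergex safe during pregnancy?"],
                    ["Can I take Allergex while pregnant?"],
                    padding=True, truncation=True, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([1])).loss
    (loss / accum_steps).backward()  # step the optimizer every accum_steps batches
    optimizer.step()
    optimizer.zero_grad()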

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback, which is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round.

1 Huggingface GitHub: https://github.com/huggingface/transformers


Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity, and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: we assume that if an unanswered question and an answered question are similar, then the answer to the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications, respectively. This method helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
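
A small sketch of this rejection rule (the thresholds come from the text above; the function name is ours):

    import torch
    import torch.nn.functional as F

    def accept_label(logits, threshold):
        """Return the predicted label, or None if the pair is rejected."""
        probs = F.softmax(logits, dim=-1)
        conf, label = probs.max(dim=-1)
        return int(label) if conf.item() >= threshold else None

    print(accept_label(torch.tensor([0.2, 0.9]), threshold=0.6))    # 1 (accepted)
    print(accept_label(torch.tensor([0.45, 0.55]), threshold=0.6))  # None (rejected)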

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection of documents (a corpus) [Rajaraman and Ullman, 2011]. It consists of two parts: one part is term frequency, which measures the number of times the word occurs in a particular document.


The other part is inverse document frequency, which measures how much information the word provides: if the result is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF; here are the formulas:

\[
\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)
\]
\[
\mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d))
\]
\[
\mathrm{IDF}(t, D) = \log \frac{N}{\mathrm{count}(d \in D : t \in d)}
\]

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

\[
\mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}
\]

where A and B are both vectors. In this case, each element of A and B represents one word's TF-IDF score.
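
As a sketch of this filtering step with scikit-learn (the project's own implementation is tfidf.py, see Appendix 2; note that scikit-learn's default TF and IDF weighting differs slightly from the formulas above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    answered = ["question about allergex during pregnancy",
                "question about flu shot and egg allergy"]
    unanswered = ["is allergex safe while pregnant"]

    vectorizer = TfidfVectorizer()
    answered_vecs = vectorizer.fit_transform(answered)
    unanswered_vecs = vectorizer.transform(unanswered)

    # One row of similarities per unanswered question; keep the top-k columns.
    sims = cosine_similarity(unanswered_vecs, answered_vecs)
    top_five = sims.argsort(axis=1)[:, ::-1][:, :5]
    print(sims, top_five)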

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query. Here is the formula:

\[
\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}
\]

where

• $D$ is the document and $Q$ is the query; $q_i$ is the $i$-th term in the query $Q$.

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document $D$.

• $|D|$ is the length of the document.

• $d_{avg}$ is the average document length.

• $k_1$ and $b$ are both hyper-parameters; in general, $k_1 = 2$ and $b = 0.75$ (the sketch below uses these defaults).
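
A sketch of this scoring step with the rank_bm25 package (the project's implementation is bm25.py, see Appendix 2); the corpus and query are assumed to be pre-processed token lists as in Section 5.1:

    from rank_bm25 import BM25Okapi

    corpus = [["allergex", "pregnancy", "safe"],
              ["flu", "shot", "egg", "allergy"],
              ["honey", "sugar", "diabetes"]]
    bm25 = BM25Okapi(corpus, k1=2.0, b=0.75)

    query = ["allergex", "safe", "pregnant"]
    scores = bm25.get_scores(query)    # one BM25 score per answered question
    print(scores.argsort()[::-1][:5])  # indices of the top-five questions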

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors were trained on PubMed (2.89B tokens) [Tilbury, 2018].


For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between sentences.
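
A sketch of this sentence-level similarity with gensim, assuming the Chiu et al. [2016] vectors have been downloaded in word2vec binary format (pubmed_w2v.bin is a hypothetical local filename):

    import numpy as np
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

    def sentence_vector(tokens):
        # Sum the vectors of in-vocabulary words; out-of-vocabulary words
        # (typos, colloquialisms) are skipped, which is the weakness
        # discussed in Chapter 6.
        vecs = [wv[t] for t in tokens if t in wv]
        return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(sentence_vector(["blood", "pressure"]),
                 sentence_vector(["hypertension"])))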

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP counts the instances correctly assigned to the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN counts the instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of all relevant results that are correctly classified; its formula is TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same. A short code sketch of these metrics follows Table 5.1.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The binary confusion matrix.
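
As a sketch, these metrics can be computed from toy predictions with scikit-learn:

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 1, 1]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, fp, fn, tn)                  # 3 1 1 1
    print(accuracy_score(y_true, y_pred))  # (TP + TN) / total = 4/6
    print(recall_score(y_true, y_pred))    # TP / (TP + FN) = 3/4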


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of unanswered question and potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labeled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input consists of question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task we implement three different similarity scoring methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This helps us determine the significance of the results in relation to the null hypothesis: if the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, the t-statistic needs to be calculated. Here we use an unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula for the unpaired t-test:

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}
\]

where

\[
s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
\]
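
As a sketch, the same test can be run with scipy on the accuracy columns of Table 6.1 (scipy's ttest_ind implements the pooled-variance unpaired t-test; the exact p-values in Table 6.2 come from the project's own runs):

    from scipy import stats

    chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]        # Table 6.1, CHQs
    chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF

    t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
    # For the one-sided hypothesis "the added data improve accuracy",
    # halve the two-sided p-value when the mean difference is positive.
    print(t_stat, p_two_sided / 2)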

From Table 6.2 we can see that only the p-values for TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model which we use for Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos.



However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of all relevant results that are correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not introduce much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task.

           Added data count   p-value
TF-IDF     8,084              0.0380
BM25       7,390              0.0499
Word2Vec   1,830              0.0514

Table 6.2: The number of added data points and p-values for the RQE task.

Table 6.4 shows the number of added data points and the p-values for the NLI task. We find all the p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task more than 8,000 pairs were added. There are two conditions when we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions.


If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base when selecting the added data. For example, if the results contain 10 pairs with label 0, 5 pairs with label 1, and 15 pairs with label 2, we take 5 as the number of pairs per class to add in the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has fewer examples.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods.

Combining Table 6.4 with Table 6.5, we find that the more data are added, the larger the standard deviation becomes. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287), so adding the extra data to the original data effectively adds novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task.

           Added data count   p-value
TF-IDF     1,287              0.3340
BM25       1,149              0.4191
Word2Vec   459                0.6661

Table 6.4: The number of added data points and p-values for the NLI task.


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to a specific domain such as patient question answering faces many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; within it, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity scoring methods before applying the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model on the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: more specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximum probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task.



For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many areas that could be improved:

• For this project we only have 1,950 questions without answers, and after applying the two conditions in the NLI task the amount of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit with a small data set. So for the next step we can expand our data; it would then be more persuasive to validate our idea.

• In the previous section we noted that the added data for the NLI task differ from the original NLI data. For the next step we could find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step we could optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to limited hardware and time, we did not find the best hyperparameters for these models. For the next step we could do hyperparameter optimization; we could also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchangxing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yulin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchangxing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yulin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: folders which store the source data and the generated data, py files, and ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files used to crawl data from the Health24 website, and result.csv stores the result. In this folder, the framework is provided by Scrapy.1

raw_nli_data and raw_rqe_data store the source data.

result stores the labels generated by the RQE and NLI test models.

For py files:

bm25.py, tfidf.py, and word_embedding.py are the three files that filter the similar question pairs; their results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.2

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model; this code is from run_classifier.py in huggingface.3

load_file.py converts the raw source data to a dataframe.

pre_processed_data pre-processes the sentences.

For ipynb files

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

The py files run in PyCharm; the ipynb files run in Google Colab.4

4 Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: py files and ipynb files. The py files are used to process the data; the ipynb files are used for fine-tuning, evaluating, and generating labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.

Page 25: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

14 Background

Instead of using the fixed embedding vector for each word ELMo considers theentire sentence before assigning the embedding for each word In other words eachtoken is assigned a function of the entire input sentence [Peters et al 2018] ELMois consisted of a two-layer bidirectional language model (biLM) so it is a way toembedding from language models These two layers stacked together and eachof them has two passes ndash forward and backward As for the forward pass theinternal state of the certain word contains the information about itself and the contextbefore this words The backward pass is similar to forward except it contains theinformation about itself and the context after it The pair of forward and backwardinformation produce an intermediate word vector This intermediate word vectorrepresents the wordrsquos mean but it also lsquoknowsrsquo what happens in the rest of thesentence

Figure 26 EMLo is formed by a two-layer bidirectional language model The for-ward and the backward passes form a multilayer LSTM For each LSTM it passesthe result to the intermediate word vector The word representation can be get by

summing all the intermediate word vectors with some particular weights [Josht]

In addition those forward passes and backward passes form a multilayer LSTMas we can see in Figure 26 The intermediate word vectors are produced by the lowerlayers are fed into the higher layer The basic syntactic information is captured in thelower layers while semantic information is captured in the higher layers For each

sect24 Dynamic Word Embedding 15

token tk there are 2L+1 representations in a L-layer biML

Rk = xLMk minusrarrh LM

kj larrminush LM

kj |j = 1 L

As for the final word representation used for downstream NLP tasks ELMo com-bines all layers in R to a single vector based on some particular weight

ELMotaskk = E(Rk Θtask) = γtask

L

sumj=0

staskj hLM

kj

where stask is softmax-normalized weights γtask is scalar parameters and hLMkj =

[minusrarrh LM

kj larrminush LM

kj ]

243 BERT

Currently there are two approaches to apply the pre-trained language representa-tions to the downstream tasks one is feature-based approach and the other oneis fine-tuning approachAs for the feature-based approach the pre-trained repre-sentations are included as the additional features [Peters et al 2018] As for thefine-tuning approach the pre-trained modelrsquos parameters are unfrozen and do fine-tuning on the downstream tasks [Peters et al 2019] Both approaches share the sameobjective functions during pre-training and they have a similar performance ex-cept a general-purpose representation can be adapted to many different downstreamtasks in the fine-tuning approach [Peters et al 2019] Instead of using the unidirec-tional language models Devlin et al [2018] propose BERT stands for BidirectionalEncoder Representations from Transformers to improve the current the fine-tuningbased approach Although ELMo and BERT are both bidirectional LSTM in ELMo isnot deeply bidirectional since it only simply concatenates the left-to-right and right-to-left information and the representation still cannot simultaneously take advantageof both left and right contexts On the contrary transformers in BERT is deeply bidi-rectional since it can compute the attention over the entire sequence for every word(a wordrsquos meaning is defined by its context in the sentence)

BERTrsquos architecture is a multi-layer bidirectional Transformer encoder The de-tailed information about transformer is introduced in Section 213 The only differ-ence is that BERT only takes the encoder part BERT adds [CLS] for every sequence(pair of sentences) and uses [SEP] to separate sentences For each given token itsinput representation is the sum of token embedding segment embedding and theposition embedding

Devlin et al [2018] purpose two unsupervised tasks to do pre-training one ismasked language model (masked LM) and the other one is next sentence prediction(NSP) For masked LM BERT masks some percentage (15) of the input tokensrandomly and then predict these masked tokens Although this could allow us toget the bidirectional pre-trained model a downside exists that a mismatch is createdbetween pre-training and fine-tuning because [MASK] token does not appear in the

16 Background

fine-tuning step To solve this BERT replaces the selected token with [MASK] in80 of time replaces with a random token in 10 of the time and keeps the originaltoken in the other 10 of the time On the other hand BERT pre-trains for a binarizedNSP task in order to help the model to understand the relationship between twosentences The reason is that many downstream tasks such as natural languageinference aim to determine if one sentence can be inferred from others The result ishard to capture from the language modeling directly without knowing the meaningof these two sentences By using NSP the pre-trained model is very beneficial forthese tasks The introduced pre-trained BERT model will be applied for RQE andNLI tasks in this project As for how to fine-tuning we will introduce in the nextfollowing chapters

25 Tasks

251 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questionsexpressing the fact that ldquoa question A entails a question B if every answer to B is alsoa complete or partial answer to Ardquo [Abacha and Demner-Fushman 2016] Or in theplain English it is to determine if question A and question B are similar questionsThis task is binary classification it includes positive (two questions are similar) andnegative (two questions are not similar) Here are two examples from our data set

bull Example 1

bull (No answered) Question A Hi is it safe to take allergex Irsquom weeks pregnantand allergies are killing me my face is swollen pls

bull (Answered) Question B Hi my question is with regards to allergex tablets Ihave allergies and have managed to control them using allergex tablet pur-chased over the counter I am wanting to fall pregnant soon and want to knowthe dangers of using allergex preconceptionally and or during pregnancy

bull Question 1rarr Question 2

bull Example 2

bull (No answered) Question A Should I take my child for a flu shot she is allergicto eggs fish and honey

bull (Answered) Question B I am diabetes 2 Type Which is better Honey or sugar

bull Question 1 does not entail Question 2

RQE is the first step in our project and it uses here can help us to find the similarquestions which related to the unanswered questions Also this step can providesome useful information for the subsequent steps

sect25 Tasks 17

252 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sen-tences not just for questions In the task it is to predict if a hypothesis (H) can beinferred (entailment) not inferred (contradiction) or neither (neutral) from the givenpremise (P) [Romanov and Shivade 2018] This is a multilabel classification (threeclasses) problem one is entailment one is neutral and the other one contradictionHere are three examples from our data set

bull Example 1

bull Question Ever since this morning I had pain in my bladder and my urine isburningIt feels like my bladder is full the whole time and when I go to thetoilet there is nothing coming out

bull Answer It sounds like you may need something more See your doctor tocheck for a bladder infection Best wishes

bull Label entailment

bull Example 2

bull Question I am experiencing hot flushes I donrsquot want to take HRTrsquos I enjoygoing to the gym try to eat healthy and drink a lot of water every day I amvery concerned that by taking Menograine or the like I will gain weight or itwill prevent me from losing or maintaining my weight I do not take any othermedication other than vitamins

bull Answer There are health diets you can follow DietDoc will be able to helpyou Best wishes

bull Label contradiction

bull Example 3

bull Question I have noticed that my friends are using protein shakes and othersupplements to lose weight and getting muscles Do you advise that i useprotein shakes in my weight loss goals

bull Answer Hi there There was a time where any exercise during pregnancywas frowned upon it simply was not done it was considered harmful to themother and the baby This is crucial and once its obtained its advisable tokeep monitoring it with the doctor at all times

bull Label natural

In the context of our project this task used here is to validate if the similar answeredquestionsrsquo answers can be inferred from the unanswered questions in order to help toconstruct high accuracy question-answer pairs for data-driven models in the domainof patient question answering

18 Background

Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and themethods We also explain the architectures and techniques as well as the contribu-tions of them

In Section 31 and 32 we introduce some work that they use RQE and NLIon the medical data set in order to find the relation between the sentences As forSection 33 we demonstrate some work about how to expand and generate data set

31 RQE

MEDIQA 2019 shared task [Abacha et al 2019] is organized at the ACL-BioNLPworkshop Its motivation is to develop relevant methods and techniques in the med-ical domain for inference and entailment Meanwhile the application could help toimprove the domain-specific information retrieval and question answering systemsIn this competition it includes three tasks NLI RQE and question answering (QA)and the target is in the medical domain In this section and the next section wediscuss some work applied in RQE and NLI among the competition teams

The initial work for RQE is to use the lexical and semantic features as the in-put to the supervised machine learning approach Then using support vector ma-chine(SVM) logistic regression naiumlve Bayes and J48 respectively to find the re-sults [Abacha and Demner-Fushman 2016] In the competition the highlight is thatusing the deep networks reach better generalizations and abstractions of the ques-tions The commonly used approaches are to combine the ensemble method and themulti-task language models [Abacha et al 2019] More specifically the teams reliedon the recent state-of-the-art language models such as MT-DNN family ([Bhaskaret al 2019] [Zhu et al 2019]) and BERT family When trying the different tradi-tional deep learning methods Zhu et al [2019] find the result does not perform verywell compared with the pre-trained language model They think one reason is thetraditional deep learning models typically use a fixed pre-trained word embeddingto map the words into vectors and these embeddings do not consider the contextMeanwhile Zhu et al [2019] also find for the pre-trained language model MT-DNNbased model performs better than BERT based model since MT-DNN family is fine-tuned on GLUE benchmark multi-task learning mechanism based on BERT models

19

20 Related Work

So MT-DNN has a better ability to model the pairwise text classification tasks com-pared with the original BERT models on different tasks A mismatch exists in thetraining and validation set in the competition data sets such as the entailing ex-amples have high lexical overlap and the non-entailing examples are completelyunrelated so these cannot be the strong negative examples in the training set Toaddress this issue some synthetic data which are similar to the original data are cre-ated [Kumar et al 2019] In addition Bhaskar et al [2019] also expand the trainingset in order to capture salient biomedical features

32 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLIa new and public data set to reduce the gap for specialized and knowledge-intensivedomains They use feature-based system Bag-of-Words model (BOW) InferSentmodel and ESIM model to introduce the baseline Among them InferSent performsbetter compared with the other three methods In the competition same as RQEtask the common approaches for training are to use MT-DNN and BERT family andthen using an ensemble as the final system to generate the results [Abacha et al2019] On the other hand Wu et al [2019] purpose a hybrid approach for NLI Thebasic model of this approach is consisted of three encoders (syntax encoder text en-coder and feature encoder) with one softmax classifier As for text encoder they useMT-DNN and this allows more powerful and general word representations As forsyntax encoder they use Tree-LSTM to have a better linguistic understanding Forfeature encoder they use domain feature extractor and generic feature extractor toconcatenate their results This model gets significant performance in the testing dataset

33 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medicaldomain since it not only takes money but also spends time To label and generatehigh-quality data relevant domain experts need to be hired In general one expertis not enough since the results sometimes are too subjective that affect the qualityof the data [Romanov and Shivade 2018] There are many existing techniques thatcan be used in order to generate and label the required data sets Snorkel Dry-Bell [Bach et al 2019] is a novel way to label the data set by using weak supervisionmethods The result shows that this method plays a significant role in the industrialdevelopment of machine learning applications The basic idea of this approach isthat users write labeling functions for the unlabeled data set firstly Then severallabeling functions will pass to a generative model in order to estimate the accuracyfor different functions Following that the accuracy will be used to re-weight andthe probabilistic labels will be produced based on the combined labelsOn the otherhand one novel way to generate the data with high-quality in the specific knowledge

sect33 Augment Data Size 21

is proposed by Shen et al [2018] They use an unsupervised key phrase detector anda generator to generate data in the medical domain The results show that the gen-erated data is significantly improved for the examination QA system

22 Related Work

Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing data set Furthermore wedemonstrate the way of data collection We also analyze the collected data set fromdifferent perspectives such as the length and the content

41 Motivation

There are less researches in the area of patient question answering Also the existingmedical queries data sets have some limitations For example the data are too simplethat cannot mimic human behavior in reality or the lexical gap exists between thepatients and the experts To address these gaps and construct high-quality data setwhich can reflect the real-world biomedical patient question answering we think thedata set should have the following features

bull The data source must be reliable since it can set the standard in this domainand it is easier to convince others Also a reliable data set is crucial for ideasand models because it determines whether they are successful

bull The data source must come from the real-world In other words the questionsshould be asked by patients instead of finding or extracting them from relevantprofessional materials Meanwhile the answers should come from certifieddoctors In this way the constructed data can have good quality and they canbe used in the data-driven approaches

bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file the first column is the answer and the second column is the question. For the crawled data we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substitute words may affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (the age is blanked), and we convert "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see that the average lengths of the three different types are close: all are around 85 to 95 words per sentence. However, the maximum lengths differ greatly. When we checked the original website, we found that some users' questions are very detailed, since they provide as much supporting information for their queries as possible; also, the experts answer step by step, corresponding to the patients' questions and telling them what they should and should not do. Thus we keep these informative sentences when we run the experiments. We also find that some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers we treat the corresponding questions as unanswered, since the experts' replies cannot solve the patients' queries.

2. Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions     95.40            5417
Unanswered Questions   85.02            1890
Answers                92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods used to filter similar questions is explained. We also describe the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be written either as abbreviations or as multi-word expressions, but they stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific, or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
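A minimal sketch of how the model is loaded and applied is shown below; the example sentence is made up, and the AbbreviationDetector call follows the scispacy 0.2.x API pinned in the readme (Appendix 3):

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_ner_bionlp13cg_md")   # NER model trained on BIONLP13CG
nlp.add_pipe(AbbreviationDetector(nlp))    # spaCy 2.x-style pipe registration

doc = nlp("Her blood pressure (BP) is stable, but BP should be checked weekly.")
for abrv in doc._.abbreviations:           # short form -> expanded long form
    print(abrv.text, "->", abrv._.long_form.text)
for ent in doc.ents:                       # biomedical entities found by the model
    print(ent.text, ent.label_)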

When we use TF-IDF, BM25, and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase, then remove any punctuation. Following that, we tokenize the words to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to transform the words into their bases or roots. Through these steps the raw questions are transformed into a form more suitable for the following tasks.
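A sketch of this pipeline with NLTK (which the readme installs) might look as follows; the exact implementation in pre_processed_data.py may differ:

import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(question):
    # Lowercase, strip punctuation, tokenize, drop stop words, and stem.
    text = question.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return [STEMMER.stem(tok) for tok in tokens if tok not in STOP_WORDS]

# e.g. preprocess("Is it safe to take Allergex while pregnant?")
# -> ['safe', 'take', 'allergex', 'pregnant']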

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (standing for classification) to the start of each sequence, so for a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together.



Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302, and 230 medical question pairs in the training, validation, and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data we find a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395, and 1,422 sentence pairs respectively for the training, validation, and test sets, and the labels are contradiction, entailment, and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.
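Condensed to its core, the fine-tuning loop with these settings looks roughly like the sketch below; the dummy tensors stand in for the features built by convert_examples_to_features.py, and the actual code in run_classifier.py is more elaborate (this uses a transformers-era API, which may differ in detail from the pinned version):

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, AdamW

BATCH_SIZE, GRAD_ACCUM_STEPS, LR, EPOCHS = 16, 4, 2e-5, 1  # 1 epoch for RQE, 2 for NLI

# Dummy stand-in for the tensors produced by convert_examples_to_features.py.
train_dataset = TensorDataset(
    torch.randint(0, 30000, (64, 128)),      # input ids
    torch.ones(64, 128, dtype=torch.long),   # attention masks
    torch.randint(0, 2, (64,)))              # labels (0/1 for RQE)

model = BertForSequenceClassification.from_pretrained(
    "scibert_scivocab_uncased", num_labels=2)  # num_labels=3 for NLI
optimizer = AdamW(model.parameters(), lr=LR)
loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

model.train()
for _ in range(EPOCHS):
    for step, (input_ids, attention_mask, labels) in enumerate(loader):
        loss = model(input_ids, attention_mask=attention_mask, labels=labels)[0]
        (loss / GRAD_ACCUM_STEPS).backward()   # accumulate gradients over 4 mini-batches
        if (step + 1) % GRAD_ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()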

5.3 Method

To crowdsource high-quality data from Health24 we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether an answered question is similar to an unanswered question in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from the corresponding unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain: under the assumption that the top-ranked documents contain many useful terms that help discriminate relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round.

1. Huggingface GitHub: https://github.com/huggingface/transformers


Through this step the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we take only the top-five question pairs per unanswered question. This step reduces the number of pairs in the test set by filtering out obviously dissimilar pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced, otherwise the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: if an unanswered question and an answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set we set thresholds on the output [Bishop, 2006]: instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us resolve ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise it is accepted. In general the rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
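The rejection rule itself is simple; a minimal sketch:

import torch
import torch.nn.functional as F

def predict_with_rejection(logits, threshold):
    # Return the predicted label, or None (rejected) when the top
    # probability is below the threshold (0.6 for RQE, 0.4 for NLI).
    probs = F.softmax(logits, dim=-1)
    top_prob, top_label = probs.max(dim=-1)
    return [int(l) if p >= threshold else None
            for p, l in zip(top_prob.tolist(), top_label.tolist())]

# e.g. logits for two RQE pairs: one confident, one ambiguous
logits = torch.tensor([[2.0, -1.0], [0.1, 0.0]])
print(predict_with_rejection(logits, threshold=0.6))  # [0, None]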

5.4 Similarity Scoring Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: term frequency measures the number of times the word occurs in a particular document, while inverse document frequency measures how much information the word provides (a result close to 0 means it is a common word, and vice versa). These two concepts form TF-IDF; here are the formulas:


one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = N / count({d ∈ D : t ∈ d})

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case each element of A and B is one word's TF-IDF weight.
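For illustration, the sketch below computes TF-IDF cosine similarities with scikit-learn; note that scikit-learn's TF and IDF smoothing differ slightly from the formulas above, and our actual tfidf.py builds on gensim, so this is an assumption-laden stand-in rather than the project code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["my son was diagnosed with some allergies a few weeks ago",
            "i manage my allergies with allergex tablets",
            "which diet lowers cholesterol"]
unanswered = ["is it safe to take allergex while pregnant"]

vectorizer = TfidfVectorizer()                       # IDF fitted on the answered questions
answered_vecs = vectorizer.fit_transform(answered)
unanswered_vecs = vectorizer.transform(unanswered)

scores = cosine_similarity(unanswered_vecs, answered_vecs)  # one row per unanswered question
top5 = scores.argsort(axis=1)[:, ::-1][:, :5]               # indices of the top-five candidates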

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query. Here is the formula (a small implementation sketch follows the notation list below):

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i) · [ f(q_i, D) · (k1 + 1) ] / [ f(q_i, D) + k1 · (1 − b + b · |D| / d_avg) ]

where

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document in the corpus.

• k1 and b are both hyper-parameters; in general k1 = 2 and b = 0.75.
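A direct implementation of this formula is sketched below; the smoothed IDF variant is an assumption (the classic Okapi choice), and the project's bm25.py may use a different one:

import math
from collections import Counter

def bm25(query_tokens, doc_tokens, corpus, k1=2.0, b=0.75):
    # Okapi BM25 score of one tokenized document against a tokenized query;
    # `corpus` is the list of all tokenized documents.
    N = len(corpus)
    d_avg = sum(len(d) for d in corpus) / N
    tf = Counter(doc_tokens)
    score = 0.0
    for q in query_tokens:
        n_q = sum(1 for d in corpus if q in d)             # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # smoothed Okapi IDF (assumed variant)
        score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(doc_tokens) / d_avg))
    return score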

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain there has been work on providing high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018].


For each sentence we sum the embedding vectors of its words and then use cosine similarity to calculate the similarity between sentences.
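A sketch of this step with gensim is shown below; the embedding file name is a placeholder for the Chiu et al. [2016] PubMed vectors:

import numpy as np
from gensim.models import KeyedVectors

# "pubmed_w2v.bin" is a placeholder path for the downloaded PubMed vectors.
wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of in-vocabulary words; out-of-vocabulary words are
    # skipped, which is exactly the information loss discussed in Chapter 6.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0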

5.5 Evaluation Metrics

For evaluation we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table the rows represent the predicted class and the columns represent the actual class. TP means the instance is correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN means the instance is correctly assigned to the negative class. Accuracy measures how often the classifier is correct, with the formula accuracy = (TP + TN) / total. Recall measures the percentage of relevant results correctly classified, with the formula recall = TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same. (A small scikit-learn example follows Table 5.1.)

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The binary confusion matrix
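These metrics can be checked with scikit-learn, as in the toy example below; note that scikit-learn's confusion_matrix puts the actual classes on the rows and the predicted classes on the columns, i.e., the transpose of Table 5.1:

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

print(confusion_matrix(y_true, y_pred))  # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))    # (TP + TN) / total -> 0.833...
print(recall_score(y_true, y_pred))      # TP / (TP + FN)    -> 0.75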


Figure 5.1: The diagram shows how the RQE and NLI tasks work. The upper pipeline is RQE and the lower one is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar answered question. Step 4, combine the question pairs with the generated labels. Step 5, add the labeled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test inputs are question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity scoring methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. To validate our approach we calculate the p-value, which helps determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use an unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula for the unpaired t-test, followed by a small example computation:

Unpaired t-test:  t = (x̄1 − x̄2) / sqrt( s² · (1/n1 + 1/n2) )

where

s² = [ Σ_{i=1}^{n1} (x_i − x̄1)² + Σ_{j=1}^{n2} (x_j − x̄2)² ] / (n1 + n2 − 2)
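As an illustration, the p-value for one comparison can be computed as below from the accuracies in Table 6.1; whether a one-sided or a two-sided value was reported is our reading, so the result may not exactly reproduce Table 6.2:

from scipy import stats

# Accuracies from Table 6.1: original CHQs vs. CHQs + TF-IDF.
chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# Unpaired (independent, pooled-variance) t-test, as in the formula above;
# halving the two-sided p-value when t > 0 gives the one-sided result.
t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(t_stat, p_one_sided)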

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use with Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the language is all



formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions have no corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority of the labels generated by the RQE model is 0; in other words, most of the question pairs are not similar. To keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of relevant results correctly classified. We find that the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added data pairs and the p-values for the RQE task

Table 6.4 shows the number of added data pairs and the p-values for the NLI task. We find that all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task more than 8,000 pairs were added. There are two conditions when we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their


relevant unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case we use the smallest class count in the test results as the base for selecting the added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number of pairs per class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger this value is. We think the reason is that different kinds of data are combined during training and they bias the model's ability. More precisely, the added data are question-answer pairs, while the original MedNLI data are sentence pairs, so the two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287 pairs), so adding the extra data to the original data is effectively adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware we only split the data into 5 parts for k-fold cross-validation, so we get five accuracy values for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added data pairs and the p-values for the NLI task


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, poses many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; within it there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions that are similar to the unanswered questions. Intuitively, to find whether two questions are similar we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity scoring methods before the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: more specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with BM25 and TF-IDF in the RQE task.



For the NLI task the method is more limited because of the small amount of added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• In this project we only have 1,950 questions without answers, and based on the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model and easily overfits on a small data set. So the next step is to expand our data; then it will be more persuasive to validate our idea.

• In the previous section we noted that the added data for the NLI task differ from the original NLI data. The next step is to find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. The next step is to optimize the way we use Word2Vec.

• Fine-tuning and generating labels were done using Colab. Due to limited hardware and time we did not find the best hyperparameters for these models. The next step is hyperparameter optimization; we can also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D. 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B., Shivade, C., and Demner-Fushman, D. 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., et al. 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I., Lo, K., and Cohan, A. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A., Rungta, R., Route, J., Nyberg, E., and Mitamura, T. 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G., Nie, J.-Y., Gao, J., and Robertson, S. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J. 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)



Chiu, B., Crichton, G., Korhonen, A., and Pyysalo, S. 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M., Zordan, R., Taylor, D. M., Weiland, T. J., Dilley, S. J., Kant, J., Dombagolla, M., Hendarto, A., Lai, F., and Hutton, J. 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209(8):342–347. (cited on page 1)

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. (cited on pages 2, 15, 27, and 28)

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H.-W. 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., and Bengio, S. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660. (cited on pages 2 and 13)

Graves, A., Mohamed, A.-r., and Hinton, G. 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S., Swayamdipta, S., Levy, O., Schwartz, R., Bowman, S. R., and Smith, N. A. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324. (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B., Srinivasan, A., Chaudhary, A., Route, J., Mitamura, T., and Nyberg, E. 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136. (cited on page 20)

Lende, S. P. and Raghuwanshi, M. 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., and Zaremba, W. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206. (cited on page 2)


Mikolov, T., Chen, K., Corrado, G., and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. (cited on pages ix, 10, and 11)

Neumann, M., King, D., Beltagy, I., and Ammar, W. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669. (cited on page 27)

Nguyen, V. 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V., Karimi, S., and Xing, Z. 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A., Raghavan, P., Liang, J., and Peng, J. 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732. (cited on page 1)

Pennington, J., Socher, R., and Manning, C. D. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M., Ruder, S., and Smith, N. A. 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987. (cited on page 15)

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365. (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D. 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. (cited on page 1)

Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., Gatford, M., et al. 1995. Okapi at TREC-3. NIST Special Publication, 109:109. (cited on page 30)

Romanov, A. and Shivade, C. 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752. (cited on pages 1, 17, 20, and 28)


Shen, S., Li, Y., Du, N., Wu, X., Xie, Y., Ge, S., Yang, T., Wang, K., Liang, X., and Fan, W. 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681. (cited on page 21)

Tilbury, K. 2018. Word embeddings for domain specific semantic relatedness. (cited on page 30)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J. 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652. (cited on page 13)

Wu, Z., Song, Y., Huang, S., Tian, Y., and Xia, F. 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W., Zhou, X., Wang, K., Luo, X., Li, X., Ni, Y., and Xie, G. 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchangxing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yulin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchangxing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yulin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the py files; and the ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files that crawl the data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For py files:

bm25.py, tfidf.py, and word_embedding.py are the three files that filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature functions and processors used when fine-tuning and testing the BERT model; this code is from run_classifier.py in huggingface³.

load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.


1. Scrapy: https://scrapy.org
2. ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3. run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

For ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

The py files run in PyCharm; the ipynb files run in Google Colab⁴.

4. Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: py files and ipynb files. The py files are used to process the data. The ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.

51

52 Appendix 3 Readme File

2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files contain the generated labels for the pairs; they can also be found in the result folder.

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 26: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

sect24 Dynamic Word Embedding 15

token tk there are 2L+1 representations in a L-layer biML

Rk = xLMk minusrarrh LM

kj larrminush LM

kj |j = 1 L

As for the final word representation used for downstream NLP tasks ELMo com-bines all layers in R to a single vector based on some particular weight

ELMotaskk = E(Rk Θtask) = γtask

L

sumj=0

staskj hLM

kj

where stask is softmax-normalized weights γtask is scalar parameters and hLMkj =

[minusrarrh LM

kj larrminush LM

kj ]

243 BERT

Currently there are two approaches to apply the pre-trained language representa-tions to the downstream tasks one is feature-based approach and the other oneis fine-tuning approachAs for the feature-based approach the pre-trained repre-sentations are included as the additional features [Peters et al 2018] As for thefine-tuning approach the pre-trained modelrsquos parameters are unfrozen and do fine-tuning on the downstream tasks [Peters et al 2019] Both approaches share the sameobjective functions during pre-training and they have a similar performance ex-cept a general-purpose representation can be adapted to many different downstreamtasks in the fine-tuning approach [Peters et al 2019] Instead of using the unidirec-tional language models Devlin et al [2018] propose BERT stands for BidirectionalEncoder Representations from Transformers to improve the current the fine-tuningbased approach Although ELMo and BERT are both bidirectional LSTM in ELMo isnot deeply bidirectional since it only simply concatenates the left-to-right and right-to-left information and the representation still cannot simultaneously take advantageof both left and right contexts On the contrary transformers in BERT is deeply bidi-rectional since it can compute the attention over the entire sequence for every word(a wordrsquos meaning is defined by its context in the sentence)

BERTrsquos architecture is a multi-layer bidirectional Transformer encoder The de-tailed information about transformer is introduced in Section 213 The only differ-ence is that BERT only takes the encoder part BERT adds [CLS] for every sequence(pair of sentences) and uses [SEP] to separate sentences For each given token itsinput representation is the sum of token embedding segment embedding and theposition embedding

Devlin et al [2018] purpose two unsupervised tasks to do pre-training one ismasked language model (masked LM) and the other one is next sentence prediction(NSP) For masked LM BERT masks some percentage (15) of the input tokensrandomly and then predict these masked tokens Although this could allow us toget the bidirectional pre-trained model a downside exists that a mismatch is createdbetween pre-training and fine-tuning because [MASK] token does not appear in the

16 Background

fine-tuning step To solve this BERT replaces the selected token with [MASK] in80 of time replaces with a random token in 10 of the time and keeps the originaltoken in the other 10 of the time On the other hand BERT pre-trains for a binarizedNSP task in order to help the model to understand the relationship between twosentences The reason is that many downstream tasks such as natural languageinference aim to determine if one sentence can be inferred from others The result ishard to capture from the language modeling directly without knowing the meaningof these two sentences By using NSP the pre-trained model is very beneficial forthese tasks The introduced pre-trained BERT model will be applied for RQE andNLI tasks in this project As for how to fine-tuning we will introduce in the nextfollowing chapters

25 Tasks

251 Recognizing question entailment (RQE)

Recognizing question entailment (RQE) represents a relation between two questionsexpressing the fact that ldquoa question A entails a question B if every answer to B is alsoa complete or partial answer to Ardquo [Abacha and Demner-Fushman 2016] Or in theplain English it is to determine if question A and question B are similar questionsThis task is binary classification it includes positive (two questions are similar) andnegative (two questions are not similar) Here are two examples from our data set

bull Example 1

bull (No answered) Question A Hi is it safe to take allergex Irsquom weeks pregnantand allergies are killing me my face is swollen pls

bull (Answered) Question B Hi my question is with regards to allergex tablets Ihave allergies and have managed to control them using allergex tablet pur-chased over the counter I am wanting to fall pregnant soon and want to knowthe dangers of using allergex preconceptionally and or during pregnancy

bull Question 1rarr Question 2

bull Example 2

bull (No answered) Question A Should I take my child for a flu shot she is allergicto eggs fish and honey

bull (Answered) Question B I am diabetes 2 Type Which is better Honey or sugar

bull Question 1 does not entail Question 2

RQE is the first step in our project and it uses here can help us to find the similarquestions which related to the unanswered questions Also this step can providesome useful information for the subsequent steps

sect25 Tasks 17

252 Natural Language Inference (NLI)

Natural language inference (NLI) is to determine the relationship between two sen-tences not just for questions In the task it is to predict if a hypothesis (H) can beinferred (entailment) not inferred (contradiction) or neither (neutral) from the givenpremise (P) [Romanov and Shivade 2018] This is a multilabel classification (threeclasses) problem one is entailment one is neutral and the other one contradictionHere are three examples from our data set

bull Example 1

bull Question Ever since this morning I had pain in my bladder and my urine isburningIt feels like my bladder is full the whole time and when I go to thetoilet there is nothing coming out

bull Answer It sounds like you may need something more See your doctor tocheck for a bladder infection Best wishes

bull Label entailment

bull Example 2

bull Question I am experiencing hot flushes I donrsquot want to take HRTrsquos I enjoygoing to the gym try to eat healthy and drink a lot of water every day I amvery concerned that by taking Menograine or the like I will gain weight or itwill prevent me from losing or maintaining my weight I do not take any othermedication other than vitamins

bull Answer There are health diets you can follow DietDoc will be able to helpyou Best wishes

bull Label contradiction

bull Example 3

bull Question I have noticed that my friends are using protein shakes and othersupplements to lose weight and getting muscles Do you advise that i useprotein shakes in my weight loss goals

bull Answer Hi there There was a time where any exercise during pregnancywas frowned upon it simply was not done it was considered harmful to themother and the baby This is crucial and once its obtained its advisable tokeep monitoring it with the doctor at all times

bull Label natural

In the context of our project this task used here is to validate if the similar answeredquestionsrsquo answers can be inferred from the unanswered questions in order to help toconstruct high accuracy question-answer pairs for data-driven models in the domainof patient question answering

18 Background

Chapter 3

Related Work

In this chapter we introduce the works which are related to our purpose and themethods We also explain the architectures and techniques as well as the contribu-tions of them

In Section 31 and 32 we introduce some work that they use RQE and NLIon the medical data set in order to find the relation between the sentences As forSection 33 we demonstrate some work about how to expand and generate data set

31 RQE

MEDIQA 2019 shared task [Abacha et al 2019] is organized at the ACL-BioNLPworkshop Its motivation is to develop relevant methods and techniques in the med-ical domain for inference and entailment Meanwhile the application could help toimprove the domain-specific information retrieval and question answering systemsIn this competition it includes three tasks NLI RQE and question answering (QA)and the target is in the medical domain In this section and the next section wediscuss some work applied in RQE and NLI among the competition teams

The initial work for RQE is to use the lexical and semantic features as the in-put to the supervised machine learning approach Then using support vector ma-chine(SVM) logistic regression naiumlve Bayes and J48 respectively to find the re-sults [Abacha and Demner-Fushman 2016] In the competition the highlight is thatusing the deep networks reach better generalizations and abstractions of the ques-tions The commonly used approaches are to combine the ensemble method and themulti-task language models [Abacha et al 2019] More specifically the teams reliedon the recent state-of-the-art language models such as MT-DNN family ([Bhaskaret al 2019] [Zhu et al 2019]) and BERT family When trying the different tradi-tional deep learning methods Zhu et al [2019] find the result does not perform verywell compared with the pre-trained language model They think one reason is thetraditional deep learning models typically use a fixed pre-trained word embeddingto map the words into vectors and these embeddings do not consider the contextMeanwhile Zhu et al [2019] also find for the pre-trained language model MT-DNNbased model performs better than BERT based model since MT-DNN family is fine-tuned on GLUE benchmark multi-task learning mechanism based on BERT models

19

20 Related Work

So MT-DNN has a better ability to model the pairwise text classification tasks com-pared with the original BERT models on different tasks A mismatch exists in thetraining and validation set in the competition data sets such as the entailing ex-amples have high lexical overlap and the non-entailing examples are completelyunrelated so these cannot be the strong negative examples in the training set Toaddress this issue some synthetic data which are similar to the original data are cre-ated [Kumar et al 2019] In addition Bhaskar et al [2019] also expand the trainingset in order to capture salient biomedical features

32 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLIa new and public data set to reduce the gap for specialized and knowledge-intensivedomains They use feature-based system Bag-of-Words model (BOW) InferSentmodel and ESIM model to introduce the baseline Among them InferSent performsbetter compared with the other three methods In the competition same as RQEtask the common approaches for training are to use MT-DNN and BERT family andthen using an ensemble as the final system to generate the results [Abacha et al2019] On the other hand Wu et al [2019] purpose a hybrid approach for NLI Thebasic model of this approach is consisted of three encoders (syntax encoder text en-coder and feature encoder) with one softmax classifier As for text encoder they useMT-DNN and this allows more powerful and general word representations As forsyntax encoder they use Tree-LSTM to have a better linguistic understanding Forfeature encoder they use domain feature extractor and generic feature extractor toconcatenate their results This model gets significant performance in the testing dataset

33 Augment Data Size

Labeling and generating new medical queries is a challenging thing in the medicaldomain since it not only takes money but also spends time To label and generatehigh-quality data relevant domain experts need to be hired In general one expertis not enough since the results sometimes are too subjective that affect the qualityof the data [Romanov and Shivade 2018] There are many existing techniques thatcan be used in order to generate and label the required data sets Snorkel Dry-Bell [Bach et al 2019] is a novel way to label the data set by using weak supervisionmethods The result shows that this method plays a significant role in the industrialdevelopment of machine learning applications The basic idea of this approach isthat users write labeling functions for the unlabeled data set firstly Then severallabeling functions will pass to a generative model in order to estimate the accuracyfor different functions Following that the accuracy will be used to re-weight andthe probabilistic labels will be produced based on the combined labelsOn the otherhand one novel way to generate the data with high-quality in the specific knowledge

sect33 Augment Data Size 21

is proposed by Shen et al [2018] They use an unsupervised key phrase detector anda generator to generate data in the medical domain The results show that the gen-erated data is significantly improved for the examination QA system

22 Related Work

Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing data set Furthermore wedemonstrate the way of data collection We also analyze the collected data set fromdifferent perspectives such as the length and the content

41 Motivation

There are less researches in the area of patient question answering Also the existingmedical queries data sets have some limitations For example the data are too simplethat cannot mimic human behavior in reality or the lexical gap exists between thepatients and the experts To address these gaps and construct high-quality data setwhich can reflect the real-world biomedical patient question answering we think thedata set should have the following features

bull The data source must be reliable since it can set the standard in this domainand it is easier to convince others Also a reliable data set is crucial for ideasand models because it determines whether they are successful

bull The data source must come from the real-world In other words the questionsshould be asked by patients instead of finding or extracting them from relevantprofessional materials Meanwhile the answers should come from certifieddoctors In this way the constructed data can have good quality and they canbe used in the data-driven approaches

bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25705 question-answer pairs are crawled from Health24 website under22 categories Among them there are 23755 questions have answers and 1950questions have not been answered yet The raw data set is stored in resultcsv Inthis file the first column is answer and the second column is question For thecrawled data we remove the sensitive information such as age and date of birththe hyperlinks etc Instead of replacing other words for this sensitive informa-tion we leave the area blank since other words may affect the results Here aresome examples My month old son was diagnosed with some allergies a few weeks agoand convert ldquoThere are no cures but you can manage it with antihistamines and or desen-tisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)to ldquoThere are no cures but you can manage it with antihistamines andor desentisizationrdquoThe sentence length can be found in Table 41 From the table we can see the av-erage length of three different types is close together they are all around 85 to 95words every sentence However the maximum length has a huge difference Whenwe check the original website we find some usersrsquo questions are very detailed sincethey provide the support information for their queries as much as possible Alsothe experts provide answers step by step corresponding to patientsrsquo questions andtelling them what they should do and not to do Thus we keep these informativesentences when we do the experiments We also find some answers are incompletesuch as Good morning or Hi Lucy For these answers we treat their questions asno-answered questions since the expertsrsquo information cannot solve patientsrsquo queries

2Scrapy httpsscrapyorg

sect43 Analysis for Data set 25

Average Length Maximum LengthAnswered Questions 9540 5417Unanswered Questions 8502 1890Answers 9258 6441

Table 41 The length of questions and answers

26 Construction of Health24 Data set

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the pro-posed method for this project In addition the detailed information about fine-tuning and three scoring methods for filtering the similar questions are explained aswell We also mentioned the evaluation metrics

5.1 Pre-processing

Many medical concepts in the sentences can be represented either by abbreviations or by multi-word expressions, yet they stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
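As a concrete sketch, ScispaCy's abbreviation detector can be combined with this model (spaCy 2.x style, matching ScispaCy v0.2.4). Note the detector links a short form to a long form that appears somewhere in the text; the exact expansion logic used in this project is an assumption:

    import spacy
    from scispacy.abbreviation import AbbreviationDetector

    nlp = spacy.load("en_ner_bionlp13cg_md")
    nlp.add_pipe(AbbreviationDetector(nlp))  # spaCy 2.x-style pipeline component

    doc = nlp("Blood pressure (BP) was high and BP was monitored daily.")
    expanded = doc.text
    for abrv in doc._.abbreviations:
        # Replace each detected short form with its long form
        expanded = expanded.replace(str(abrv), str(abrv._.long_form))
    print(expanded)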

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase. Then we remove any punctuation. Following that, we apply word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
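A minimal sketch of this pipeline with NLTK follows (the project's own implementation lives in pre_processed_data.py and may differ in details):

    import string
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # One-off setup: nltk.download("punkt") and nltk.download("stopwords")
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(question):
        text = question.lower()                                           # lowercase
        text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
        tokens = word_tokenize(text)                                      # tokenize
        tokens = [t for t in tokens if t not in stop_words]               # remove stop words
        return [stemmer.stem(t) for t in tokens]                          # stem to roots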

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sentence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both



classification problems. The only difference is that one is a binary classification and the other is a three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation step to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.

¹ Huggingface GitHub: https://github.com/huggingface/transformers
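Putting these settings together, a condensed sketch of the fine-tuning loop is shown below, using the huggingface transformers API (version 4.x style). The checkpoint name, toy data and device handling are simplifying assumptions relative to the project's modified run_classifier.py:

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    name = "allenai/scibert_scivocab_uncased"   # SciBERT checkpoint on the HF hub
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertForSequenceClassification.from_pretrained(name, num_labels=2)  # 3 for NLI
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    pairs = [("question A ...", "question B ...", 1)]  # toy stand-in for CHQs pairs
    grad_accum = 4                                     # batches of 16, accumulated 4 times
    model.train()
    for step, (text_a, text_b, label) in enumerate(pairs):
        enc = tokenizer(text_a, text_b, truncation=True, return_tensors="pt")
        loss = model(**enc, labels=torch.tensor([label])).loss  # loss on the [CLS] head
        (loss / grad_accum).backward()                          # gradient accumulation
        if (step + 1) % grad_accum == 0:
            optimizer.step()
            optimizer.zero_grad()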

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which help to discriminate the relevant documents from irrelevant ones. Pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance



can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set, built by pairing the unanswered questions with the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not all of them are similar. To reduce the running time as well as the time complexity and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs. The detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning labels by maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us resolve ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
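The rejection rule itself is only a few lines; a sketch, assuming the fine-tuned model returns one row of logits per pair:

    import torch

    def label_or_reject(logits, threshold):
        # threshold: 0.6 for the binary RQE model, 0.4 for the 3-label NLI model
        probs = torch.softmax(logits, dim=-1)
        p_max, label = probs.max(dim=-1)
        if p_max.item() < threshold:
            return None               # ambiguous: reject the pair
        return int(label.item())     # confident: keep the predicted label

For instance, binary logits of (0.1, 0.3) give probabilities of roughly (0.45, 0.55), so such a pair would be rejected at the 0.6 RQE threshold.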

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is the term frequency, which measures the number of times the word occurs in a particular document. The other


one is the inverse document frequency, which measures how much information the word provides: if the result is close to 0, it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = N / count(d ∈ D : t ∈ d)

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.
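In practice, both steps can be done with scikit-learn; a sketch (note that scikit-learn's default TF-IDF weighting differs slightly from the formula above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    answered = ["pre-processed answered question one",
                "pre-processed answered question two"]   # toy corpus
    unanswered = "pre-processed unanswered question"

    vectorizer = TfidfVectorizer()
    answered_vecs = vectorizer.fit_transform(answered)
    query_vec = vectorizer.transform([unanswered])
    scores = cosine_similarity(query_vec, answered_vecs)[0]
    top5 = scores.argsort()[::-1][:5]   # indices of the top-five candidate pairs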

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i, D) × [ f(q_i, D) × (k1 + 1) ] / [ f(q_i, D) + k1 × (1 − b + b × |D| / d_avg) ]

where

• D is the document and Q is the query; q_i is the i-th term in the query Q

• f(q_i, D) is the number of times q_i occurs in the document D

• |D| is the length of the document

• d_avg is the average length of a document

• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75 (a short implementation sketch follows this list)
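Below is a direct transcription of the formula into Python. Since this section does not spell out the IDF variant used inside BM25, the sketch uses the common smoothed Okapi IDF, which is an assumption:

    import math
    from collections import Counter

    def bm25(query, document, corpus, k1=2.0, b=0.75):
        """query and document are token lists; corpus is a list of token lists."""
        N = len(corpus)
        d_avg = sum(len(d) for d in corpus) / N
        tf = Counter(document)
        score = 0.0
        for q in set(query):
            df = sum(1 for d in corpus if q in d)            # document frequency of q
            if df == 0:
                continue
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # smoothed Okapi IDF
            f = tf[q]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(document) / d_avg))
        return score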

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury,


2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
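A sketch with gensim, assuming the pre-trained vectors are available locally in word2vec binary format (the file name is hypothetical):

    import numpy as np
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("pubmed_200d.bin", binary=True)

    def sentence_vector(tokens):
        # Sum the vectors of in-vocabulary words; out-of-vocabulary words
        # (e.g. colloquialisms or typos) are simply skipped
        vecs = [wv[t] for t in tokens if t in wv]
        return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))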

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances aligned to the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances aligned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). As for multi-class classification, the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix
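Both metrics can be read off the confusion matrix; a small sketch with scikit-learn for the binary case:

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 0, 0, 1, 1]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(accuracy_score(y_true, y_pred))   # (tp + tn) / total = 3/5
    print(recall_score(y_true, y_pred))     # tp / (tp + fn) = 2/3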


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model; step 2, use CHQs data to evaluate the model by k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar question; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This method helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

Unpaired t-test = (x̄1 − x̄2) / √( s² × (1/n1 + 1/n2) )

where

s² = [ Σ_{i=1}^{n1} (x_i − x̄1)² + Σ_{j=1}^{n2} (x_j − x̄2)² ] / (n1 + n2 − 2)
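This is the standard independent two-sample t-test, so it can be cross-checked with scipy. The sketch below uses the CHQs and CHQs + TF-IDF accuracies from Table 6.1; note that scipy reports a two-sided p-value by default, so the result may differ from Table 6.2 depending on how the p-values in the thesis were computed:

    from scipy import stats

    chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]        # baseline accuracies
    chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # with added TF-IDF data
    t, p = stats.ttest_ind(chqs, chqs_tfidf)               # unpaired t-test
    print(t, p)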

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value of Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model we use with Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all



formal, with fewer typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch exists between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added data and the p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data exceed 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferable from their


relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class of NLI does not have much data.

Those are the reasons why the added data for the NLI task are small. Table 6.5 shows the standard deviations of the loss over the four methods.

Combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, this small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added data and the p-values for the NLI task


                      Standard deviation of loss
MedNLI                0.01392
MedNLI + TF-IDF       0.02166
MedNLI + BM25         0.01726
MedNLI + Word2Vec     0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude the project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and



TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, the added data still play a role.

7.2 Future Work

Because of the limited time of the project, there are many places to improve:

• For this project, we only have 1,950 questions without answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on a small data set. So, as a next step, we could expand our data; it would then be more persuasive to validate our idea.

• In the previous chapter, we noted that the added data for the NLI task differ from the original NLI data. As a next step, we could find more suitable data of a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could do hyperparameter optimization. Also, we could increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. items.py, middlewares.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website, and result.csv stores the result. In this folder, the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface³.

load_file.py converts the raw source data to a dataframe; pre_processed_data.py pre-processes the sentences.

For .ipynb files:

¹ Scrapy: https://scrapy.org
² ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³ run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab⁴.

⁴ Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25705 question-answer pairs are crawled from Health24 website under22 categories Among them there are 23755 questions have answers and 1950questions have not been answered yet The raw data set is stored in resultcsv Inthis file the first column is answer and the second column is question For thecrawled data we remove the sensitive information such as age and date of birththe hyperlinks etc Instead of replacing other words for this sensitive informa-tion we leave the area blank since other words may affect the results Here aresome examples My month old son was diagnosed with some allergies a few weeks agoand convert ldquoThere are no cures but you can manage it with antihistamines and or desen-tisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)to ldquoThere are no cures but you can manage it with antihistamines andor desentisizationrdquoThe sentence length can be found in Table 41 From the table we can see the av-erage length of three different types is close together they are all around 85 to 95words every sentence However the maximum length has a huge difference Whenwe check the original website we find some usersrsquo questions are very detailed sincethey provide the support information for their queries as much as possible Alsothe experts provide answers step by step corresponding to patientsrsquo questions andtelling them what they should do and not to do Thus we keep these informativesentences when we do the experiments We also find some answers are incompletesuch as Good morning or Hi Lucy For these answers we treat their questions asno-answered questions since the expertsrsquo information cannot solve patientsrsquo queries

2Scrapy httpsscrapyorg

sect43 Analysis for Data set 25

Average Length Maximum LengthAnswered Questions 9540 5417Unanswered Questions 8502 1890Answers 9258 6441

Table 41 The length of questions and answers

26 Construction of Health24 Data set

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the pro-posed method for this project In addition the detailed information about fine-tuning and three scoring methods for filtering the similar questions are explained aswell We also mentioned the evaluation metrics

51 Pre-processing

There are many medical concepts in the sentences the forms they can representby either abbreviations or multi-word expressions but they all stand for the samemeaning To expand the abbreviations and make the sentences to have informa-tion as much as possible(eg expand from BP to blood pressure) we use ScispaCy[Neumann et al 2019] an industrial-strength natural language processing approachfor processing biomedical scientific or clinical text The model here we use isen_ner_bionlp13cg_md since it is trained on the BIONLP13CG corpus

When we use TF-IDF BM25 and Word2Vec to calculate the similarity scores be-tween the questions we also do pre-processing for the original questions We firstlyconvert all the words into lowercase Then we remove any punctuation Follow-ing that we use the word tokenization to tokenize the words in order to split thesentences into individual units We also remove the stop words in the sentencesBesides we apply stemming for the words in order to transform the words into theirbases or roots Through these steps the raw questions can be transformed into amore suitable one for the following tasks

52 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model since we onlyneed to plug in the input and output into BERT and fine-tune all the parametersend-to-end [Devlin et al 2018] BERT prepends a [CLS] stands for classification tothe start of each sentence So when doing the classification problem we only needto feed the [CLS] representation into any arbitrary classifier to get the results insteadof summing all the wordsrsquo representations together Besides RQE and NLI are both

27

28 Methodology

classification problems The only difference is one is the binary classification andthe other one is the three-label classification Thus they can share one pre-trainedmodel

When fine-tuning the model for RQE task the data set comprises consumer healthquestions (CHQs) received by the US National Library of medicine (NLM) and Fre-quently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019] There arerespectively 8890 302 and 230 medical question pairs in training validation andtest sets created by Abacha and Demner-Fushman [2016] and the labels are true andfalse After analyzing these data we find there is a mismatch between training andvalidation sets for example the entailing examples in the training set are simpleand the non-entailing examples are completely unrelated However the sentences inthe validation set are complex To reduce the gap and help the model with bettergeneralization we combine all the data sets to do fine-tuning

The data set used for NLI task is called MedNLI which is derived from MIMIC-III [Romanov and Shivade 2018] This set includes 11232 1395 and 1422 questionpairs respectively for training validation and test sets and the labels are contradic-tion entailment and neutral Same as the previous one we combine all the sets andcross-validation is used as well Besides CHQs and MedNLI are the balanced datasets so the model will not bias to the majority class

For here we use SciBERT a pre-trained model for scientific text since the modelimproves the performance on a range of NLP tasks in the scientific domain [Belt-agy et al 2019] When fine-tuning the pre-trained model we take the suggestionsfrom Devlin et al [2018] that setting the Batch size as 16 gradient accumulation stepas 4 and the learning rate as 2e-5 for both two tasks Besides we set epoch for 1and 2 respectively for RQE and NLI tasks The fine-tuning code is modified basedon run_classifierpy which is provided by huggingface 1

53 Method

To crowdsource high-quality data in Health 24 we use RQE and NLI tasks Generallyspeaking our goal is to find the existing answers for the unanswered questions inthe crawled data set We use the fine-tuned RQE model to label if the answeredquestions are similar to the unanswered questions in Health24 After that from thesimilar question pairs (answered questions and unanswered questions) generatedby RQE we use NLI to validate if the similar questionsrsquo answers can be inferredfrom their unanswered questions Since there is no ground truth for both tasks weuse the idea of Pseudo-relevance feedback In general Pseudo-relevance feedback isfrequently used in the information retrieval domain Under the assumption that thetop-ranked documents contain many useful terms which can help to discriminate therelevant documents from irrelevant ones Pseudo-relevance feedback is to formulatenew queries for the second round retrieval by extracting expansion terms from thetop-ranked documents in the first round Through this step the overall performance

1Huggingface GitHub httpsgithubcomhuggingfacetransformers

sect54 Similarity Scores Methods 29

can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes or not. One intuitive way to build the test set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity, and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25, and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferred from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning labels by the maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us to resolve some ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, this pair will be rejected; otherwise, we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the absence of experts, the rejected data are removed from the data set.
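As an illustration of this rejection rule (the function name is ours, not from the project code), the following sketch keeps only the pairs whose maximum class probability clears the threshold:

import torch
import torch.nn.functional as F

def label_with_rejection(logits: torch.Tensor, threshold: float):
    """Return predicted labels and an acceptance mask.

    logits: tensor of shape [num_pairs, num_classes] from the fine-tuned model.
    threshold: 0.6 for the binary RQE task, 0.4 for the three-label NLI task.
    """
    probs = F.softmax(logits, dim=-1)
    max_prob, labels = probs.max(dim=-1)
    accepted = max_prob >= threshold  # pairs below the threshold are rejected
    return labels, accepted

# Toy example: two RQE pairs, one confident and one ambiguous.
logits = torch.tensor([[2.0, 0.1], [0.1, 0.2]])
labels, accepted = label_with_rejection(logits, threshold=0.6)
print(labels.tolist(), accepted.tolist())  # [0, 1] [True, False]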

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is term frequency, which measures the number of times the word occurs in a particular document; the other is inverse document frequency, which calculates how much information the word provides. If the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = N / count(d ∈ D : t ∈ d)

where t is the word we want to calculate, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF.
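As an illustration (not the project's tfidf.py), the TF-IDF filtering step can be sketched with scikit-learn as follows, using toy questions:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-ins for the pre-processed Health24 questions.
answered = [
    "what can i do about my pollen allergy",
    "how do i lower my cholesterol level",
    "is burning urine a sign of a bladder infection",
]
unanswered = ["my urine burns could it be an infection"]

vectorizer = TfidfVectorizer()
answered_vecs = vectorizer.fit_transform(answered)   # one TF-IDF vector per question
unanswered_vecs = vectorizer.transform(unanswered)

# Cosine similarity between every unanswered and answered question.
sims = cosine_similarity(unanswered_vecs, answered_vecs)  # shape: [1, 3]

# Keep the top-5 most similar answered questions per unanswered question
# (here only 3 candidates exist, so all are ranked).
top_k = sims.argsort(axis=1)[:, ::-1][:, :5]
print(top_k[0])  # indices of the best candidate pairs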

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find the similar documents for a particular given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i, D) · [ f(q_i, D) · (k1 + 1) ] / [ f(q_i, D) + k1 · (1 − b + b · |D| / d_avg) ]

where:

• D is the document and Q is the query; q_i is a term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75. (A short implementation sketch follows this list.)
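Since the thesis does not spell out the IDF variant used for BM25, the following minimal sketch assumes the common Okapi IDF; the function and variable names are ours, not the project's bm25.py:

import math

def bm25_scores(query_tokens, docs_tokens, k1=2.0, b=0.75):
    """Score every tokenized document against a tokenized query with Okapi BM25."""
    N = len(docs_tokens)
    d_avg = sum(len(d) for d in docs_tokens) / N
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        score = 0.0
        for t in query_tokens:
            f = doc.count(t)  # f(q_i, D)
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)  # Okapi IDF
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
        scores.append(score)
    return scores

docs = [["burning", "urine", "infection"], ["pollen", "allergy", "symptoms"]]
print(bm25_scores(["urine", "burning"], docs))  # the first document scores higher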

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this embedding model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
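As a minimal sketch of this step with gensim, assuming a local copy of the Chiu et al. [2016] vectors saved as pubmed_w2v.bin (the file name is our assumption); out-of-vocabulary words, the mismatch discussed in Chapter 6, are simply skipped:

import numpy as np
from gensim.models import KeyedVectors

# Assumed local copy of the pre-trained PubMed word vectors.
wv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(tokens):
    """Sum the word vectors of a tokenized sentence, skipping unknown words."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(sentence_vector(["burning", "urine"]),
             sentence_vector(["bladder", "infection"]))
print(sim)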

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances aligned to the positive class; FP means the true class is 0 and the predicted class is 1; FN is the opposite of FP, where the true class is 1 and the predicted class is 0; TN means the instances aligned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of the total relevant results correctly classified; its formula is recall = TP / (TP + FN). As for multi-category classification, the confusion matrix is more complex, but it has the same idea.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix.
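Both metrics follow directly from the four counts in Table 5.1; a small illustrative sketch:

def accuracy(tp, fp, fn, tn):
    # Accuracy = (TP + TN) / total
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    # Recall = TP / (TP + FN)
    return tp / (tp + fn)

# Toy counts matching the layout of Table 5.1.
tp, fp, fn, tn = 90, 10, 5, 95
print(accuracy(tp, fp, fn, tn))  # 0.925
print(recall(tp, fn))            # ~0.947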


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: Step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model by using k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potential similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the question pairs with labels to CHQs and evaluate the model again, then compare the accuracy with the previous one. As for NLI, it has the same process, except the input of the testing is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result of the original data. In order to validate our approach, we calculate the p-value. This method helps us to determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis will be rejected and the alternative hypothesis will be accepted, since the value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-test needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = (x̄1 − x̄2) / √( s² · (1/n1 + 1/n2) )

where

s² = [ Σ_{i=1}^{n1} (x_i − x̄1)² + Σ_{j=1}^{n2} (x_j − x̄2)² ] / (n1 + n2 − 2)
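In practice, this pooled-variance test is what scipy.stats.ttest_ind computes when equal_var=True; here is a sketch using the CHQs and CHQs + TF-IDF columns of Table 6.1. Note that SciPy returns a two-sided p-value by default, while a one-sided value is half of it.

from scipy.stats import ttest_ind

# Five cross-validation accuracies: original CHQs vs. CHQs + TF-IDF (Table 6.1).
original  = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
augmented = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# equal_var=True is the pooled-variance (unpaired) t-test defined above.
t_stat, p_value = ttest_ind(augmented, original, equal_var=True)
print(t_stat, p_value)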

From Table 6.2, we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis, i.e. the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring methods perspective: the pre-trained embedding model which we use for Word2Vec was created by training on PubMed. The source in PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with fewer typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There exists a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors will lack important information, because the words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find that the majority label the RQE model generates is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected. That is the reason why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of the total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not have much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods for the RQE task.

           Added data number   p-value
TF-IDF     8,084               0.0380
BM25       7,390               0.0499
Word2Vec   1,830               0.0514

Table 6.2: The added data numbers and p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values of the NLI task. We find that all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number is more than 8,000. There are two conditions for how we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider this case.

• Second, the added data must be balanced; otherwise, they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting the added data (a balancing sketch follows this list). For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number of each class to generate the added data in the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have many data.
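A minimal sketch of this balancing step with pandas (the function and column names are our assumptions, not the project's code):

import pandas as pd

def balance_to_smallest_class(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=0)))

# Toy example mirroring the 10 / 5 / 15 case above: 5 pairs are kept per label.
pairs = pd.DataFrame({"label": [0] * 10 + [1] * 5 + [2] * 15})
print(balance_to_smallest_class(pairs)["label"].value_counts())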

Those are the reasons why the added data for the NLI task are small.

Table 6.5 demonstrates the standard deviations of the loss over the four methods. Combining it with Table 6.4, we find that the more added data there are, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data into the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, this small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods for the NLI task.

           Added data number   p-value
TF-IDF     1,287               0.3340
BM25       1,149               0.4191
Word2Vec   459                 0.6661

Table 6.4: The added data numbers and p-values for the NLI task.


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter, I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions (answered questions) for the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step helps us to roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data and train the model again: if the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on a small data set. So, for the next step, we could expand our data; it would then be more persuasive to validate our idea.

• As discussed in the previous section, the added data for the NLI task differ from the original NLI data. For the next step, we could find more suitable data with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we could optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we could do hyperparameter optimization. Also, we could increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the .py files; and the .ipynb files.

For the folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy.¹

raw_nli_data and raw_rqe_data store the source data. result stores the labels generated by the RQE and NLI test models.

For the .py files:

bm25.py, tfidf.py, and word_embedding.py are the three files used to filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.²

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and the processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.³

load_file.py converts the raw source data to dataframes. pre_processed_data.py pre-processes the sentences.

For the .ipynb files:

¹ Scrapy: https://scrapy.org
² ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³ run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is used to fine-tune the BERT model and do k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab.⁴

⁴ Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the models:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

For the downstream tasks it is easy to fine-tune the pre-trained model since we onlyneed to plug in the input and output into BERT and fine-tune all the parametersend-to-end [Devlin et al 2018] BERT prepends a [CLS] stands for classification tothe start of each sentence So when doing the classification problem we only needto feed the [CLS] representation into any arbitrary classifier to get the results insteadof summing all the wordsrsquo representations together Besides RQE and NLI are both

27

28 Methodology

classification problems The only difference is one is the binary classification andthe other one is the three-label classification Thus they can share one pre-trainedmodel

When fine-tuning the model for RQE task the data set comprises consumer healthquestions (CHQs) received by the US National Library of medicine (NLM) and Fre-quently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019] There arerespectively 8890 302 and 230 medical question pairs in training validation andtest sets created by Abacha and Demner-Fushman [2016] and the labels are true andfalse After analyzing these data we find there is a mismatch between training andvalidation sets for example the entailing examples in the training set are simpleand the non-entailing examples are completely unrelated However the sentences inthe validation set are complex To reduce the gap and help the model with bettergeneralization we combine all the data sets to do fine-tuning

The data set used for NLI task is called MedNLI which is derived from MIMIC-III [Romanov and Shivade 2018] This set includes 11232 1395 and 1422 questionpairs respectively for training validation and test sets and the labels are contradic-tion entailment and neutral Same as the previous one we combine all the sets andcross-validation is used as well Besides CHQs and MedNLI are the balanced datasets so the model will not bias to the majority class

For here we use SciBERT a pre-trained model for scientific text since the modelimproves the performance on a range of NLP tasks in the scientific domain [Belt-agy et al 2019] When fine-tuning the pre-trained model we take the suggestionsfrom Devlin et al [2018] that setting the Batch size as 16 gradient accumulation stepas 4 and the learning rate as 2e-5 for both two tasks Besides we set epoch for 1and 2 respectively for RQE and NLI tasks The fine-tuning code is modified basedon run_classifierpy which is provided by huggingface 1

53 Method

To crowdsource high-quality data in Health 24 we use RQE and NLI tasks Generallyspeaking our goal is to find the existing answers for the unanswered questions inthe crawled data set We use the fine-tuned RQE model to label if the answeredquestions are similar to the unanswered questions in Health24 After that from thesimilar question pairs (answered questions and unanswered questions) generatedby RQE we use NLI to validate if the similar questionsrsquo answers can be inferredfrom their unanswered questions Since there is no ground truth for both tasks weuse the idea of Pseudo-relevance feedback In general Pseudo-relevance feedback isfrequently used in the information retrieval domain Under the assumption that thetop-ranked documents contain many useful terms which can help to discriminate therelevant documents from irrelevant ones Pseudo-relevance feedback is to formulatenew queries for the second round retrieval by extracting expansion terms from thetop-ranked documents in the first round Through this step the overall performance

1Huggingface GitHub httpsgithubcomhuggingfacetransformers

sect54 Similarity Scores Methods 29

can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 51 For RQE partWe use CHQs data to fine-tune the pre-trained BERT model Then we use the samedata to evaluate the model by using k-fold cross-validation and record the accu-racy We use the fine-tuned model to generate labels for the test set by insertingthe unanswered questions and the potential similar answered questions After thatwe add the generated question pairs with labels into the original CHQs data set tore-evaluate the model and see if the accuracy changes or not One intuitive way tobuild the testing set is to pair one unanswered question with all answered questionsIf there are 10 unanswered questions and 100 answered questions the question pairswill be 1000 However not each of them is similar To reduce the running time aswell as time complexity and improve efficiency we use three similarity scoring meth-ods (TF-IDF BM25 and Word2Vec) to calculate the similarity between questions Weonly take the top-five questions pairs This step can reduce the number of pairs in thetest set by filtering some obviously dissimilar questions pairs The detailed informa-tion is introduced in Section 54 Besides one thing that needs to be noted is that theadded data must be balanced Otherwise the model will bias toward the majorityclass and the result will not be accuracy As for NLI task the steps are similar toRQE task Here is an assumption We assume if the unanswered question and theanswered question are similar then the answer of the answered question has someprobability that can be inferred from the unanswered question Otherwise it cannotbe inferred from the unanswered question Thus the test set in NLI task is formedby the unanswered questions and their similar answered questionsrsquo answers

When using the fine-tuned model to generate labels for the test sets not everypair is allocated to a correct label To reduce noisy data in the training data setwe set the thresholds in the output [Bishop 2006] Instead of setting labels for themaximum probabilities we set 06 and 04 as the thresholds for binary (RQE) andthree-label (NLI) classifications respectively This method can help us to solve someambiguous cases to reduce the misclassification rate If the maximum probability foreach pair is lower than the threshold this pair will be rejected Otherwise we willaccept this pair In general the rejected data will be classified by experts Howeverdue to the limited time of the project and no experts the rejected data are removedfrom the data set

54 Similarity Scores Methods

541 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency it is a statisticalmeasure that reflects how important a word is to a document in a collection of cor-pus [Rajaraman and Ullman 2011] It consists of two parts one is term frequencymeasures the number of times the word occurs in a particular document The other

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TFminus IDF(t d D) = TF(t d)times IDF(t D)

TF(t d) = log(1 + f req(t d))

IDF(t D) =N

count(d isin D t isin d)

where t is the word we want to calculate d is its document D is the sets of doc-uments and N is the total number of the documents TF-IDF can only be used tomeasure one word if we want to calculate the similarity between two sentences wecan use cosine similarity

similarity(A B) =A middot B|A| times |B|

where A and B are both vectors In this case each element in vector A and Brepresent one wordrsquos TF-IDF

542 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find the similardocuments for a particular given document or query Here is the formula

BM25(D Q) =n

sumi=1

IDF(qi D)f (qi D) middot (k1 + 1)

f (qi) + k1 middot (1minus b + b middot |D|davg)

where

bull D is the document and Q is the query qi is the term in the query Q

bull f (qi D) is the number of times qi occurs in the document D

bull |D| is the length of the document

bull davg is the average length of a document

bull k1 and b both are hyper-parameters In general k1 = 2 b = 075

543 Word2Vec

We introduce Word2Vec in section 231 which is a way of static word embeddingIn the medical domain there has been work to provide high-quality domain specificembeddings For here we use the pre-trained embedding which created by Chiuet al [2016] to convert the words into numeric representations In this module 22million 200-dimensional word vectors are trained on PubMed (289 tokens) [Tilbury

sect55 Evaluation Metrics 31

2018] For each sentence we add each word embedding vector together and thenuse cosine similarity to calculate the similarity between the sentences

55 Evaluation Metrics

For evaluation measure we use the accuracy recall and confusion matrix to mea-sure the performance of the models Confusion matrix is a table to describe theperformance of the classification model on the test data set which the true valuesare known Here we only show the binary confusion matrix since it is easier to un-derstand (Table 51) In the table the row represents the predicated class and thecolumn represents the actual class TP means the instances align to the positive classFP means the true class is 0 and the predicated class is 1 FN is opposite to FP whichthe true class is 1 and the predicated class is 0 TN means the instances align tothe negative class Accuracy is to measure how often is the classifier correct and itsformula is accuracy = TP + TNtotal While recall is to measure the percentage oftotal relevant results correctly classified Its formula is TPTP + FN As for multi-categories classification the confusion matrix is more complex but it has the sameidea

Actual Positive (1) Actual Negative (0)Predicated Positive (1) TP FPPredicated Negative (0) FN TN

Table 51 Table for the confusion matrix

32 Methodology

Figure 51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 use CHQs data tofine-tune BERT model Step 2 use CHQs data to evaluate the model by using k-foldcross-validation and record each accuracy Step 3 use fine-tuned BERT model togenerate label for unanswered question and potential similar question Step 4 pairquestion pairs with generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracy with the previousone As for NLI it has the same process except the input of testing is question-answer

pairs

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

where

s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
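In practice the statistic and its p-value can be computed directly with SciPy; the sketch below feeds in the five per-iteration accuracies from Table 6.1, under our reading that each column is treated as an independent sample:

from scipy import stats

# Five-fold accuracies for the original CHQs data and for CHQs + TF-IDF (Table 6.1)
chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# Unpaired (independent, equal-variance) t-test, matching the formula above
t_stat, p_value = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)

# The hypothesis is one-sided (added data *improves* accuracy),
# so halve the two-sided p-value when the t-statistic is positive
one_sided_p = p_value / 2 if t_stat > 0 else 1 - p_value / 2
print(t_stat, one_sided_p)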

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In those cases we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring methods perspective: the pre-trained embedding model which we use for Word2Vec was trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected. That is why the added data for Word2Vec is the smallest. Furthermore, the small amount of added data might not help the model fine-tune as well as a larger amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not introduce much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9730          0.9720        0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task the added data numbered more than 8000. There are two conditions we apply when selecting added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case we use the smallest class count in the test results as the base for selecting the added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number of each class when generating the added data for the NLI task (a sketch of this downsampling follows the list). On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.
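A minimal sketch of that balancing step (the function and toy labels are illustrative; the project's actual selection lives in the data-generator scripts):

import random
from collections import defaultdict

def balance_by_smallest_class(pairs, seed=0):
    # pairs: list of (premise, hypothesis, label) produced by the fine-tuned model
    by_label = defaultdict(list)
    for p in pairs:
        by_label[p[2]].append(p)
    n = min(len(v) for v in by_label.values())  # e.g. min(10, 5, 15) -> 5
    rng = random.Random(seed)
    balanced = []
    for label, items in by_label.items():
        balanced.extend(rng.sample(items, n))   # keep n items per class
    return balanced

example = [("q", "a", 0)] * 10 + [("q", "a", 1)] * 5 + [("q", "a", 2)] * 15
print(len(balance_by_smallest_class(example)))  # -> 15 (5 per class)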

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining Table 6.4 with it, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287 pairs). So when we add the extra data to the original data, it effectively adds novel data, or noise, for the model. The estimator's ability is affected, causing higher variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. This small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7519   0.7549            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The added data numbers and p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas for future work to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, faces many challenges: the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set there are 23771 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with BM25 and TF-IDF in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.
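As an illustration of these rejection thresholds, here is a sketch of how such a filter operates on softmax outputs (the probabilities and function name are invented for the example):

def filter_by_threshold(probabilities, threshold):
    # probabilities: per-class softmax outputs for one example
    # Accept the argmax label only if its probability clears the threshold,
    # otherwise reject the example (0.6 for RQE, 0.4 for NLI in this project)
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best if probabilities[best] >= threshold else None

print(filter_by_threshold([0.55, 0.45], 0.6))       # -> None (rejected RQE pair)
print(filter_by_threshold([0.2, 0.45, 0.35], 0.4))  # -> 1 (accepted NLI label)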

7.2 Future Work

Because of the limited time for the project, there are several places to improve:

• For this project we only have 1950 questions that do not have answers. After the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model: it easily overfits on a small data set. So, as a next step, we could expand our data; it would then be more persuasive to validate our idea.

• In the previous section we argued that the added data for the NLI task differ from the original NLI data. As a next step, we could find more suitable data with a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could do hyperparameter optimization; we could also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices

Appendix

Appendix 1 Project Description and Contract

From: Zhenchang Xing <zhenchangxing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>; Jingjing Shi <u5792801@anu.edu.au>; Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yulin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchangxing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yulin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website, and result.csv stores the result. In this folder the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files to filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and the processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface [3].

load_file.py converts the raw source data to a dataframe; pre_processed_data pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab [4].

Figure 1: The tree structure of the health24 folder

Footnotes:
[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
[4] Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

For this project there are two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels. The .ipynb files run in Colab.

Preparation:

pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.


Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file


Chapter 3

Related Work

In this chapter we introduce the works related to our purpose and methods. We also explain their architectures and techniques, as well as their contributions.

In Sections 3.1 and 3.2 we introduce work that uses RQE and NLI on medical data sets in order to find the relations between sentences. In Section 3.3 we demonstrate work on expanding and generating data sets.

3.1 RQE

The MEDIQA 2019 shared task [Abacha et al., 2019] is organized at the ACL-BioNLP workshop. Its motivation is to develop relevant methods and techniques in the medical domain for inference and entailment. Meanwhile, these applications could help to improve domain-specific information retrieval and question answering systems. The competition includes three tasks, NLI, RQE and question answering (QA), and the target is the medical domain. In this section and the next, we discuss some work applied to RQE and NLI among the competition teams.

The initial work for RQE uses lexical and semantic features as the input to supervised machine learning approaches, then applies support vector machines (SVM), logistic regression, naïve Bayes and J48 respectively to find the results [Abacha and Demner-Fushman, 2016]. In the competition, the highlight is that using deep networks reaches better generalizations and abstractions of the questions. The commonly used approaches combine ensemble methods with multi-task language models [Abacha et al., 2019]. More specifically, the teams relied on recent state-of-the-art language models such as the MT-DNN family ([Bhaskar et al., 2019], [Zhu et al., 2019]) and the BERT family. When trying different traditional deep learning methods, Zhu et al. [2019] find the results do not perform very well compared with pre-trained language models. They think one reason is that traditional deep learning models typically use a fixed pre-trained word embedding to map the words into vectors, and these embeddings do not consider context. Meanwhile, Zhu et al. [2019] also find that, among the pre-trained language models, MT-DNN-based models perform better than BERT-based models, since the MT-DNN family is fine-tuned on the GLUE benchmark with a multi-task learning mechanism built on BERT models.


So MT-DNN has a better ability to model pairwise text classification tasks compared with the original BERT models on different tasks. A mismatch exists between the training and validation sets in the competition data sets: for example, the entailing examples have high lexical overlap and the non-entailing examples are completely unrelated, so these cannot provide strong negative examples in the training set. To address this issue, some synthetic data similar to the original data are created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work for NLI occurs when Romanov and Shivade [2018] introduce MedNLI, a new, public data set that reduces the gap for specialized and knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, an InferSent model and an ESIM model to establish the baseline. Among them, InferSent performs better than the other three methods. In the competition, as in the RQE task, the common approach is to train models from the MT-DNN and BERT families and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. The basic model consists of three encoders (syntax encoder, text encoder and feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which allows more powerful and general word representations. For the syntax encoder they use Tree-LSTM to gain a better linguistic understanding. For the feature encoder they use a domain feature extractor and a generic feature extractor and concatenate their results. This model achieves significant performance on the testing data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it takes both money and time. To label and generate high-quality data, relevant domain experts need to be hired. In general, one expert is not enough, since the results can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision methods. The results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of this approach is that users first write labeling functions for the unlabeled data set. Then several labeling functions are passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracy is used to re-weight, and probabilistic labels are produced based on the combined labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al. [2018]. They use an unsupervised key-phrase detector and a generator to generate data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate how the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations. For example, the data are too simple to mimic human behavior in reality, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set which can reflect real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and makes it easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients instead of being found or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality and can be used in data-driven approaches.

• The size of the data set should be large. As we mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as address, date of birth, email address or phone number. Otherwise it would violate patients' privacy.

Based on the features described above, we find Health24¹ is a good choice for constructing the data set. This is South Africa's leading consumer health website.

¹ Health24: https://www.health24.com


It provides world-class information for a healthy lifestyle. Besides, there is an interactive tool for patients to help them answer their questions. These questions and answers are both public, and they are continuously updated. Hence, we use these questions and answers to construct our data set.

4.2 Collecting Data Set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 big categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories. The reason is that the experts' answers are more reliable and professional, whereas mismatches or misleading content may exist between questions and other users' answers.

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv. In this file, the first column is the answer and the second column is the question. For the crawled data, we remove sensitive information such as age and date of birth, hyperlinks, etc. Instead of replacing this sensitive information with other words, we leave the area blank, since substituted words may affect the results. Here are some examples: "My  month old son was diagnosed with some allergies a few weeks ago" (the age has been blanked), and we convert "There are no cures but you can manage it with antihistamines and or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table we can see the average lengths of the three different types are close together; they are all around 85 to 95 words per sentence. However, the maximum lengths differ hugely. When we check the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step corresponding to patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences when we do the experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy". For these answers, we treat their questions as unanswered, since the experts' information cannot solve the patients' queries.

² Scrapy: https://scrapy.org


                         Average Length    Maximum Length
Answered Questions            95.40             5417
Unanswered Questions          85.02             1890
Answers                       92.58             6441

Table 4.1: The lengths of questions and answers.


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is explained as well. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing approach for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between questions, we also pre-process the original questions. We first convert all words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
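
As a concrete illustration, the following sketch implements these steps with NLTK (which the artefact installs; see Appendix 3). The function name preprocess and the choice of the Porter stemmer are illustrative assumptions rather than the report's exact code.

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt', quiet=True)      # tokenizer models
nltk.download('stopwords', quiet=True)  # English stop-word list

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(question: str) -> list:
    """Lowercase, strip punctuation, tokenize, drop stop words, then stem."""
    text = question.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = word_tokenize(text)
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("My son was diagnosed with some allergies a few weeks ago."))
# -> ['son', 'diagnos', 'allergi', 'week', 'ago']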

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sentence. So when doing a classification problem, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both


classification problems. The only difference is that one is a binary classification and the other is a three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the US National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As before, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 for the RQE and NLI tasks respectively. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.
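
The actual fine-tuning is done through the modified run_classifier.py; purely as an illustration of the settings above, here is a minimal sketch using the current transformers (v4) API. The toy question pairs and the MODEL_DIR path are assumptions, not the real training data.

import torch
from torch.utils.data import DataLoader
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_DIR = "scibert_scivocab_uncased"  # the unzipped SciBERT folder (see Appendix 3)
tokenizer = BertTokenizer.from_pretrained(MODEL_DIR)
model = BertForSequenceClassification.from_pretrained(MODEL_DIR, num_labels=2)  # 3 for NLI

# Two toy question pairs standing in for the CHQs training set (label 1 = entails).
pairs = [("what causes a fever", "why do I have a high temperature", 1),
         ("what causes a fever", "how do I treat acne", 0)]
enc = tokenizer([p[0] for p in pairs], [p[1] for p in pairs],
                padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([p[2] for p in pairs])
loader = DataLoader(list(zip(enc["input_ids"], enc["attention_mask"],
                             enc["token_type_ids"], labels)),
                    batch_size=16)                          # batch size from the report

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate from the report
ACCUM_STEPS = 4                                             # gradient accumulation from the report
model.train()
for epoch in range(1):                                      # 1 epoch for RQE, 2 for NLI
    for step, (ids, mask, types, y) in enumerate(loader):
        out = model(input_ids=ids, attention_mask=mask, token_type_ids=types, labels=y)
        (out.loss / ACCUM_STEPS).backward()                 # accumulate scaled gradients
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
optimizer.step()          # flush any remaining accumulated gradients
optimizer.zero_grad()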

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether answered questions are similar to unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from the corresponding unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain, under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones. It formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance

¹ Huggingface GitHub: https://github.com/huggingface/transformers


can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this approach to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the testing set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time and time complexity and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions and only take the top five question pairs. This step reduces the number of pairs in the test set by filtering some obviously dissimilar question pairs; detailed information is given in Section 5.4. Besides, one thing to note is that the added data must be balanced; otherwise the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar, under one assumption: we assume that if an unanswered question and an answered question are similar, then the answer to the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus, the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with the maximum probability, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us resolve ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
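
A minimal sketch of this rejection rule, assuming one row of logits per question pair from the fine-tuned model (the helper name label_with_rejection is illustrative):

from typing import Optional
import numpy as np

def label_with_rejection(logits: np.ndarray, threshold: float) -> Optional[int]:
    """Return the predicted label, or None to reject an ambiguous pair.

    threshold is 0.6 for the binary RQE task and 0.4 for the three-label
    NLI task (the report's values); logits is one row of model outputs.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over the classes
    best = int(probs.argmax())
    return best if probs[best] >= threshold else None

print(label_with_rejection(np.array([0.2, 0.1]), 0.6))   # None (max prob ~0.52)
print(label_with_rejection(np.array([2.0, -1.0]), 0.6))  # 0    (max prob ~0.95)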

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: one is the term frequency, which measures the number of times the word occurs in a particular document. The other


one is the inverse document frequency, which measures how much information the word provides. If the result is close to 0, the word is a common one, and vice versa. These two concepts together form TF-IDF; here are the formulas:

\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)

\mathrm{TF}(t, d) = \log(1 + \mathrm{freq}(t, d))

\mathrm{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. (The logarithm in the IDF is what sends the score of a very common word towards 0.) TF-IDF only weights a single word; to calculate the similarity between two sentences we can use cosine similarity:

\mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}

where A and B are both vectors. In this case, each element of A and B is one word's TF-IDF weight.
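
To make this concrete, here is a small sketch using scikit-learn; note that sklearn's TfidfVectorizer uses smoothed TF and IDF variants, so the exact weights differ slightly from the formulas above, while the ranking idea is the same. The example questions are made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["what causes high blood pressure",
            "how can I manage my allergies"]
unanswered = ["why is my blood pressure so high"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(answered + unanswered)

# Cosine similarity between the unanswered question and every answered one;
# only the top-scoring candidates are kept as potential RQE pairs.
scores = cosine_similarity(tfidf[-1], tfidf[:-1])
print(scores)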

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i, D) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}

where:

• D is the document and Q is the query; q_i is the i-th of the n terms in Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_{avg} is the average length of a document.

• k_1 and b are both hyper-parameters. In general, k_1 = 2 and b = 0.75.
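
A minimal sketch of this scoring function on tokenized questions follows; since the report does not spell out its IDF variant, the common smoothed Robertson form is assumed here.

import math

def bm25(query_terms, doc_terms, corpus, k1=2.0, b=0.75):
    """Score one tokenized document against a query with the formula above."""
    n_docs = len(corpus)
    d_avg = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for q in set(query_terms):
        df = sum(1 for d in corpus if q in d)      # how many documents contain q
        if df == 0:
            continue
        # Assumed smoothed Robertson-Sparck Jones IDF.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        f = doc_terms.count(q)                     # term frequency in this document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / d_avg))
    return score

corpus = [["blood", "pressure", "high"], ["allergy", "treatment", "antihistamine"]]
print(bm25(["high", "blood", "pressure"], corpus[0], corpus))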

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89B tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between sentences.
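
A small sketch of this summed-vector approach with gensim is shown below; the embedding file name pubmed_w2v_200d.bin is a hypothetical placeholder for the Chiu et al. [2016] vectors.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical file name standing in for the pre-trained 200-dimensional
# biomedical vectors of Chiu et al. [2016].
wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    """Sum the vectors of the in-vocabulary tokens (the report's approach)."""
    vecs = [wv[t] for t in tokens if t in wv]  # OOV words are silently dropped,
    if not vecs:                               # which loses information (see Chapter 6)
        return np.zeros(wv.vector_size)
    return np.sum(vecs, axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

s1 = sentence_vector(["blood", "pressure", "high"])
s2 = sentence_vector(["hypertension", "symptom"])
print(cosine(s1, s2))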

5.5 Evaluation Metrics

For evaluation we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances belong to the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 but the predicted class is 0. TN means the instances belong to the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN) / total. Recall measures the percentage of all relevant results correctly classified; its formula is TP / (TP + FN). For multi-category classification the confusion matrix is more complex, but it has the same idea.

                          Actual Positive (1)    Actual Negative (0)
Predicted Positive (1)            TP                     FP
Predicted Negative (0)            FN                     TN

Table 5.1: The confusion matrix.
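
For example, both metrics can be read directly off the four cells of the table; the counts below are toy values, chosen so that recall lands near the 0.9785 reported for CHQs in Chapter 6.

def accuracy_and_recall(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / total and recall = TP / (TP + FN), as above."""
    total = tp + fp + fn + tn
    return (tp + tn) / total, tp / (tp + fn)

acc, rec = accuracy_and_recall(tp=90, fp=5, fn=2, tn=93)
print(f"accuracy={acc:.4f}, recall={rec:.4f}")  # accuracy=0.9632, recall=0.9783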


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model; step 2, use CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potentially similar question; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the input of the testing step is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula for the unpaired t-test:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}

where

s^2 = \frac{\sum_{i=1}^{n_1} (x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}
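
Under the assumption that the five per-iteration accuracies in Table 6.1 are the samples, this test can be reproduced with SciPy (version 1.6 or later for the one-sided alternative); the resulting p-value may differ slightly from Table 6.2 depending on the exact variant used.

from scipy.stats import ttest_ind

# Per-iteration accuracies from Table 6.1 (CHQs alone vs. CHQs + TF-IDF pairs).
chqs  = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

# Unpaired t-test with pooled variance; the one-sided alternative matches the
# hypothesis that the added data *improve* accuracy.
t, p = ttest_ind(tfidf, chqs, equal_var=True, alternative='greater')
print(f"t = {t:.3f}, one-sided p = {p:.4f}")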

From Table 6.2 we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis: the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the Word2Vec p-value is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model which we use for Word2Vec is trained on PubMed. The source in PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all


formal, with fewer typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority label the RQE model generated is 0; in other words, most of the question pairs are labeled not similar. In order to keep the data balanced, only a small part of them is selected. That is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.

On the other hand, recall measures the percentage of all relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs     CHQs+TF-IDF   CHQs+BM25   CHQs+Word2Vec
Iteration 1  0.9682      0.9709        0.9693        0.9710
Iteration 2  0.9655      0.9731        0.9708        0.9699
Iteration 3  0.9676      0.9742        0.9747        0.9710
Iteration 4  0.9682      0.9692        0.9709        0.9726
Iteration 5  0.9731      0.9775        0.9743        0.9715
Average      0.9685      0.9720        0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

           Added data number   p-value
TF-IDF           8084           0.0380
BM25             7390           0.0499
Word2Vec         1830           0.0514

Table 6.2: The added data numbers and p-values for the RQE task.

Table 6.4 shows the added data numbers and the p-values for the NLI task. We find all the values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number exceeds 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number for each class when generating added data for the NLI task. Moreover, the NLI task is a three-label classification problem: compared with a binary classification problem, each class for NLI has fewer data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining Table 6.4 with this, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, such a small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI+TF-IDF   MedNLI+BM25   MedNLI+Word2Vec
Iteration 1  0.7527      0.7601          0.7562          0.7453
Iteration 2  0.7522      0.7426          0.7426          0.7515
Iteration 3  0.7504      0.7469          0.7554          0.7540
Iteration 4  0.7402      0.7593          0.7536          0.7430
Iteration 5  0.7640      0.7657          0.7586          0.7569
Average      0.7523      0.7550          0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task.

           Added data number   p-value
TF-IDF           1287           0.3340
BM25             1149           0.4191
Word2Vec          459           0.6661

Table 6.4: The added data numbers and p-values for the NLI task.


                    Standard deviation of loss
MedNLI                       0.01392
MedNLI+TF-IDF                0.02166
MedNLI+BM25                  0.01726
MedNLI+Word2Vec              0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter, I conclude the project and raise some ideas that future work can pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set that can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers among the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task. This step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find the proposed method can construct high-quality question pairs for the BM25 and


TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role in the NLI task.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model; it easily overfits on a small amount of data. So, for the next step, we can expand our data. Then it will be more persuasive to validate our idea.

• In the previous section we noted that the added data for the NLI task differ from the original NLI data. For the next step, we can find more suitable data of a type similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices



Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au] Sent: Sunday, 28 July 2019 1:40 PM To: Zhenchang Xing Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote: Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au


Appendix 2: Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface³.

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 31: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

20 Related Work

MT-DNN therefore models pairwise text classification tasks better than the original BERT models across different tasks. A mismatch exists between the training and validation sets of the competition data: the entailing examples have high lexical overlap, while the non-entailing examples are completely unrelated, so the latter cannot serve as strong negative examples in the training set. To address this issue, synthetic data similar to the original data are created [Kumar et al., 2019]. In addition, Bhaskar et al. [2019] also expand the training set in order to capture salient biomedical features.

3.2 NLI

The initial work on NLI in this area dates to Romanov and Shivade [2018], who introduce MedNLI, a new public data set that reduces the gap for specialized, knowledge-intensive domains. They use a feature-based system, a Bag-of-Words (BOW) model, an InferSent model and an ESIM model to establish baselines; among these, InferSent performs best. In the competition, as with the RQE task, the common approach is to train MT-DNN and BERT-family models and then use an ensemble as the final system to generate the results [Abacha et al., 2019]. On the other hand, Wu et al. [2019] propose a hybrid approach for NLI. Its basic model consists of three encoders (a syntax encoder, a text encoder and a feature encoder) with one softmax classifier. For the text encoder they use MT-DNN, which gives more powerful and general word representations; for the syntax encoder they use Tree-LSTM for a better linguistic understanding; for the feature encoder they concatenate the outputs of a domain feature extractor and a generic feature extractor. This model achieves significant performance on the test data set.

3.3 Augment Data Size

Labeling and generating new medical queries is challenging in the medical domain, since it costs both money and time. To label and generate high-quality data, relevant domain experts need to be hired; in general, one expert is not enough, because results from a single annotator can be too subjective, which affects the quality of the data [Romanov and Shivade, 2018]. Many existing techniques can be used to generate and label the required data sets. Snorkel DryBell [Bach et al., 2019] is a novel way to label data sets using weak supervision, and the results show that this method plays a significant role in the industrial development of machine learning applications. The basic idea of the approach is that users first write labeling functions for the unlabeled data set; several labeling functions are then passed to a generative model in order to estimate the accuracy of the different functions. Following that, the accuracies are used to re-weight and combine the labels, producing probabilistic labels. On the other hand, a novel way to generate high-quality data in a specific knowledge domain is proposed by Shen et al. [2018]. They use an unsupervised key-phrase detector and a generator to produce data in the medical domain. The results show that the generated data significantly improve an examination QA system.


Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing the data set. Furthermore, we demonstrate how the data were collected, and we analyze the collected data set from different perspectives, such as its length and content.

4.1 Motivation

There is little research in the area of patient question answering. Also, the existing medical-query data sets have some limitations: for example, the data are too simple to mimic real human behavior, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since a reliable source sets the standard in this domain and makes the work more convincing. A reliable data set is also crucial for ideas and models, because it determines whether they succeed.

• The data source must come from the real world. In other words, the questions should be asked by patients, rather than found or extracted from professional materials, and the answers should come from certified doctors. In this way the constructed data have good quality and can be used in data-driven approaches.

• The data set should be large. As mentioned before, if the data set is moderate or small, the model is prone to overfitting, which hurts accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses or phone numbers. Otherwise it violates patients' privacy.

Based on the features described above, we find that Health24¹ is a good choice for construction. It is South Africa's leading consumer health website, and it provides

1 Health24: https://www.health24.com


world-class information for a healthy lifestyle. Besides, there is an interactive tool that helps patients get their questions answered. These questions and answers are both public, and they are continuously updated. Hence we use them to construct our data set.

4.2 Collecting the Data set

To collect the data from the Health24 website, we use Scrapy², a fast and powerful collaborative framework for extracting data from websites. There are 22 broad categories on the website, such as Allergies, Burning urine, Cholesterol Level, Depression, etc. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatched or misleading content may exist between questions and other users' answers.
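A minimal sketch of such a spider is shown below. The entry URL and CSS selectors (div.qa-item, .question, .expert-answer, a.next-page) are illustrative placeholders rather than Health24's actual page structure; only the overall Scrapy pattern matches our health.py crawler.

import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ["https://www.health24.com/Experts"]  # hypothetical entry page

    def parse(self, response):
        # one item per question/expert-answer pair on the page
        for qa in response.css("div.qa-item"):
            yield {
                "question": qa.css(".question::text").get(),
                "answer": qa.css(".expert-answer::text").get(),
            }
        # follow pagination links until a category is exhausted
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)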

4.3 Analysis of the Data set

A total of 25,705 question–answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. For the crawled data we remove sensitive information (such as age and date of birth), hyperlinks, etc. Instead of substituting other words for the sensitive information, we leave the span blank, since substitute words may affect the results. For example, an age is blanked out in "My [ ] month old son was diagnosed with some allergies a few weeks ago", and we convert "There are no cures but you can manage it with antihistamines and or desentisization" followed by a hyperlink into "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths are summarized in Table 4.1. The average lengths of the three types are close together, all around 85 to 95 words per sentence; the maximum lengths, however, differ hugely. Checking the original website, we find some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts provide answers step by step, corresponding to the patients' questions and telling them what they should and should not do. Thus we keep these informative sentences in our experiments. We also find some answers are incomplete, such as "Good morning" or "Hi Lucy"; for these, we treat the questions as unanswered, since the experts' text cannot resolve the patients' queries.

2 Scrapy: https://scrapy.org


                       Average Length   Maximum Length
Answered Questions          95.40            5417
Unanswered Questions        85.02            1890
Answers                     92.58            6441

Table 4.1: The length of questions and answers (in words)


Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, we explain the details of fine-tuning and the three scoring methods used to filter similar questions. We also present the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be written either as abbreviations or as multi-word expressions, yet they stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g., expanding BP to blood pressure), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.
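The sketch below shows how such a pipeline can be assembled with ScispaCy's AbbreviationDetector component; the example sentence is invented, and the spaCy 2.x add_pipe style (current at the time of this project) is assumed:

import spacy
from scispacy.abbreviation import AbbreviationDetector

nlp = spacy.load("en_ner_bionlp13cg_md")
nlp.add_pipe(AbbreviationDetector(nlp))  # spaCy 2.x pipe registration

doc = nlp("The patient has high blood pressure (BP). BP medication was prescribed.")
for abrv in doc._.abbreviations:
    # each detected short form is linked to its long form
    print(abrv.text, "->", abrv._.long_form)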

When we use TF-IDF, BM25 and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase, then remove punctuation. Following that, we tokenize the words in order to split the sentences into individual units. We also remove stop words, and we apply stemming to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
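A minimal sketch of this pipeline with NLTK is given below (it assumes the punkt and stopwords resources have already been downloaded):

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(sentence):
    # lowercase and strip punctuation
    sentence = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(sentence)                     # split into individual units
    tokens = [t for t in tokens if t not in stop_words]  # drop stop words
    return [stemmer.stem(t) for t in tokens]             # reduce words to their roots

print(preprocess("My son was diagnosed with some allergies a few weeks ago."))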

5.2 Fine-Tuning

For downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (which stands for classification) to the start of each sequence, so for a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the word representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus they can share one pre-trained model.
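The sketch below shows how a sentence pair is packed into BERT's input format; the local model path and the example questions are assumptions, and the encode_plus API of the transformers 2.x era is used:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("scibert_scivocab_uncased")

# a pair is packed as [CLS] sentence A [SEP] sentence B [SEP];
# the classifier head later reads only the hidden state of [CLS]
enc = tokenizer.encode_plus(
    "Can blood pressure medication cause a cough?",  # question 1 / premise
    "Does BP medication make you cough?")            # question 2 / hypothesis
print(enc["input_ids"])
print(enc["token_type_ids"])  # 0 for sentence A, 1 for sentence B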

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. It includes 11,232, 1,395 and 1,422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we follow the suggestions of Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we train for 1 and 2 epochs respectively for the RQE and NLI tasks. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface¹.
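A sketch of this training setup is given below. Here train_loader is an assumed DataLoader yielding tokenized batches that include labels, and the loop mirrors the hyperparameters above rather than reproducing run_classifier.py exactly:

import torch
from transformers import BertForSequenceClassification, AdamW

# num_labels = 2 for RQE (true/false), 3 for NLI (three labels)
model = BertForSequenceClassification.from_pretrained(
    "scibert_scivocab_uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

accumulation_steps = 4  # with batch size 16, effective batch size is 64
model.train()
for epoch in range(1):  # 1 epoch for RQE, 2 for NLI
    for step, batch in enumerate(train_loader):
        loss = model(**batch)[0] / accumulation_steps  # loss is the first output
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()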

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether an answered question is similar to an unanswered question in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar question's answer can be inferred from the unanswered question. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information-retrieval domain: under the assumption that the top-ranked documents contain many useful terms that help discriminate relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance can be improved, since relevant documents missed in the first round can be retrieved in the second [Cao et al., 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

1 Huggingface GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions paired with potentially similar answered questions. After that, we add the generated question pairs with their labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair each unanswered question with all answered questions; for example, 10 unanswered questions and 100 answered questions would yield 1,000 question pairs. However, not every pair is plausibly similar. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions and take only the top-five question pairs. This step reduces the number of pairs in the test set by filtering out obviously dissimilar pairs; the details are introduced in Section 5.4. Besides, one thing to note is that the added data must be balanced, otherwise the model will be biased toward the majority class and the results will not be accurate. As for the NLI task, the steps are similar, under one assumption: if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus the test set in the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set, we set thresholds on the output [Bishop, 2006]. Instead of always assigning the label with the maximum probability, we use 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
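The rejection rule can be sketched as follows; the toy logits stand in for the fine-tuned model's raw output for one pair:

import torch
import torch.nn.functional as F

def label_with_threshold(logits, threshold):
    """Return the predicted label, or None to reject an ambiguous pair."""
    probs = F.softmax(logits, dim=-1)
    confidence, label = probs.max(dim=-1)
    return label.item() if confidence.item() >= threshold else None

rqe_logits = torch.tensor([0.3, 0.5])                    # toy output for one pair
print(label_with_threshold(rqe_logits, threshold=0.6))   # max prob 0.55 < 0.6 -> None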

5.4 Similarity Score Methods

5.4.1 Term Frequency–Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency–Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is term frequency, which measures the number of times the word occurs in a particular document. The other is inverse document frequency, which measures how much information the word provides: if the value is close to 0, the word is a common one, and vice versa. These two concepts form TF-IDF; here are the formulas:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = log( N / count(d ∈ D : t ∈ d) )

where t is the word of interest, d is its document, D is the set of documents and N is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)

where A and B are both vectors; in this case, each element of A and B is one word's TF-IDF weight.
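The filtering step can be sketched as follows. Our tfidf.py builds on gensim; scikit-learn is shown here for brevity, and the example questions are invented:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["what causes high blood pressure",
            "how can i lower my cholesterol",
            "what are common allergy symptoms"]
unanswered = "why is my blood pressure so high"

vectorizer = TfidfVectorizer()
a_vecs = vectorizer.fit_transform(answered)   # one TF-IDF vector per question
u_vec = vectorizer.transform([unanswered])
scores = cosine_similarity(u_vec, a_vecs)[0]  # similarity to every answered question
top_five = scores.argsort()[::-1][:5]         # keep only the top-ranked pairs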

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query. Here is the formula:

BM25(D, Q) = Σ_{i=1}^{n} IDF(q_i, D) × [ f(q_i, D) × (k_1 + 1) ] / [ f(q_i, D) + k_1 × (1 − b + b × |D| / d_avg) ]

where:

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average document length.

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.
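A direct implementation of the formula is sketched below; it uses the IDF definition from Section 5.4.1, whereas production Okapi implementations often use a smoothed IDF variant:

import math
from collections import Counter

def bm25(query, doc, corpus, k1=2.0, b=0.75):
    """Score one tokenized document `doc` against a tokenized `query`."""
    N = len(corpus)
    d_avg = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for q in query:
        df = sum(1 for d in corpus if q in d)  # documents containing the term
        if df == 0:
            continue
        idf = math.log(N / df)
        f = tf[q]
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

corpus = [["blood", "pressure", "high"], ["allergy", "symptoms"], ["lower", "cholesterol"]]
print(bm25(["high", "blood", "pressure"], corpus[0], corpus))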

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations; in this model, 2.2 million 200-dimensional word vectors were trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we sum the embedding vectors of its words and then use cosine similarity to calculate the similarity between sentences.
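A sketch with gensim is shown below; the embedding file name is a placeholder for the Chiu et al. [2016] vectors:

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # sum the 200-dimensional vectors of the in-vocabulary words
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))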

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns the actual class. TP counts instances correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite, where the true class is 1 and the predicted class is 0; TN counts instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct: accuracy = (TP + TN) / total. Recall measures the percentage of all relevant results that are correctly classified: recall = TP / (TP + FN). For multi-class classification the confusion matrix is more complex, but the idea is the same.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)           TP                    FP
Predicted Negative (0)           FN                    TN

Table 5.1: The binary confusion matrix
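These metrics can be computed with scikit-learn, as sketched below on toy labels (note that sklearn's confusion matrix puts the actual classes on the rows, the transpose of Table 5.1):

from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted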


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potentially similar question) pair; step 4, pair the question pairs with the generated labels; step 5, add the labeled question pairs to the CHQs data, evaluate the model again and compare the accuracy with the previous one. NLI follows the same process, except that the test inputs are question–answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. To validate our approach, we calculate p-values, which help determine the significance of the results with respect to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables are related.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before obtaining a p-value, a t-statistic must be calculated. Here we use the unpaired t-test, since we test on a different data set in each iteration. Below is the formula:

t = (x̄_1 − x̄_2) / sqrt( s² × (1/n_1 + 1/n_2) )

where

s² = [ Σ_{i=1}^{n_1} (x_i − x̄_1)² + Σ_{j=1}^{n_2} (x_j − x̄_2)² ] / (n_1 + n_2 − 2)
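In practice, the test can be run with SciPy, as sketched below on the accuracies from Table 6.1 (ttest_ind reports a two-sided p-value; halving it gives the one-sided value for our directional hypothesis):

from scipy import stats

chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]        # CHQs accuracies
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF accuracies

# unpaired two-sample t-test with pooled variance
t_stat, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
p_one_sided = p_two_sided / 2  # alternative: added data increases accuracy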

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. However, Word2Vec's p-value is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed, whose sources comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are formal, with few typos. However, the questions on Health24 are asked by laypeople, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions, and some sentence vectors lack important information because words in the real questions have no corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority of the labels generated by the RQE model is 0; in other words, most question pairs are judged dissimilar. To keep the data balanced, only a small portion of them can be selected, which is why the added data for Word2Vec is the smallest. Furthermore, such a small amount of added data might not help fine-tune the model as much as a larger amount would.

On the other hand, recall measures the percentage of all relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785; after adding the extra data extracted by the Word2Vec method, the average recall barely changes, at 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs    CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682     0.9709          0.9693         0.9710
Iteration 2   0.9655     0.9731          0.9708         0.9699
Iteration 3   0.9676     0.9742          0.9747         0.9710
Iteration 4   0.9682     0.9692          0.9709         0.9726
Iteration 5   0.9731     0.9775          0.9743         0.9715
Average       0.9685     0.9720          0.9746         0.9712

Table 6.1: Accuracy of the different methods on the RQE task

           Added data   p-value
TF-IDF        8084       0.0380
BM25          7390       0.0499
Word2Vec      1830       0.0514

Table 6.2: The amount of added data and the p-values for the RQE task

Table 6.4 shows the amount of added data and the p-values for the NLI task. We find that all p-values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest amount is only 1,287, whereas in the RQE task more than 8,000 pairs were added. There are two conditions on selecting added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced, otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs with label 0, 5 pairs with label 1 and 15 pairs with label 2, we take 5 as the number per class when generating added data for the NLI task (a sketch of this downsampling is given after this list). On the other hand, NLI is a three-label classification problem; compared with a binary problem, each class has less data.
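The class balancing can be sketched with pandas as follows (the label column name is an assumption):

import pandas as pd

def balance_by_smallest_class(df, label_col="label"):
    """Downsample every class to the size of the smallest one, e.g. 10/5/15 -> 5/5/5."""
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n, random_state=0)))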

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviation of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability: the added data are question–answer pairs, but the original MedNLI data are sentence pairs, so the two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287), so adding the extra data to the original data effectively adds novel data, or noise, for the model; the estimator's ability is affected, causing higher variance.

On the other hand, due to limited hardware we only split the data into 5 parts for k-fold cross-validation, so we obtain five accuracies per method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527       0.7601            0.7562           0.7453
Iteration 2   0.7522       0.7426            0.7426           0.7515
Iteration 3   0.7504       0.7469            0.7554           0.7540
Iteration 4   0.7402       0.7593            0.7536           0.7430
Iteration 5   0.7640       0.7657            0.7586           0.7569
Average       0.7523       0.7550            0.7533           0.7501

Table 6.3: Accuracy of the different methods on the NLI task

           Added data   p-value
TF-IDF        1287       0.3340
BM25          1149       0.4191
Word2Vec       459       0.6661

Table 6.4: The amount of added data and the p-values for the NLI task


                      Standard deviation of loss
MedNLI                        0.01392
MedNLI + TF-IDF               0.02166
MedNLI + BM25                 0.01726
MedNLI + Word2Vec             0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and suggest some ideas for improving it in future work.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, raises many challenges: the limited amount of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set usable by data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24; within it, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers among the existing answers for these 1,950 questions. Basically, we use RQE to locate answered questions similar to the unanswered ones. Intuitively, to find whether two questions are similar, we could pair one unanswered question with every answered question; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we have no ground truth for evaluating the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximum probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small amount of added data, but it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are several areas to improve:

• For this project we only have 1,950 questions without answers, and after the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model and easily overfits on small data. So, as a next step, we could expand our data; validating our idea would then be more persuasive.

• In the previous section we argued that the added data for the NLI task differ in format from the original NLI data. As a next step, we could find more suitable data of a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and label generation are done in Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could perform hyperparameter optimization, and we could also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain-specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders that store the source data and the generated data, the .py files, and the .ipynb files.

For folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. The framework in this folder is provided by Scrapy¹.
raw_nli_data and raw_rqe_data: store the source data.
result: stores the labels generated by the RQE and NLI test models.

For .py files:
bm25.py, tfidf.py and word_embedding.py are the three files that filter similar question pairs; their results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified from ChineseSimilarity-gensim-tfidf².
rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.
convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model; this code comes from run_classifier.py in huggingface³.
load_file.py converts the raw source data to dataframes; pre_processed_data pre-processes the sentences.

For .ipynb files:

1 Scrapy: https://scrapy.org
2 ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
3 run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and performs k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. This code is modified from run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab⁴.

4 Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the generated labels for the pairs; they can also be found in the result folder.

Page 32: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

sect33 Augment Data Size 21

is proposed by Shen et al [2018] They use an unsupervised key phrase detector anda generator to generate data in the medical domain The results show that the gen-erated data is significantly improved for the examination QA system

22 Related Work

Chapter 4

Construction of Health24 Data set

In this chapter we introduce the motivation for choosing data set Furthermore wedemonstrate the way of data collection We also analyze the collected data set fromdifferent perspectives such as the length and the content

41 Motivation

There are less researches in the area of patient question answering Also the existingmedical queries data sets have some limitations For example the data are too simplethat cannot mimic human behavior in reality or the lexical gap exists between thepatients and the experts To address these gaps and construct high-quality data setwhich can reflect the real-world biomedical patient question answering we think thedata set should have the following features

bull The data source must be reliable since it can set the standard in this domainand it is easier to convince others Also a reliable data set is crucial for ideasand models because it determines whether they are successful

bull The data source must come from the real-world In other words the questionsshould be asked by patients instead of finding or extracting them from relevantprofessional materials Meanwhile the answers should come from certifieddoctors In this way the constructed data can have good quality and they canbe used in the data-driven approaches

bull The size of the data set should be large As we mentioned before if the dataset is moderate or small the model is easier to cause overfitting and affect theaccuracy

bull The data set should not have any sensitive information such as address dateof the birth email address phone number Otherwise it will offend patientsrsquoprivacy

Based on the features as we describe above we find Health24 1 is a good choice forconstructing This is South Africarsquos leading consumer health website It can provide

1Health24httpswwwhealth24com

23

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25705 question-answer pairs are crawled from Health24 website under22 categories Among them there are 23755 questions have answers and 1950questions have not been answered yet The raw data set is stored in resultcsv Inthis file the first column is answer and the second column is question For thecrawled data we remove the sensitive information such as age and date of birththe hyperlinks etc Instead of replacing other words for this sensitive informa-tion we leave the area blank since other words may affect the results Here aresome examples My month old son was diagnosed with some allergies a few weeks agoand convert ldquoThere are no cures but you can manage it with antihistamines and or desen-tisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)to ldquoThere are no cures but you can manage it with antihistamines andor desentisizationrdquoThe sentence length can be found in Table 41 From the table we can see the av-erage length of three different types is close together they are all around 85 to 95words every sentence However the maximum length has a huge difference Whenwe check the original website we find some usersrsquo questions are very detailed sincethey provide the support information for their queries as much as possible Alsothe experts provide answers step by step corresponding to patientsrsquo questions andtelling them what they should do and not to do Thus we keep these informativesentences when we do the experiments We also find some answers are incompletesuch as Good morning or Hi Lucy For these answers we treat their questions asno-answered questions since the expertsrsquo information cannot solve patientsrsquo queries

2Scrapy httpsscrapyorg

sect43 Analysis for Data set 25

Average Length Maximum LengthAnswered Questions 9540 5417Unanswered Questions 8502 1890Answers 9258 6441

Table 41 The length of questions and answers

26 Construction of Health24 Data set

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the pro-posed method for this project In addition the detailed information about fine-tuning and three scoring methods for filtering the similar questions are explained aswell We also mentioned the evaluation metrics

51 Pre-processing

There are many medical concepts in the sentences the forms they can representby either abbreviations or multi-word expressions but they all stand for the samemeaning To expand the abbreviations and make the sentences to have informa-tion as much as possible(eg expand from BP to blood pressure) we use ScispaCy[Neumann et al 2019] an industrial-strength natural language processing approachfor processing biomedical scientific or clinical text The model here we use isen_ner_bionlp13cg_md since it is trained on the BIONLP13CG corpus

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We firstly convert all the words into lowercase. Then we remove any punctuation. Following that, we use word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming to the words in order to transform them into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
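The report does not list the pre-processing code itself; a minimal sketch of these five steps with NLTK (one of the packages installed in Appendix 3) might look as follows.

    import string
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    # nltk.download("punkt"); nltk.download("stopwords")  # first run only
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(question):
        """Lowercase, strip punctuation, tokenize, drop stop words, stem."""
        question = question.lower()
        question = question.translate(str.maketrans("", "", string.punctuation))
        tokens = word_tokenize(question)
        tokens = [t for t in tokens if t not in stop_words]
        return [stemmer.stem(t) for t in tokens]

    print(preprocess("My son was diagnosed with some allergies a few weeks ago."))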

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (short for classification) to the start of each sequence, so when doing a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is a binary classification and the other is a three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8,890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 question pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward the majority class.

Here we use SciBERT, a pre-trained model for scientific text, since this model improves the performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface¹.
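As an illustration of this setup, the sketch below fine-tunes a sequence-pair classifier with these hyper-parameters using the transformers library; it is not the project's actual code (that is the modified run_classifier.py), and the model path, the toy question pair and the one-pair loop are placeholders.

    import torch
    from torch.optim import AdamW
    from transformers import BertTokenizer, BertForSequenceClassification

    # SciBERT as the starting point; num_labels is 2 for RQE, 3 for NLI.
    model = BertForSequenceClassification.from_pretrained(
        "scibert_scivocab_uncased", num_labels=2)
    tokenizer = BertTokenizer.from_pretrained("scibert_scivocab_uncased")
    optimizer = AdamW(model.parameters(), lr=2e-5)

    pairs = [("question one ...", "question two ...")]  # placeholder pair
    labels = torch.tensor([1])
    accumulation_steps = 4  # batch size 16, accumulated over 4 steps

    model.train()
    # With the real data set this loop runs over many batches.
    for step, ((q1, q2), y) in enumerate(zip(pairs, labels)):
        enc = tokenizer(q1, q2, return_tensors="pt", truncation=True)
        outputs = model(**enc, labels=y.unsqueeze(0))
        loss = outputs[0]
        (loss / accumulation_steps).backward()  # scale loss for accumulation
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()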

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. In general, pseudo-relevance feedback is frequently used in the information retrieval domain. Under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

¹Huggingface GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation, and record the accuracy. We use the fine-tuned model to generate labels for the test set by inserting the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes or not. One intuitive way to build the test set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1,000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the detailed information is introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced. Otherwise, the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, with one assumption: we assume that if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise, it cannot be inferred from the unanswered question. Thus, the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of assigning labels by the maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle ambiguous cases and reduce the misclassification rate. If the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise, we accept it. In general, the rejected data would be classified by experts. However, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set. A minimal sketch of this rejection rule is shown below.
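The sketch assumes raw classifier logits as input; the function name and the toy values are ours, not from the project code.

    import torch
    import torch.nn.functional as F

    def accept_with_threshold(logits, threshold):
        """Return the predicted label, or None when the pair is rejected.

        threshold = 0.6 for the binary RQE labels,
        0.4 for the three-label NLI case."""
        probs = F.softmax(logits, dim=-1)
        top_prob, label = probs.max(dim=-1)
        if top_prob.item() < threshold:
            return None  # ambiguous pair: reject instead of guessing
        return label.item()

    print(accept_with_threshold(torch.tensor([0.2, 1.4]), 0.6))  # 1 (kept)
    print(accept_with_threshold(torch.tensor([0.1, 0.2]), 0.6))  # None (rejected)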

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is the term frequency, which measures the number of times the word occurs in a particular document. The other one is the inverse document frequency, which calculates how much information the word provides; if the result is close to 0, that means it is a common word, and vice versa. These two concepts form TF-IDF. Here is the formula:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

TF(t, d) = log(1 + freq(t, d))

IDF(t, D) = N / count(d ∈ D : t ∈ d)

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF can only be used to measure one word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors. In this case, each element in vectors A and B represents one word's TF-IDF. A small sketch of this scoring step follows.
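The sketch uses scikit-learn's TfidfVectorizer and cosine_similarity; note that scikit-learn's TF and IDF variants differ slightly from the formulas above (it smooths the IDF), and the toy questions are placeholders for the pre-processed Health24 questions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Placeholder questions; in the project these are the pre-processed
    # answered and unanswered Health24 questions.
    answered = ["rash after taking antibiotics", "pills for high blood pressure"]
    unanswered = ["skin rash from antibiotics"]

    vectorizer = TfidfVectorizer()
    answered_vecs = vectorizer.fit_transform(answered)
    unanswered_vecs = vectorizer.transform(unanswered)

    # One row per unanswered question, one column per answered question;
    # the five highest-scoring columns per row give the candidate pairs.
    scores = cosine_similarity(unanswered_vecs, answered_vecs)
    top5 = scores.argsort(axis=1)[:, ::-1][:, :5]
    print(scores, top5)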

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula (a code sketch follows the definitions below):

BM25(D, Q) = Σᵢ₌₁ⁿ IDF(qᵢ, D) × [ f(qᵢ, D) × (k₁ + 1) ] / [ f(qᵢ, D) + k₁ × (1 − b + b × |D| / d_avg) ]

where

• D is the document and Q is the query; qᵢ is a term in the query Q.
• f(qᵢ, D) is the number of times qᵢ occurs in the document D.
• |D| is the length of the document.
• d_avg is the average length of a document.
• k₁ and b are both hyper-parameters. In general, k₁ = 2 and b = 0.75.
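The report does not show its BM25 code (bm25.py), so the following is only a sketch that follows the formula above; since the formula leaves the IDF term unspecified, the sketch uses the common Okapi form.

    import math

    def bm25_score(query_terms, doc_terms, all_docs, k1=2.0, b=0.75):
        """Score one tokenized document against a tokenized query."""
        n_docs = len(all_docs)
        d_avg = sum(len(d) for d in all_docs) / n_docs
        score = 0.0
        for q in query_terms:
            f = doc_terms.count(q)                    # f(qi, D)
            n_q = sum(1 for d in all_docs if q in d)  # documents containing qi
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1)  # Okapi IDF
            score += idf * f * (k1 + 1) / (
                f + k1 * (1 - b + b * len(doc_terms) / d_avg))
        return score

    docs = [["rash", "antibiotic"], ["blood", "pressure", "pill"]]
    print(bm25_score(["rash", "antibiotic"], docs[0], docs))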

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a static word embedding method. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
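A sketch of this summing-and-cosine step with gensim is below; the embedding file name is a placeholder for wherever the Chiu et al. [2016] vectors are stored locally.

    import numpy as np
    from gensim.models import KeyedVectors

    # Placeholder path to the downloaded Chiu et al. [2016] PubMed vectors.
    wv = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

    def sentence_vector(tokens):
        """Sum the vectors of the tokens found in the embedding vocabulary."""
        vecs = [wv[t] for t in tokens if t in wv]
        return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(sentence_vector(["rash", "antibiotic"]),
                 sentence_vector(["skin", "rash", "drug"])))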

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances aligned to the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances aligned to the negative class. Accuracy measures how often the classifier is correct, and its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). As for multi-class classification, the confusion matrix is more complex, but it follows the same idea.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix
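These metrics can be computed with scikit-learn, as in the sketch below with toy labels; note that scikit-learn's confusion_matrix puts the actual classes in rows and the predicted classes in columns, the transpose of Table 5.1.

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    # Toy labels standing in for one cross-validation fold.
    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 1, 0, 0, 1]

    print(confusion_matrix(y_true, y_pred))  # rows: actual, columns: predicted
    print(accuracy_score(y_true, y_pred))    # (TP + TN) / total -> 0.833...
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)    -> 0.75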


Figure 5.1: The diagram shows the process of how the RQE and NLI tasks work. The upper one is RQE and the bottom one is NLI. For the RQE task: step 1, use CHQs data to fine-tune the BERT model. Step 2, use CHQs data to evaluate the model by using k-fold cross-validation, and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each unanswered question and potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the question pairs with labels to CHQs and evaluate the model again, then compare the accuracy with the previous one. NLI follows the same process, except that the input of testing is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value. This helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-test needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test (a sketch of the computation follows the formula):

t = (x̄₁ − x̄₂) / √( s² (1/n₁ + 1/n₂) )

where

s² = [ Σᵢ (xᵢ − x̄₁)² + Σⱼ (xⱼ − x̄₂)² ] / (n₁ + n₂ − 2),

with i running over the n₁ samples of the first set and j over the n₂ samples of the second.
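As an illustration, the test can be run with scipy on the accuracy columns of Table 6.1. This is our sketch rather than the project code, and since the report does not state whether its p-values are one- or two-sided, the sketch simply prints the two-sided value.

    from scipy import stats

    # Per-iteration accuracies from Table 6.1: original CHQs vs. CHQs + TF-IDF.
    chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]
    chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]

    # ttest_ind with equal_var=True is exactly the pooled-variance
    # unpaired t-test written above.
    t, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
    print(t, p_two_sided)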

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model which we use for Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embedding method, we find the majority label generated by the RQE model is 0. In other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected. That is the reason why the added data for Word2Vec is the smallest. Furthermore, this small amount of added data might not help the model fine-tune as well as a larger amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall does not change very much: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task

           added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The added data numbers and p-values for the RQE task

Table 6.4 shows the added data numbers and p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number is more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferable from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider this case.

• Second, the added data must be balanced; otherwise, they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of each class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with the binary classification problem, each class for NLI does not have much data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the number of extra data is not large (the largest is only about 1,287). So when we add the extra data into the original data, it effectively adds novel data, or noise, for the model. The estimator's ability would be affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. However, such a small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task

           added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The added data numbers and p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas that future work could pursue to improve this project.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used in data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this can generate a massive number of question pairs. To reduce the running time and increase efficiency, we use three similarity score methods before applying the RQE task; this step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: more specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to cause overfitting with a small data set. So, for the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section, we noted that the added data for the NLI task are different from the original NLI data. For the next step, we can find more suitable data of a similar type to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)



Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: folders, which store the source data and the generated data; py files; and ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy¹.

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For py files:

bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf².

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface³.

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.

For ipynb files

¹Scrapy: https://scrapy.org
²ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The py files run in PyCharm; the ipynb files run in Google Colab⁴.

⁴Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: py files and ipynb files. The py files are used to process the data. The ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw rqe data and nli data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to data frames (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original rqe and nli data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original rqe and nli data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.


As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

Chapter 4

Construction of Health24 Data set

In this chapter, we introduce the motivation for choosing the data set. Furthermore, we demonstrate the way the data were collected. We also analyze the collected data set from different perspectives, such as length and content.

4.1 Motivation

There is less research in the area of patient question answering. Also, the existing medical query data sets have some limitations: for example, the data are too simple to mimic real human behaviour, or a lexical gap exists between the patients and the experts. To address these gaps and construct a high-quality data set that reflects real-world biomedical patient question answering, we think the data set should have the following features:

• The data source must be reliable, since it can set the standard in this domain and it is easier to convince others. Also, a reliable data set is crucial for ideas and models, because it determines whether they are successful.

• The data source must come from the real world. In other words, the questions should be asked by patients, instead of being found in or extracted from relevant professional materials. Meanwhile, the answers should come from certified doctors. In this way, the constructed data can have good quality, and they can be used in data-driven approaches.

• The size of the data set should be large. As mentioned before, if the data set is moderate or small, the model is more likely to overfit, which affects the accuracy.

• The data set should not contain any sensitive information, such as addresses, dates of birth, email addresses or phone numbers. Otherwise it would violate patients' privacy.

Based on the features described above, we find that Health24 1 is a good choice for construction. Health24 is South Africa's leading consumer health website; it provides world-class information for a healthy lifestyle. Besides, there is an interactive tool that helps patients get their questions answered. These questions and answers are both public, and they are continuously updated. Hence, we use them to construct our data set.

1 Health24: https://www.health24.com

4.2 Collecting the Data Set

To collect the data from the Health24 website, we use Scrapy 2, a fast and powerful collaborative framework for extracting data from websites. There are 22 broad categories on the website, such as Allergies, Burning urine, Cholesterol Level and Depression. We only crawl the questions and the experts' answers from these categories, because the experts' answers are more reliable and professional; mismatches or misleading content may exist between questions and other users' answers.

2 Scrapy: https://scrapy.org
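To illustrate, a minimal Scrapy spider has the shape sketched below. The start URL and CSS selectors here are placeholders, not the ones used in the actual health.py of the artefact:

    import scrapy

    class HealthSpider(scrapy.Spider):
        name = "health"
        # Hypothetical entry page; the real spider targets the 22 expert categories.
        start_urls = ["https://www.health24.com/Experts"]

        def parse(self, response):
            # Hypothetical selectors: extract each question with its expert answer.
            for qa in response.css("div.question-block"):
                yield {
                    "question": qa.css("div.question::text").get(),
                    "answer": qa.css("div.expert-answer::text").get(),
                }
            # Follow pagination if present.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)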

4.3 Analysis of the Data Set

A total of 25,705 question-answer pairs were crawled from the Health24 website under 22 categories. Among them, 23,755 questions have answers and 1,950 questions have not been answered yet. The raw data set is stored in result.csv; in this file, the first column is the answer and the second column is the question. From the crawled data, we remove sensitive information such as ages and dates of birth, hyperlinks, etc. Instead of substituting placeholder words for this sensitive information, we leave the area blank, since substituted words might affect the results. Here are some examples: "My month old son was diagnosed with some allergies a few weeks ago" (the age is left blank), and we convert "There are no cures but you can manage it with antihistamines and/or desentisization (http://www.healthscout.com/ency/68/98/main.html#TreatmentofAllergies)" to "There are no cures but you can manage it with antihistamines and/or desentisization". The sentence lengths can be found in Table 4.1. From the table, we can see that the average lengths of the three different types are close together: all are around 85 to 95 words per text. However, the maximum lengths differ hugely. When we check the original website, we find that some users' questions are very detailed, since they provide as much supporting information for their queries as possible. Also, the experts answer step by step, corresponding to the patients' questions, telling them what they should and should not do. Thus, we keep these informative sentences in the experiments. We also find that some answers are incomplete, such as "Good morning" or "Hi Lucy"; for these answers, we treat their questions as unanswered, since the experts' replies cannot solve the patients' queries.

                       Average Length   Maximum Length
Answered Questions         95.40            5417
Unanswered Questions       85.02            1890
Answers                    92.58            6441

Table 4.1: The length of questions and answers


Chapter 5

Methodology

In this chapter, we describe the pre-processing of the input sentences and the proposed method for this project. In addition, detailed information about fine-tuning and the three scoring methods for filtering similar questions is given. We also describe the evaluation metrics.

5.1 Pre-processing

Many medical concepts in the sentences can be written either as abbreviations or as multi-word expressions that stand for the same meaning. To expand the abbreviations and give the sentences as much information as possible (e.g., expanding "BP" to "blood pressure"), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing library for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate similarity scores between questions, we also pre-process the original questions. We first convert all words to lowercase, then remove any punctuation. Following that, we tokenize the words in order to split the sentences into individual units, and we remove stop words. Besides, we apply stemming to transform words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
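A minimal sketch of this pipeline with NLTK (assuming the punkt and stopwords resources have been downloaded; the project's actual pre-processing code may differ in detail):

    import string
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(question):
        # Lowercase and strip punctuation.
        text = question.lower().translate(str.maketrans("", "", string.punctuation))
        # Tokenize, drop stop words and stem what remains.
        return [STEMMER.stem(tok) for tok in word_tokenize(text) if tok not in STOP_WORDS]

    print(preprocess("My son was diagnosed with some allergies a few weeks ago."))
    # e.g. ['son', 'diagnos', 'allergi', 'week', 'ago']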

5.2 Fine-Tuning

For the downstream tasks, it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token, which stands for classification, to the start of each sentence. So when doing classification, we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the word representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is binary classification and the other is three-label classification. Thus, they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are 8,890, 302 and 230 medical question pairs in the training, validation and test sets, respectively, created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce this gap and help the model generalize better, we combine all the sets for fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11,232, 1,395 and 1,422 question pairs for the training, validation and test sets, respectively, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we follow the suggestions of Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4 and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 for the RQE and NLI tasks, respectively. The fine-tuning code is modified from run_classifier.py, which is provided by huggingface 1.

1 Huggingface GitHub: https://github.com/huggingface/transformers
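As a sketch of what this setup looks like with the huggingface library (the project's actual code is adapted from run_classifier.py, so details differ; the model identifier string is an assumption):

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    MODEL = "allenai/scibert_scivocab_uncased"  # assumed model identifier
    tokenizer = BertTokenizer.from_pretrained(MODEL)
    model = BertForSequenceClassification.from_pretrained(MODEL, num_labels=2)  # 3 for NLI
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    def training_step(questions_a, questions_b, labels):
        # Sentence pairs are packed into one sequence; the classification head
        # reads the [CLS] representation, as described above.
        enc = tokenizer(questions_a, questions_b, padding=True, truncation=True,
                        return_tensors="pt")
        out = model(**enc, labels=torch.tensor(labels))
        out.loss.backward()  # with batch size 16, step the optimizer every 4 batches
        return out.loss.item()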

5.3 Method

To crowdsource high-quality data from Health24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether an answered question is similar to an unanswered question in Health24. After that, from the similar question pairs (answered question, unanswered question) generated by RQE, we use NLI to validate whether the similar question's answer can be inferred from the unanswered question. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in information retrieval: under the assumption that the top-ranked documents contain many useful terms that help discriminate relevant documents from irrelevant ones, it formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step, overall performance can improve, since relevant documents missed in the first round can be retrieved in the second [Cao et al., 2008]. In our project, we mimic this idea to test whether the generated data have high accuracy.

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model with k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with their labels into the original CHQs data set and re-evaluate the model to see whether the accuracy changes. One intuitive way to build the test set is to pair one unanswered question with every answered question; if there are 10 unanswered questions and 100 answered questions, this yields 1,000 question pairs, most of which are not similar. To reduce the running time and improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions and keep only the top five pairs per unanswered question. This step shrinks the test set by filtering out obviously dissimilar question pairs; the details are given in Section 5.4. One thing to note is that the added data must be balanced; otherwise the model will bias toward the majority class and the results will not be accurate. The NLI task follows similar steps, under one assumption: if an unanswered question and an answered question are similar, then the answer of the answered question has some probability of being inferable from the unanswered question; otherwise it cannot be inferred. Thus, the test set for the NLI task is formed from the unanswered questions and the answers of their similar answered questions.

When using the fine-tuned models to generate labels for the test sets, not every pair is assigned a correct label. To reduce noisy data in the training set, we set thresholds on the output probabilities [Bishop, 2006]. Instead of always taking the label with the maximum probability, we require that probability to exceed 0.6 for the binary (RQE) classification and 0.4 for the three-label (NLI) classification. This helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is below the threshold, the pair is rejected; otherwise it is accepted. In general, rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
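A sketch of this rejection rule over a model's output logits might be:

    import torch
    import torch.nn.functional as F

    def label_or_reject(logits, threshold):
        """Return the predicted label, or None to reject an ambiguous pair."""
        probs = F.softmax(logits, dim=-1)
        confidence, label = probs.max(dim=-1)
        return int(label) if confidence.item() >= threshold else None

    # RQE (binary) uses 0.6; NLI (three-label) uses 0.4.
    print(label_or_reject(torch.tensor([0.3, 0.5]), threshold=0.6))  # None: too ambiguous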

5.4 Similarity Score Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency; it is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts. One is the term frequency, which measures the number of times the word occurs in a particular document. The other is the inverse document frequency, which measures how much information the word provides; if the result is close to 0, it is a common word, and vice versa. These two concepts form TF-IDF; here is the formula:

    TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

    TF(t, d) = log(1 + freq(t, d))

    IDF(t, D) = N / count(d ∈ D : t ∈ d)

where t is the word we want to score, d is its document, D is the set of documents and N is the total number of documents. TF-IDF only scores a single word; to calculate the similarity between two sentences, we can use cosine similarity:

    similarity(A, B) = (A · B) / (|A| × |B|)

where A and B are both vectors; in this case, each element of A and B is one word's TF-IDF weight.
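In practice, libraries such as scikit-learn bundle both steps. Note that scikit-learn's IDF uses a smoothed logarithmic variant rather than the exact ratio above, so this is an illustration of the idea rather than of the project's tfidf.py:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    answered = ["what can i take for hay fever", "how can i lower my cholesterol"]
    unanswered = ["best treatment for hay fever allergies"]

    vectorizer = TfidfVectorizer()
    answered_vecs = vectorizer.fit_transform(answered)   # one TF-IDF vector per question
    query_vec = vectorizer.transform(unanswered)
    print(cosine_similarity(query_vec, answered_vecs))   # higher score = more similar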

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a given document or query. Here is the formula:

    BM25(D, Q) = Σ_{i=1..n} IDF(q_i, D) · [ f(q_i, D) · (k1 + 1) ] / [ f(q_i, D) + k1 · (1 − b + b · |D| / d_avg) ]

where:

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_avg is the average length of a document.

• k1 and b are both hyper-parameters; in general, k1 = 2 and b = 0.75.
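A self-contained sketch of this scoring function follows; it uses the common Okapi IDF variant, since the formula above leaves IDF unspecified:

    import math
    from collections import Counter

    def bm25(query, doc, corpus, k1=2.0, b=0.75):
        """Score one tokenized document against a tokenized query."""
        d_avg = sum(len(d) for d in corpus) / len(corpus)
        tf = Counter(doc)
        score = 0.0
        for term in query:
            df = sum(1 for d in corpus if term in d)  # document frequency of the term
            idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)
            num = tf[term] * (k1 + 1)
            den = tf[term] + k1 * (1 - b + b * len(doc) / d_avg)
            score += idf * num / den
        return score

    corpus = [["hay", "fever", "treatment"], ["lower", "cholesterol", "diet"]]
    print(bm25(["hay", "fever"], corpus[0], corpus))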

5.4.3 Word2Vec

We introduced Word2Vec, a static word embedding method, in Section 2.3.1. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert words into numeric representations; in this model, 22 million 200-dimensional word vectors were trained on PubMed [Tilbury, 2018]. For each sentence, we add the embedding vectors of its words together and then use cosine similarity to calculate the similarity between sentences.
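A sketch with gensim (the file name and binary format are placeholders for the Chiu et al. [2016] PubMed vectors; the project's actual code is in word_embedding.py):

    import numpy as np
    from gensim.models import KeyedVectors

    # Placeholder path/format for the pre-trained 200-dimensional PubMed vectors.
    vectors = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

    def sentence_vector(tokens):
        # Sum the vectors of in-vocabulary words; out-of-vocabulary words are
        # skipped, which is the information loss discussed in Chapter 6.
        known = [vectors[t] for t in tokens if t in vectors]
        return np.sum(known, axis=0) if known else np.zeros(vectors.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))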

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on test data for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means instances correctly assigned to the positive class. FP means the true class is 0 but the predicted class is 1. FN is the opposite of FP: the true class is 1 but the predicted class is 0. TN means instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN) / total. Recall measures the percentage of total relevant results correctly classified; its formula is TP / (TP + FN). For multi-class classification, the confusion matrix is more complex, but the idea is the same.

                         Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)   TP                    FP
Predicted Negative (0)   FN                    TN

Table 5.1: The binary confusion matrix
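These quantities can be computed with scikit-learn. Note that scikit-learn's confusion_matrix puts the actual class on the rows and the predicted class on the columns, the transpose of Table 5.1:

    from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

    y_true = [1, 0, 1, 1, 0, 1]
    y_pred = [1, 0, 0, 1, 0, 1]

    print(confusion_matrix(y_true, y_pred))  # [[2 0]
                                             #  [1 3]]  (rows: actual, columns: predicted)
    print(accuracy_score(y_true, y_pred))    # (TP + TN) / total = 5/6
    print(recall_score(y_true, y_pred))      # TP / (TP + FN)   = 3/4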


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potentially similar question) pair. Step 4, pair the question pairs with the generated labels. Step 5, add the labelled question pairs to CHQs, evaluate the model again and compare the accuracy with the previous one. NLI follows the same process, except that the test input consists of question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. To validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables are related.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since a different data set is used for testing in each iteration. Below is the formula of the unpaired t-test:

    t = (x̄_1 − x̄_2) / sqrt( s^2 (1/n_1 + 1/n_2) )

where

    s^2 = [ Σ_{i=1..n_1} (x_i − x̄_1)^2 + Σ_{j=1..n_2} (x_j − x̄_2)^2 ] / (n_1 + n_2 − 2)

From Table 6.2, we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. A mismatch therefore exists between the words in the pre-trained model and the words in the real questions. As a result, some sentence vectors lack important information, because words in the real questions have no corresponding vectors in the pre-trained model.

• From the data perspective: due to this inaccurate word embedding, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are judged not similar. To keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, such a small amount of added data might not help the model fine-tune as well as a larger addition would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find that the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods on the RQE task

           Added data   p-value
TF-IDF     8,084        0.0380
BM25       7,390        0.0499
Word2Vec   1,830        0.0514

Table 6.2: The added data counts and p-values for the RQE task

Table 6.4 shows the added data counts and p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest count is only 1,287, whereas in the RQE task more than 8,000 pairs were added. There are two conditions we apply when selecting added data for the NLI task:

• First, we assume that only the answers of answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number of pairs per class when generating added data for the NLI task. Moreover, NLI is a three-label classification problem; compared with a binary classification problem, each class has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the standard deviation is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287 pairs). So when we add the extra data to the original data, it effectively introduces novel data, or noise, to the model; the estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. Such a small sample size cannot yield an accurate p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods on the NLI task

           Added data   p-value
TF-IDF     1,287        0.3340
BM25       1,149        0.4191
Word2Vec   459          0.6661

Table 6.4: The added data counts and p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and suggest some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, raises many challenges, such as the limited amount of annotated data, the quality of the existing data and the content of the data. To address some of these issues and construct a data set that can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,755 questions with answers and 1,950 questions without answers. Our goal is to find answers among the existing answers for these 1,950 questions. Basically, we use RQE to locate questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions without answers. After the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit with a small amount of data. For the next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task differ from the original NLI data. For the next step, we can find more suitable data of a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and label generation are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association.

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379.

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375.

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137–1155.

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250.

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE.

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174.

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054.

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625–660.

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018).

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text.

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019).

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE.

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014).

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019).

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63.

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171.

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019).

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018).

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109.

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018).

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018).

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018).

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020).

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426.

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388.

Appendices

Appendix 1: Project Description and Contract

From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the .py files; and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files that crawl data from the Health24 website, and result.csv stores the result. In this folder, the framework is provided by Scrapy.¹ (A minimal, illustrative spider sketch follows this list.)

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.
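For orientation, the crawling logic in health.py reduces to a standard Scrapy spider whose items are then written to result.csv by the pipeline. The sketch below is illustrative only: the start URL, CSS selectors, and field names are assumptions for exposition, not the exact ones used in the artefact.

    import scrapy

    class HealthSpider(scrapy.Spider):
        # Name used by the "scrapy crawl health" command in the readme
        name = "health"
        # Illustrative start URL; the real spider covers the 22 expert categories
        start_urls = ["https://www.health24.com/Experts"]

        def parse(self, response):
            # Illustrative selectors; the real page structure may differ
            for qa in response.css("div.question-answer"):
                yield {
                    "question": qa.css("div.question::text").get(),
                    "answer": qa.css("div.expert-answer::text").get(),
                }
            # Follow pagination, if present, and parse the next page the same way
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)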

For .py files:

bm25.py, tfidf.py, and word_embedding.py are three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.² (A sketch of this filtering step follows below.)

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface.³

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.
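As a rough sketch of what the filtering step amounts to (assuming gensim, in the spirit of the ChineseSimilarity-gensim-tfidf script that tfidf.py is based on; the function and variable names are illustrative), each unanswered question is scored against all answered questions by cosine similarity over TF-IDF vectors, and only the top five pairs are kept. bm25.py and word_embedding.py follow the same pattern with BM25 and averaged word-vector scorers instead.

    from gensim import corpora, models, similarities

    def top_k_similar(answered, unanswered, k=5):
        # answered / unanswered: lists of pre-processed, tokenized questions
        dictionary = corpora.Dictionary(answered)
        bow_corpus = [dictionary.doc2bow(q) for q in answered]
        tfidf = models.TfidfModel(bow_corpus)
        # MatrixSimilarity computes cosine similarity against the whole corpus
        index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                              num_features=len(dictionary))
        pairs = []
        for tokens in unanswered:
            scores = index[tfidf[dictionary.doc2bow(tokens)]]
            # Keep only the k highest-scoring answered questions for this query
            best = sorted(enumerate(scores), key=lambda t: t[1], reverse=True)[:k]
            pairs.extend((tokens, answered[i], float(s)) for i, s in best)
        return pairs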

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model (a minimal sketch of this labelling step follows below). This code is modified based on run_classifier.py in huggingface.³

[Figure 1: The tree structure of the health24 folder.]
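To make the labelling step concrete, here is a minimal sketch of what test.ipynb does for each candidate pair, written against the current huggingface transformers API rather than the older run_classifier.py-style code the notebooks actually use. The checkpoint path is an assumption, and the 0.6 rejection threshold is the one described in the methodology chapter for the binary RQE task.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Assumed path of the fine-tuned SciBERT checkpoint saved from fine-tune.ipynb
    MODEL_DIR = "scibert_scivocab_uncased_finetuned_rqe"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
    model.eval()

    def label_pair(text_a, text_b, threshold=0.6):
        # The pair is encoded as one "[CLS] text_a [SEP] text_b [SEP]" sequence;
        # the classification head reads the [CLS] representation
        inputs = tokenizer(text_a, text_b, return_tensors="pt",
                           truncation=True, max_length=128)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
        top_prob, top_label = probs.max(dim=0)
        # Reject ambiguous pairs whose best class falls below the threshold
        if top_prob.item() < threshold:
            return None
        return int(top_label.item())

The NLI notebook follows the same loop with three labels and a 0.4 threshold, as described in the methodology chapter.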

.py files run in PyCharm; .ipynb files run in Google Colab.⁴

¹ Scrapy: https://scrapy.org
² ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³ run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
⁴ Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to data frames (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

• Acknowledgments
• Abstract
• Contents
• Introduction
  • Motivation
  • Main Contribution
  • Outline
• Background
  • Background on NLP
    • Neural Network Language Model
    • RNN and LSTM
    • Transformer
  • Word Embedding
  • Static Word Embedding
    • Word2Vec
    • GloVe
  • Dynamic Word Embedding
    • Pre-training
    • ELMO
    • BERT
  • Tasks
    • Recognizing question entailment (RQE)
    • Natural Language Inference (NLI)
• Related Work
  • RQE
  • NLI
  • Augment Data Size
• Construction of Health24 Data set
  • Motivation
  • Collecting Data set
  • Analysis for Data set
• Methodology
  • Pre-processing
  • Fine-Tuning
  • Method
  • Similarity Scores Methods
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Best Matching 25 (BM25)
    • Word2Vec
  • Evaluation Metrics
• Discussion on Results
• Conclusion and Future Work
  • Conclusion
  • Future Work
• Bibliography
• Appendices
• Appendix 1: Project Description and Contract
• Appendix 2: Artefact Description
• Appendix 3: Readme File
Page 35: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

24 Construction of Health24 Data set

world-class information for a healthy lifestyle Besides there is an interactive toolfor patients to help hem answer their questions These questions and answers areboth public and they are continuously updating Hence we use these questions andanswers to construct our data set

42 Collecting Data set

For collecting the data from Health24 website we use Scrapy2 a fast and powerfulcollaborative framework for extracting the data from websites There are 22 big cat-egories on the website such as Allergies Burning urine Cholesterol Level Depres-sion etc We only crawl the questions and the expertsrsquo answers from these categoriesThe reason is that the expertsrsquo answers are more reliable and professional Maybesome mismatch or misleading exists between questions and usersrsquo answers

43 Analysis for Data set

A total of 25705 question-answer pairs are crawled from Health24 website under22 categories Among them there are 23755 questions have answers and 1950questions have not been answered yet The raw data set is stored in resultcsv Inthis file the first column is answer and the second column is question For thecrawled data we remove the sensitive information such as age and date of birththe hyperlinks etc Instead of replacing other words for this sensitive informa-tion we leave the area blank since other words may affect the results Here aresome examples My month old son was diagnosed with some allergies a few weeks agoand convert ldquoThere are no cures but you can manage it with antihistamines and or desen-tisization (httpwwwhealthscoutcomency6898mainhtmlTreatmentofAllergies)to ldquoThere are no cures but you can manage it with antihistamines andor desentisizationrdquoThe sentence length can be found in Table 41 From the table we can see the av-erage length of three different types is close together they are all around 85 to 95words every sentence However the maximum length has a huge difference Whenwe check the original website we find some usersrsquo questions are very detailed sincethey provide the support information for their queries as much as possible Alsothe experts provide answers step by step corresponding to patientsrsquo questions andtelling them what they should do and not to do Thus we keep these informativesentences when we do the experiments We also find some answers are incompletesuch as Good morning or Hi Lucy For these answers we treat their questions asno-answered questions since the expertsrsquo information cannot solve patientsrsquo queries

2Scrapy httpsscrapyorg

sect43 Analysis for Data set 25

Average Length Maximum LengthAnswered Questions 9540 5417Unanswered Questions 8502 1890Answers 9258 6441

Table 41 The length of questions and answers

26 Construction of Health24 Data set

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the pro-posed method for this project In addition the detailed information about fine-tuning and three scoring methods for filtering the similar questions are explained aswell We also mentioned the evaluation metrics

51 Pre-processing

There are many medical concepts in the sentences the forms they can representby either abbreviations or multi-word expressions but they all stand for the samemeaning To expand the abbreviations and make the sentences to have informa-tion as much as possible(eg expand from BP to blood pressure) we use ScispaCy[Neumann et al 2019] an industrial-strength natural language processing approachfor processing biomedical scientific or clinical text The model here we use isen_ner_bionlp13cg_md since it is trained on the BIONLP13CG corpus

When we use TF-IDF BM25 and Word2Vec to calculate the similarity scores be-tween the questions we also do pre-processing for the original questions We firstlyconvert all the words into lowercase Then we remove any punctuation Follow-ing that we use the word tokenization to tokenize the words in order to split thesentences into individual units We also remove the stop words in the sentencesBesides we apply stemming for the words in order to transform the words into theirbases or roots Through these steps the raw questions can be transformed into amore suitable one for the following tasks

52 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model since we onlyneed to plug in the input and output into BERT and fine-tune all the parametersend-to-end [Devlin et al 2018] BERT prepends a [CLS] stands for classification tothe start of each sentence So when doing the classification problem we only needto feed the [CLS] representation into any arbitrary classifier to get the results insteadof summing all the wordsrsquo representations together Besides RQE and NLI are both

27

28 Methodology

classification problems The only difference is one is the binary classification andthe other one is the three-label classification Thus they can share one pre-trainedmodel

When fine-tuning the model for RQE task the data set comprises consumer healthquestions (CHQs) received by the US National Library of medicine (NLM) and Fre-quently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019] There arerespectively 8890 302 and 230 medical question pairs in training validation andtest sets created by Abacha and Demner-Fushman [2016] and the labels are true andfalse After analyzing these data we find there is a mismatch between training andvalidation sets for example the entailing examples in the training set are simpleand the non-entailing examples are completely unrelated However the sentences inthe validation set are complex To reduce the gap and help the model with bettergeneralization we combine all the data sets to do fine-tuning

The data set used for NLI task is called MedNLI which is derived from MIMIC-III [Romanov and Shivade 2018] This set includes 11232 1395 and 1422 questionpairs respectively for training validation and test sets and the labels are contradic-tion entailment and neutral Same as the previous one we combine all the sets andcross-validation is used as well Besides CHQs and MedNLI are the balanced datasets so the model will not bias to the majority class

For here we use SciBERT a pre-trained model for scientific text since the modelimproves the performance on a range of NLP tasks in the scientific domain [Belt-agy et al 2019] When fine-tuning the pre-trained model we take the suggestionsfrom Devlin et al [2018] that setting the Batch size as 16 gradient accumulation stepas 4 and the learning rate as 2e-5 for both two tasks Besides we set epoch for 1and 2 respectively for RQE and NLI tasks The fine-tuning code is modified basedon run_classifierpy which is provided by huggingface 1

53 Method

To crowdsource high-quality data in Health 24 we use RQE and NLI tasks Generallyspeaking our goal is to find the existing answers for the unanswered questions inthe crawled data set We use the fine-tuned RQE model to label if the answeredquestions are similar to the unanswered questions in Health24 After that from thesimilar question pairs (answered questions and unanswered questions) generatedby RQE we use NLI to validate if the similar questionsrsquo answers can be inferredfrom their unanswered questions Since there is no ground truth for both tasks weuse the idea of Pseudo-relevance feedback In general Pseudo-relevance feedback isfrequently used in the information retrieval domain Under the assumption that thetop-ranked documents contain many useful terms which can help to discriminate therelevant documents from irrelevant ones Pseudo-relevance feedback is to formulatenew queries for the second round retrieval by extracting expansion terms from thetop-ranked documents in the first round Through this step the overall performance

1Huggingface GitHub httpsgithubcomhuggingfacetransformers

sect54 Similarity Scores Methods 29

can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 51 For RQE partWe use CHQs data to fine-tune the pre-trained BERT model Then we use the samedata to evaluate the model by using k-fold cross-validation and record the accu-racy We use the fine-tuned model to generate labels for the test set by insertingthe unanswered questions and the potential similar answered questions After thatwe add the generated question pairs with labels into the original CHQs data set tore-evaluate the model and see if the accuracy changes or not One intuitive way tobuild the testing set is to pair one unanswered question with all answered questionsIf there are 10 unanswered questions and 100 answered questions the question pairswill be 1000 However not each of them is similar To reduce the running time aswell as time complexity and improve efficiency we use three similarity scoring meth-ods (TF-IDF BM25 and Word2Vec) to calculate the similarity between questions Weonly take the top-five questions pairs This step can reduce the number of pairs in thetest set by filtering some obviously dissimilar questions pairs The detailed informa-tion is introduced in Section 54 Besides one thing that needs to be noted is that theadded data must be balanced Otherwise the model will bias toward the majorityclass and the result will not be accuracy As for NLI task the steps are similar toRQE task Here is an assumption We assume if the unanswered question and theanswered question are similar then the answer of the answered question has someprobability that can be inferred from the unanswered question Otherwise it cannotbe inferred from the unanswered question Thus the test set in NLI task is formedby the unanswered questions and their similar answered questionsrsquo answers

When using the fine-tuned model to generate labels for the test sets not everypair is allocated to a correct label To reduce noisy data in the training data setwe set the thresholds in the output [Bishop 2006] Instead of setting labels for themaximum probabilities we set 06 and 04 as the thresholds for binary (RQE) andthree-label (NLI) classifications respectively This method can help us to solve someambiguous cases to reduce the misclassification rate If the maximum probability foreach pair is lower than the threshold this pair will be rejected Otherwise we willaccept this pair In general the rejected data will be classified by experts Howeverdue to the limited time of the project and no experts the rejected data are removedfrom the data set

54 Similarity Scores Methods

541 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency it is a statisticalmeasure that reflects how important a word is to a document in a collection of cor-pus [Rajaraman and Ullman 2011] It consists of two parts one is term frequencymeasures the number of times the word occurs in a particular document The other

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TFminus IDF(t d D) = TF(t d)times IDF(t D)

TF(t d) = log(1 + f req(t d))

IDF(t D) =N

count(d isin D t isin d)

where t is the word we want to calculate d is its document D is the sets of doc-uments and N is the total number of the documents TF-IDF can only be used tomeasure one word if we want to calculate the similarity between two sentences wecan use cosine similarity

similarity(A B) =A middot B|A| times |B|

where A and B are both vectors In this case each element in vector A and Brepresent one wordrsquos TF-IDF

542 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find the similardocuments for a particular given document or query Here is the formula

BM25(D Q) =n

sumi=1

IDF(qi D)f (qi D) middot (k1 + 1)

f (qi) + k1 middot (1minus b + b middot |D|davg)

where

bull D is the document and Q is the query qi is the term in the query Q

bull f (qi D) is the number of times qi occurs in the document D

bull |D| is the length of the document

bull davg is the average length of a document

bull k1 and b both are hyper-parameters In general k1 = 2 b = 075

543 Word2Vec

We introduce Word2Vec in section 231 which is a way of static word embeddingIn the medical domain there has been work to provide high-quality domain specificembeddings For here we use the pre-trained embedding which created by Chiuet al [2016] to convert the words into numeric representations In this module 22million 200-dimensional word vectors are trained on PubMed (289 tokens) [Tilbury

sect55 Evaluation Metrics 31

2018] For each sentence we add each word embedding vector together and thenuse cosine similarity to calculate the similarity between the sentences

55 Evaluation Metrics

For evaluation measure we use the accuracy recall and confusion matrix to mea-sure the performance of the models Confusion matrix is a table to describe theperformance of the classification model on the test data set which the true valuesare known Here we only show the binary confusion matrix since it is easier to un-derstand (Table 51) In the table the row represents the predicated class and thecolumn represents the actual class TP means the instances align to the positive classFP means the true class is 0 and the predicated class is 1 FN is opposite to FP whichthe true class is 1 and the predicated class is 0 TN means the instances align tothe negative class Accuracy is to measure how often is the classifier correct and itsformula is accuracy = TP + TNtotal While recall is to measure the percentage oftotal relevant results correctly classified Its formula is TPTP + FN As for multi-categories classification the confusion matrix is more complex but it has the sameidea

Actual Positive (1) Actual Negative (0)Predicated Positive (1) TP FPPredicated Negative (0) FN TN

Table 51 Table for the confusion matrix

32 Methodology

Figure 51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 use CHQs data tofine-tune BERT model Step 2 use CHQs data to evaluate the model by using k-foldcross-validation and record each accuracy Step 3 use fine-tuned BERT model togenerate label for unanswered question and potential similar question Step 4 pairquestion pairs with generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracy with the previousone As for NLI it has the same process except the input of testing is question-answer

pairs

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task the null hypothesis is that the generated datacannot improve the accuracy The alternative hypothesis is the generated data canimprove the accuracy Before we get a p-value the t-test needs to be calculated Herewe calculate the unpaired t-test since we use different data set to do testing in eachiteration Below is the formula of unpaired t-test

Unpaired t-test =x1 minus x2radic

s2(1n1

+1n2

)

where

s2 =sumn1

i=1(xi minus x1)2 + sumn2

j=1(xj minus x2)2

n1 + n2 minus 2

From Table 62 we can see only the results of TF-IDF and BM25 are less than 005In this case we can reject the null hypothesis and accept the alternative hypothesiswhere the generated data can increase the accuracy Since we mimic the way ofpseudo-relevance feedback to evaluate the generated data these p-values can showthat the generated question pairs have high accuracy However Word2Vec is higherthan 005 We can explain this from two perspectives

bull From scoring methods perspective The pre-trained embedding model whichwe use in Word2Vec is created by training on PubMed The source in PubMedcomprises more than 30 million references and abstracts on biomedical topicsfrom MEDLINE These documents are written by experts so the words are all

33

34 Discussion on Results

formal with fewer typos However the questions in Health24 are asked bynormal people colloquial words and typos cannot be avoided There exists amismatch between the words in the pre-trained model and the words in the realquestions Thus some sentence vectors will lack important information due tothe reason that the words in the real questions cannot find their correspondentvectors in the pre-trained model

bull From data perspective due to the inaccurate word embedding method we findthe majority label of RQE model generated is 0 In other words most of thequestion pairs are not similar In order to make balanced data only a small partof them are selected So that is the reason why the added data for Word2Vec isthe smallest Furthermore the small size added data might not help the modelto fine-tune very well compared with the added data with large size

On the other hand recall is to measure the percentage of total relevant resultscorrectly classified We find the average recall of the original data (CHQs) is 09785After adding extra data extracted by Word2Vec method the average recall does notchange very much It is 09786 This value can reflect that the modelrsquos ability doesnot decrease because of the added data Namely the added data do not have muchnoise

CHQs CHQs +TF-IDF CHQs + BM25 CHQs + Word2VecIteration 1 09682 09709 09693 09710Iteration 2 09655 09731 09708 09699Iteration 3 09676 09742 09747 09710Iteration 4 09682 09692 09709 09726Iteration 5 09731 09775 09743 09715Average 09685 09720 09746 09712

Table 61 Shows the accuracy in different methods for RQE task

added data number p-vlaueTF-IDF 8084 00380BM25 7390 00499Word2Vec 1830 00514

Table 62 Shows the added data number and p-values for RQE task

Table 64 shows added data number and p-values of NLI task We find all valuesare larger than 005 so we cannot reject the null hypothesis We think the size of theadded data is the majority problem since the largest number is only 1287 Howeverin RQE task the added data is more than 8000 There are two condition as we selectadded data for NLI task

bull First we assume only the answers of answered questions in the true questionpairs of RQE task can have some probabilities that can be inferred from their

35

relevant unanswered questions If the two questions are not similar these prob-abilities are very low so we do not consider this case

bull Second the added data must be balanced Otherwise they will bias the modelIn this case we use the smallest number with the class in the result of testingpart as the base to select for added data For example the results show thereare 10 pairs for label 0 5 pairs for label 1 and 15 pairs for label 2 We take 5as the number of each class to generate added data in NLI task On the otherhand NLI task is three-label classification problem Compared with the binaryclassification problem each class for NLI does not have many data

Those are the reasons why the added data for NLI task are smallTable 65 demonstrates the standard deviations of the loss over four methods

When combining Table 64 with this we find the more added data are the largervalue is We think the reason is different kinds of data are combined for trainingand they bias the modelrsquos ability More precisely the added data are question-answerpairs but the original MedNLI data are sentence pairs These two kinds of data havedifferent formats Also the number of extra data is not large (the largest one is onlyabout 1287) So as we add the extra data into the original data it is equally to addnovel data or noises for the model The estimatorrsquos ability would be affected andcause high variance

On the other hand due to the limited hardware we only split the data into 5 partsin k-fold cross-validation so we get five accuracy for each method However thesmall sample size cannot accurately represent the p-value because the more samplesare the more accurate p-value is

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 63 Shows the accuracy in different methods for NLI task

added data number p-vlaueTF-IDF 1 287 03340BM25 1149 04191Word2Vec 459 06661

Table 64 Shows the added data number and p-values for NLI task

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 36: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

sect43 Analysis for Data set 25

Average Length Maximum LengthAnswered Questions 9540 5417Unanswered Questions 8502 1890Answers 9258 6441

Table 41 The length of questions and answers

26 Construction of Health24 Data set

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the pro-posed method for this project In addition the detailed information about fine-tuning and three scoring methods for filtering the similar questions are explained aswell We also mentioned the evaluation metrics

51 Pre-processing

There are many medical concepts in the sentences the forms they can representby either abbreviations or multi-word expressions but they all stand for the samemeaning To expand the abbreviations and make the sentences to have informa-tion as much as possible(eg expand from BP to blood pressure) we use ScispaCy[Neumann et al 2019] an industrial-strength natural language processing approachfor processing biomedical scientific or clinical text The model here we use isen_ner_bionlp13cg_md since it is trained on the BIONLP13CG corpus

When we use TF-IDF BM25 and Word2Vec to calculate the similarity scores be-tween the questions we also do pre-processing for the original questions We firstlyconvert all the words into lowercase Then we remove any punctuation Follow-ing that we use the word tokenization to tokenize the words in order to split thesentences into individual units We also remove the stop words in the sentencesBesides we apply stemming for the words in order to transform the words into theirbases or roots Through these steps the raw questions can be transformed into amore suitable one for the following tasks

52 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model since we onlyneed to plug in the input and output into BERT and fine-tune all the parametersend-to-end [Devlin et al 2018] BERT prepends a [CLS] stands for classification tothe start of each sentence So when doing the classification problem we only needto feed the [CLS] representation into any arbitrary classifier to get the results insteadof summing all the wordsrsquo representations together Besides RQE and NLI are both

27

28 Methodology

classification problems The only difference is one is the binary classification andthe other one is the three-label classification Thus they can share one pre-trainedmodel

When fine-tuning the model for RQE task the data set comprises consumer healthquestions (CHQs) received by the US National Library of medicine (NLM) and Fre-quently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019] There arerespectively 8890 302 and 230 medical question pairs in training validation andtest sets created by Abacha and Demner-Fushman [2016] and the labels are true andfalse After analyzing these data we find there is a mismatch between training andvalidation sets for example the entailing examples in the training set are simpleand the non-entailing examples are completely unrelated However the sentences inthe validation set are complex To reduce the gap and help the model with bettergeneralization we combine all the data sets to do fine-tuning

The data set used for NLI task is called MedNLI which is derived from MIMIC-III [Romanov and Shivade 2018] This set includes 11232 1395 and 1422 questionpairs respectively for training validation and test sets and the labels are contradic-tion entailment and neutral Same as the previous one we combine all the sets andcross-validation is used as well Besides CHQs and MedNLI are the balanced datasets so the model will not bias to the majority class

For here we use SciBERT a pre-trained model for scientific text since the modelimproves the performance on a range of NLP tasks in the scientific domain [Belt-agy et al 2019] When fine-tuning the pre-trained model we take the suggestionsfrom Devlin et al [2018] that setting the Batch size as 16 gradient accumulation stepas 4 and the learning rate as 2e-5 for both two tasks Besides we set epoch for 1and 2 respectively for RQE and NLI tasks The fine-tuning code is modified basedon run_classifierpy which is provided by huggingface 1

53 Method

To crowdsource high-quality data in Health 24 we use RQE and NLI tasks Generallyspeaking our goal is to find the existing answers for the unanswered questions inthe crawled data set We use the fine-tuned RQE model to label if the answeredquestions are similar to the unanswered questions in Health24 After that from thesimilar question pairs (answered questions and unanswered questions) generatedby RQE we use NLI to validate if the similar questionsrsquo answers can be inferredfrom their unanswered questions Since there is no ground truth for both tasks weuse the idea of Pseudo-relevance feedback In general Pseudo-relevance feedback isfrequently used in the information retrieval domain Under the assumption that thetop-ranked documents contain many useful terms which can help to discriminate therelevant documents from irrelevant ones Pseudo-relevance feedback is to formulatenew queries for the second round retrieval by extracting expansion terms from thetop-ranked documents in the first round Through this step the overall performance

1Huggingface GitHub httpsgithubcomhuggingfacetransformers

sect54 Similarity Scores Methods 29

can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 51 For RQE partWe use CHQs data to fine-tune the pre-trained BERT model Then we use the samedata to evaluate the model by using k-fold cross-validation and record the accu-racy We use the fine-tuned model to generate labels for the test set by insertingthe unanswered questions and the potential similar answered questions After thatwe add the generated question pairs with labels into the original CHQs data set tore-evaluate the model and see if the accuracy changes or not One intuitive way tobuild the testing set is to pair one unanswered question with all answered questionsIf there are 10 unanswered questions and 100 answered questions the question pairswill be 1000 However not each of them is similar To reduce the running time aswell as time complexity and improve efficiency we use three similarity scoring meth-ods (TF-IDF BM25 and Word2Vec) to calculate the similarity between questions Weonly take the top-five questions pairs This step can reduce the number of pairs in thetest set by filtering some obviously dissimilar questions pairs The detailed informa-tion is introduced in Section 54 Besides one thing that needs to be noted is that theadded data must be balanced Otherwise the model will bias toward the majorityclass and the result will not be accuracy As for NLI task the steps are similar toRQE task Here is an assumption We assume if the unanswered question and theanswered question are similar then the answer of the answered question has someprobability that can be inferred from the unanswered question Otherwise it cannotbe inferred from the unanswered question Thus the test set in NLI task is formedby the unanswered questions and their similar answered questionsrsquo answers

When using the fine-tuned model to generate labels for the test sets not everypair is allocated to a correct label To reduce noisy data in the training data setwe set the thresholds in the output [Bishop 2006] Instead of setting labels for themaximum probabilities we set 06 and 04 as the thresholds for binary (RQE) andthree-label (NLI) classifications respectively This method can help us to solve someambiguous cases to reduce the misclassification rate If the maximum probability foreach pair is lower than the threshold this pair will be rejected Otherwise we willaccept this pair In general the rejected data will be classified by experts Howeverdue to the limited time of the project and no experts the rejected data are removedfrom the data set

54 Similarity Scores Methods

541 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency it is a statisticalmeasure that reflects how important a word is to a document in a collection of cor-pus [Rajaraman and Ullman 2011] It consists of two parts one is term frequencymeasures the number of times the word occurs in a particular document The other

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TFminus IDF(t d D) = TF(t d)times IDF(t D)

TF(t d) = log(1 + f req(t d))

IDF(t D) =N

count(d isin D t isin d)

where t is the word we want to calculate d is its document D is the sets of doc-uments and N is the total number of the documents TF-IDF can only be used tomeasure one word if we want to calculate the similarity between two sentences wecan use cosine similarity

similarity(A B) =A middot B|A| times |B|

where A and B are both vectors In this case each element in vector A and Brepresent one wordrsquos TF-IDF

542 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find the similardocuments for a particular given document or query Here is the formula

BM25(D Q) =n

sumi=1

IDF(qi D)f (qi D) middot (k1 + 1)

f (qi) + k1 middot (1minus b + b middot |D|davg)

where

bull D is the document and Q is the query qi is the term in the query Q

bull f (qi D) is the number of times qi occurs in the document D

bull |D| is the length of the document

bull davg is the average length of a document

bull k1 and b both are hyper-parameters In general k1 = 2 b = 075

543 Word2Vec

We introduced Word2Vec, a static word embedding method, in Section 2.3.1. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations; in this model, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between sentences.
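A sketch of this sentence-level scoring, assuming the Chiu et al. [2016] vectors are available locally in word2vec format (the file name below is hypothetical):

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("pubmed2018_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    """Sum the vectors of in-vocabulary tokens; out-of-vocabulary words
    (e.g. typos in real patient questions) are simply skipped."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def sentence_similarity(tokens_a, tokens_b):
    a, b = sentence_vector(tokens_a), sentence_vector(tokens_b)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm else 0.0
```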

5.5 Evaluation Metrics

For evaluation, we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP counts the instances correctly assigned to the positive class; FP means the true class is 0 but the predicted class is 1; FN is the opposite of FP, where the true class is 1 but the predicted class is 0; TN counts the instances correctly assigned to the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN)/total. Recall measures the percentage of all relevant results that are correctly classified; its formula is TP/(TP + FN). For multi-class classification, the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The confusion matrix.
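Reading the two formulas off the table, for example with hypothetical counts TP = 50, FP = 3, FN = 2 and TN = 45:

```python
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    return tp / (tp + fn)

print(accuracy(50, 3, 2, 45))  # 0.95
print(recall(50, 2))           # ~0.9615
```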


Figure 5.1: The process of the RQE and NLI tasks. The upper pipeline is RQE and the lower one is NLI. For the RQE task: step 1, use the CHQs data to fine-tune the BERT model; step 2, use the CHQs data to evaluate the model with k-fold cross-validation and record each accuracy; step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar question; step 4, pair the question pairs with the generated labels; step 5, add the labelled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test inputs are question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks, respectively. For each task, we implement three different similarity score methods and compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result on the original data. In order to validate our approach, we calculate the p-value, which helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, the t-test needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
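As a sketch, the p-value for the CHQs versus CHQs + TF-IDF columns of Table 6.1 can be computed with SciPy's pooled-variance (unpaired) t-test; the exact value also depends on whether a one-sided or two-sided test is used:

```python
from scipy import stats

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]  # Table 6.1, column 1
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # Table 6.1, column 2

t, p = stats.ttest_ind(chqs_tfidf, chqs)  # equal_var=True matches the formula
print(t, p)
```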

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value of Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model which we use for Word2Vec is trained on PubMed, which comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is thus a mismatch between the words in the pre-trained model and the words in the real questions, and some sentence vectors lack important information because words in the real questions have no corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find that the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes (0.9786). This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task.

          Added data number   p-value
TF-IDF    8084                0.0380
BM25      7390                0.0499
Word2Vec  1830                0.0514

Table 6.2: Number of added pairs and p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs are added. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number per class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI has less data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So, when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracy values for each method. Such a small sample size cannot represent the p-value accurately, because the more samples there are, the more accurate the p-value is.

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task.

          Added data number   p-value
TF-IDF    1287                0.3340
BM25      1149                0.4191
Word2Vec  459                 0.6661

Table 6.4: Number of added pairs and p-values for the NLI task.


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: Standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas that future work can pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24; it contains 23771 questions with answers and 1950 questions without answers. Our goal is to find answers for these 1950 questions from the existing answers. Basically, we use RQE to locate answered questions similar to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we have no ground truth with which to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1950 questions that do not have answers, and after the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on a small data set. So, as the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task differ from the original NLI data. As the next step, we can find more suitable data of a similar type to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As the next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As the next step, we can do hyperparameter optimization; we can also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association.

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379.

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375.

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611.

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3 (Feb 2003), 1137-1155.

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470.

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer.

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250.

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE.

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174.

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209(8), 342-347.

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054.

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11 (Feb 2010), 625-660.

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE.

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text.

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136.

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE.

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669.

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63.

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171.

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732.

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543.

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987.

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press.

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109.

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752.

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681.

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness.

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008.

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652.

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426.

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388.

Appendices

Appendix 1: Project Description and Contract

From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019, 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, can I invite you as the project examiner for these projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py: three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py: generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py: store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface [3].

load_file.py: converts the raw source data to a dataframe.

pre_processed_data.py: pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

Figure 1: The tree structure of the health24 folder.

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip52236464/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py

The .py files run in PyCharm; the .ipynb files run in Google Colab [4].

[4] Colab: https://colab.research.google.com

Appendix 3: Readme File

For this project, there are two kinds of files: .py files, which are used to process the data, and .ipynb files, which are used for fine-tuning, evaluating and generating labels. The .ipynb files run in Colab.

Preparation:

pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.


4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 38: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Chapter 5

Methodology

In this chapter we describe the pre-processing of the input sentences and the proposed method for this project. In addition, the details of fine-tuning and of the three scoring methods used to filter similar questions are explained. We also describe the evaluation metrics.

5.1 Pre-processing

There are many medical concepts in the sentences; they can be represented either by abbreviations or by multi-word expressions, but both forms stand for the same meaning. To expand the abbreviations and make the sentences carry as much information as possible (e.g. expanding "BP" to "blood pressure"), we use ScispaCy [Neumann et al., 2019], an industrial-strength natural language processing package for processing biomedical, scientific or clinical text. The model we use is en_ner_bionlp13cg_md, since it is trained on the BIONLP13CG corpus.

When we use TF-IDF, BM25 and Word2Vec to calculate the similarity scores between the questions, we also pre-process the original questions. We first convert all the words into lowercase; then we remove any punctuation. Following that, we apply word tokenization to split the sentences into individual units. We also remove the stop words in the sentences. Besides, we apply stemming in order to transform the words into their bases or roots. Through these steps, the raw questions are transformed into a form more suitable for the following tasks.
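To make these steps concrete, the listing below is a minimal sketch of what this pre-processing could look like in Python with NLTK; the function name and the example question are illustrative assumptions rather than the project's actual code, and it assumes the NLTK punkt and stopwords corpora have been downloaded.

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(question):
    # 1. lowercase the question
    text = question.lower()
    # 2. remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. tokenize into individual units
    tokens = word_tokenize(text)
    # 4. remove stop words, then 5. stem the remaining words
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(preprocess("What are the symptoms of high blood pressure?"))
# ['symptom', 'high', 'blood', 'pressur']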

5.2 Fine-Tuning

For the downstream tasks it is easy to fine-tune the pre-trained model, since we only need to plug the input and output into BERT and fine-tune all the parameters end-to-end [Devlin et al., 2018]. BERT prepends a [CLS] token (short for classification) to the start of each sequence, so when doing a classification problem we only need to feed the [CLS] representation into an arbitrary classifier to get the results, instead of summing all the words' representations together. Besides, RQE and NLI are both classification problems; the only difference is that one is a binary classification and the other is a three-label classification. Thus they can share one pre-trained model.

When fine-tuning the model for the RQE task, the data set comprises consumer health questions (CHQs) received by the U.S. National Library of Medicine (NLM) and Frequently Asked Questions (FAQs) from NIH institutes [Abacha et al., 2019]. There are respectively 8890, 302 and 230 medical question pairs in the training, validation and test sets created by Abacha and Demner-Fushman [2016], and the labels are true and false. After analyzing these data, we find there is a mismatch between the training and validation sets: for example, the entailing examples in the training set are simple and the non-entailing examples are completely unrelated, whereas the sentences in the validation set are complex. To reduce the gap and help the model generalize better, we combine all the data sets to do fine-tuning.

The data set used for the NLI task is called MedNLI, which is derived from MIMIC-III [Romanov and Shivade, 2018]. This set includes 11232, 1395 and 1422 sentence pairs respectively for the training, validation and test sets, and the labels are contradiction, entailment and neutral. As with the previous task, we combine all the sets, and cross-validation is used as well. Besides, CHQs and MedNLI are balanced data sets, so the model will not be biased toward a majority class.

Here we use SciBERT, a pre-trained model for scientific text, since it improves performance on a range of NLP tasks in the scientific domain [Beltagy et al., 2019]. When fine-tuning the pre-trained model, we take the suggestions from Devlin et al. [2018], setting the batch size to 16, the gradient accumulation steps to 4, and the learning rate to 2e-5 for both tasks. Besides, we set the number of epochs to 1 and 2 respectively for the RQE and NLI tasks. The fine-tuning code is modified based on run_classifier.py, which is provided by huggingface [1].
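As a rough illustration, a minimal fine-tuning loop under these settings could look like the sketch below. The Hugging Face model id for SciBERT and the toy batch are assumptions, and the actual project uses the modified run_classifier.py rather than this loop.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 2 labels for RQE, 3 for NLI
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accum_steps = 4  # gradient accumulation steps

# Toy batch list; each element is (questions_a, questions_b, labels).
batches = [(["Is 140/90 a high BP?"],
            ["Is a blood pressure of 140/90 too high?"],
            torch.tensor([1]))]

model.train()
for step, (qa, qb, labels) in enumerate(batches):
    enc = tokenizer(qa, qb, truncation=True, padding=True, return_tensors="pt")
    loss = model(**enc, labels=labels).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:  # update once every 4 batches of size 16
        optimizer.step()
        optimizer.zero_grad()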

5.3 Method

To crowdsource high-quality data from Health 24, we use the RQE and NLI tasks. Generally speaking, our goal is to find existing answers for the unanswered questions in the crawled data set. We use the fine-tuned RQE model to label whether the answered questions are similar to the unanswered questions in Health24. After that, from the similar question pairs (answered questions and unanswered questions) generated by RQE, we use NLI to validate whether the similar questions' answers can be inferred from their unanswered questions. Since there is no ground truth for either task, we use the idea of pseudo-relevance feedback. Pseudo-relevance feedback is frequently used in the information retrieval domain: under the assumption that the top-ranked documents contain many useful terms which can help to discriminate the relevant documents from irrelevant ones, pseudo-relevance feedback formulates new queries for a second retrieval round by extracting expansion terms from the top-ranked documents of the first round. Through this step the overall performance can be improved, since some relevant documents missed in the first round can be retrieved in the second round [Cao et al., 2008]. In our project we mimic this idea to test whether the generated data have high accuracy.

[1] Huggingface GitHub: https://github.com/huggingface/transformers

The detailed processes of the two tasks are shown in Figure 5.1. For the RQE part, we use the CHQs data to fine-tune the pre-trained BERT model. Then we use the same data to evaluate the model by using k-fold cross-validation and record the accuracy. We use the fine-tuned model to generate labels for the test set by feeding in the unanswered questions and the potentially similar answered questions. After that, we add the generated question pairs with labels into the original CHQs data set to re-evaluate the model and see whether the accuracy changes. One intuitive way to build the test set is to pair one unanswered question with all answered questions: if there are 10 unanswered questions and 100 answered questions, there will be 1000 question pairs. However, not every pair is similar. To reduce the running time as well as the time complexity, and to improve efficiency, we use three similarity scoring methods (TF-IDF, BM25 and Word2Vec) to calculate the similarity between questions, and we only take the top-five question pairs, as sketched below. This step reduces the number of pairs in the test set by filtering out some obviously dissimilar question pairs; the details are introduced in Section 5.4. Besides, one thing that needs to be noted is that the added data must be balanced; otherwise the model will be biased toward the majority class and the result will not be accurate. As for the NLI task, the steps are similar to the RQE task, under one assumption: we assume that if the unanswered question and the answered question are similar, then the answered question's answer has some probability of being inferable from the unanswered question; otherwise it cannot be inferred from the unanswered question. Thus the test set in the NLI task is formed by the unanswered questions and their similar answered questions' answers.
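The sketch below illustrates the top-five filtering step; top_k_pairs and score are hypothetical names, where score stands for any of the three similarity functions (TF-IDF, BM25 or Word2Vec).

def top_k_pairs(unanswered, answered, score, k=5):
    # Pair each unanswered question only with its k highest-scoring
    # answered questions, instead of with every answered question.
    pairs = []
    for uq in unanswered:
        ranked = sorted(answered, key=lambda aq: score(uq, aq), reverse=True)
        pairs.extend((uq, aq) for aq in ranked[:k])
    return pairs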

When using the fine-tuned model to generate labels for the test sets, not every pair is allocated a correct label. To reduce noisy data in the training data set, we set thresholds on the output [Bishop, 2006]. Instead of labelling by the maximum probability alone, we set 0.6 and 0.4 as the thresholds for the binary (RQE) and three-label (NLI) classifications respectively. This method helps us handle ambiguous cases and reduce the misclassification rate: if the maximum probability for a pair is lower than the threshold, the pair is rejected; otherwise we accept it. In general the rejected data would be classified by experts; however, due to the limited time of the project and the lack of experts, the rejected data are removed from the data set.
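A minimal sketch of this rejection rule, assuming the model outputs a softmax probability vector per pair (the function name is illustrative):

import numpy as np

def assign_label(probs, threshold):
    # Return the argmax label, or None if the pair is rejected.
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else None

print(assign_label(np.array([0.55, 0.45]), threshold=0.6))  # None: rejected
print(assign_label(np.array([0.82, 0.18]), threshold=0.6))  # 0: accepted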

5.4 Similarity Scores Methods

5.4.1 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a statistical measure that reflects how important a word is to a document in a collection or corpus [Rajaraman and Ullman, 2011]. It consists of two parts: term frequency, which measures the number of times the word occurs in a particular document, and inverse document frequency, which measures how much information the word provides (a result close to 0 means it is a common word, and vice versa). These two concepts form TF-IDF; here is the formula:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)$$

$$\mathrm{TF}(t, d) = \log(1 + \mathit{freq}(t, d))$$

$$\mathrm{IDF}(t, D) = \frac{N}{\mathrm{count}(\{d \in D : t \in d\})}$$

where t is the word we want to score, d is its document, D is the set of documents, and N is the total number of documents. TF-IDF only scores a single word; if we want to calculate the similarity between two sentences, we can use cosine similarity:

$$\mathrm{similarity}(A, B) = \frac{A \cdot B}{|A| \times |B|}$$

where A and B are both vectors; in this case, each element of A and B represents one word's TF-IDF score.
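In practice this can be computed with scikit-learn, as in the sketch below; note that TfidfVectorizer uses a smoothed variant of the IDF formula above, so scores may differ slightly from a hand computation, and the example questions are made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

answered = ["what causes high blood pressure", "how to treat a migraine"]
unanswered = ["why is my blood pressure high"]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(answered + unanswered)
# Similarity of the unanswered question (last row) to each answered question.
print(cosine_similarity(vectors[-1], vectors[:-1]))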

5.4.2 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find documents similar to a particular given document or query. Here is the formula:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where

• D is the document and Q is the query; q_i is the i-th term in the query Q.

• f(q_i, D) is the number of times q_i occurs in the document D.

• |D| is the length of the document.

• d_{avg} is the average length of a document.

• k_1 and b are both hyper-parameters; in general, k_1 = 2 and b = 0.75.

Here IDF is computed over the whole document collection, as in Section 5.4.1.
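A minimal pure-Python sketch of this formula with k_1 = 2 and b = 0.75; the function name and the toy documents are assumptions.

import math

def bm25(query, doc, docs, k1=2.0, b=0.75):
    d_avg = sum(len(d) for d in docs) / len(docs)  # average document length
    score = 0.0
    for term in query:
        n_t = sum(1 for d in docs if term in d)  # document frequency of the term
        if n_t == 0:
            continue
        idf = math.log(len(docs) / n_t)
        f = doc.count(term)  # term frequency in this document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

docs = [["high", "blood", "pressure"], ["migraine", "treatment"]]
print(bm25(["blood", "pressure"], docs[0], docs))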

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain there has been work on providing high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors were trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
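A minimal sketch of this summed-embedding similarity with gensim; the path to the Chiu et al. [2016] vectors is an assumption, and out-of-vocabulary words are simply skipped, which is exactly the weakness discussed in Chapter 6.

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical local copy of the 200-dimensional PubMed vectors.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(tokens):
    # Sum the vectors of the in-vocabulary words.
    known = [vectors[t] for t in tokens if t in vectors]
    return np.sum(known, axis=0) if known else np.zeros(vectors.vector_size)

def sentence_similarity(tokens_a, tokens_b):
    a, b = sentence_vector(tokens_a), sentence_vector(tokens_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))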

5.5 Evaluation Metrics

For evaluation we use accuracy, recall and the confusion matrix to measure the performance of the models. A confusion matrix is a table that describes the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align with the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align with the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). For multi-category classification the confusion matrix is more complex, but the idea is the same.

                          Actual Positive (1)   Actual Negative (0)
Predicted Positive (1)    TP                    FP
Predicted Negative (0)    FN                    TN

Table 5.1: The binary confusion matrix.
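As a small worked example of these formulas (the counts are made up):

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    return tp / (tp + fn)

print(accuracy(tp=90, fp=10, fn=5, tn=95))  # 0.925
print(recall(tp=90, fn=5))                  # about 0.947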


Figure 5.1: The diagram shows how the RQE and NLI tasks work; the upper part is RQE and the lower part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model by k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each pair of an unanswered question and a potentially similar question. Step 4, pair the question pairs with the generated labels. Step 5, add the labelled question pairs to CHQs, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the testing input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task we implement three different similarity score methods and compare their results. From these tables we can see that the average accuracy does not change significantly after adding extra data, compared with the result of the original data. In order to validate our approach, we calculate the p-value. This helps us determine the significance of the results in relation to the null hypothesis: if the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since such a value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between the two variables; the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we calculate the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
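This pooled-variance test is what scipy.stats.ttest_ind computes. The sketch below applies it to the five cross-validation accuracies from Table 6.1; the report's exact p-values may differ depending on whether a one- or two-sided variant was used.

from scipy.stats import ttest_ind

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]  # original data
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # with TF-IDF added data

t_stat, p_two_sided = ttest_ind(chqs_tfidf, chqs, equal_var=True)
print(t_stat, p_two_sided)  # a one-sided p-value is p_two_sided / 2 when t > 0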

From Table 6.2 we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis, namely that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the Word2Vec p-value is higher than 0.05. We can explain this from two perspectives:

• From the scoring method's perspective: the pre-trained embedding model that we use for Word2Vec was created by training on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal, with few typos. However, the questions in Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions; thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them are selected, which is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not contain much noise.

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

           added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added pairs and p-values for the RQE task.

Table 6.4 shows the added data numbers and p-values for the NLI task. We find that all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the major problem, since the largest number is only 1287, whereas in the RQE task the added data number more than 8000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, these probabilities are very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case we use the smallest per-class count in the test results as the base for selecting added data. For example, if the results show 10 pairs for label 0, 5 pairs for label 1 and 15 pairs for label 2, we take 5 as the number for each class when generating the added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287). So when we add the extra data into the original data, it effectively adds novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to limited hardware we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. Such a small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task.

           added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added pairs and p-values for the NLI task.


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude the project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them to the biomedical domain, such as patient question answering, raises many challenges, such as the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used by data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health 24; within it there are 23771 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate answered questions that are similar to the unanswered questions. Intuitively, to find whether two questions are similar, we could pair one unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.

7.2 Future Work

Because of the limited time of the project, there are several places to improve:

• For this project we only have 1950 questions without answers, and under the two conditions applied in the NLI task the amount of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit with a small amount of data. So, as a next step, we could expand our data; it would then be more persuasive to validate our idea.

• In the previous section we argued that the added data for the NLI task differ from the original NLI data. As a next step, we could find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could do hyperparameter optimization; we could also increase k in k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370-379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362-375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606-3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137-1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462-470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243-250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1-8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166-174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342-347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042-13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625-660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645-6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1-6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54-63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165-171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998-6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415-426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380-388. (cited on page 19)

Appendices



Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for these projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au


Appendix 2: Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the .py files; and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health 24 website; result.csv stores the result. In this folder the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files that filter the similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model; these codes are from run_classifier.py in huggingface [3].

load_file.py converts the raw source data to dataframes; pre_processed_data pre-processes the sentences.

For .ipynb files:

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is used to fine-tune the BERT model and do k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab [4].

[4] Colab: https://colab.research.google.com


Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used to fine-tune, evaluate and generate labels, and they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to data frames (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this point we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs; they can also be found in the result folder.

• Acknowledgments
• Abstract
• Contents
• Introduction
  • Motivation
  • Main Contribution
  • Outline
• Background
  • Background on NLP
    • Neural Network Language Model
    • RNN and LSTM
    • Transformer
  • Word Embedding
    • Static Word Embedding
      • Word2Vec
      • GloVe
    • Dynamic Word Embedding
      • Pre-training
      • ELMO
      • BERT
  • Tasks
    • Recognizing question entailment (RQE)
    • Natural Language Inference (NLI)
• Related Work
  • RQE
  • NLI
  • Augment Data Size
• Construction of Health24 Data set
  • Motivation
  • Collecting Data set
  • Analysis for Data set
• Methodology
  • Pre-processing
  • Fine-Tuning
  • Method
  • Similarity Scores Methods
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Best Matching 25 (BM25)
    • Word2Vec
  • Evaluation Metrics
• Discussion on Results
• Conclusion and Future Work
  • Conclusion
  • Future Work
• Bibliography
• Appendices
• Appendix 1: Project Description and Contract
• Appendix 2: Artefact Description
• Appendix 3: Readme File
Page 39: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

28 Methodology

classification problems The only difference is one is the binary classification andthe other one is the three-label classification Thus they can share one pre-trainedmodel

When fine-tuning the model for RQE task the data set comprises consumer healthquestions (CHQs) received by the US National Library of medicine (NLM) and Fre-quently Asked Questions (FAQs) from NIH institutes [Abacha et al 2019] There arerespectively 8890 302 and 230 medical question pairs in training validation andtest sets created by Abacha and Demner-Fushman [2016] and the labels are true andfalse After analyzing these data we find there is a mismatch between training andvalidation sets for example the entailing examples in the training set are simpleand the non-entailing examples are completely unrelated However the sentences inthe validation set are complex To reduce the gap and help the model with bettergeneralization we combine all the data sets to do fine-tuning

The data set used for NLI task is called MedNLI which is derived from MIMIC-III [Romanov and Shivade 2018] This set includes 11232 1395 and 1422 questionpairs respectively for training validation and test sets and the labels are contradic-tion entailment and neutral Same as the previous one we combine all the sets andcross-validation is used as well Besides CHQs and MedNLI are the balanced datasets so the model will not bias to the majority class

For here we use SciBERT a pre-trained model for scientific text since the modelimproves the performance on a range of NLP tasks in the scientific domain [Belt-agy et al 2019] When fine-tuning the pre-trained model we take the suggestionsfrom Devlin et al [2018] that setting the Batch size as 16 gradient accumulation stepas 4 and the learning rate as 2e-5 for both two tasks Besides we set epoch for 1and 2 respectively for RQE and NLI tasks The fine-tuning code is modified basedon run_classifierpy which is provided by huggingface 1

53 Method

To crowdsource high-quality data in Health 24 we use RQE and NLI tasks Generallyspeaking our goal is to find the existing answers for the unanswered questions inthe crawled data set We use the fine-tuned RQE model to label if the answeredquestions are similar to the unanswered questions in Health24 After that from thesimilar question pairs (answered questions and unanswered questions) generatedby RQE we use NLI to validate if the similar questionsrsquo answers can be inferredfrom their unanswered questions Since there is no ground truth for both tasks weuse the idea of Pseudo-relevance feedback In general Pseudo-relevance feedback isfrequently used in the information retrieval domain Under the assumption that thetop-ranked documents contain many useful terms which can help to discriminate therelevant documents from irrelevant ones Pseudo-relevance feedback is to formulatenew queries for the second round retrieval by extracting expansion terms from thetop-ranked documents in the first round Through this step the overall performance

1Huggingface GitHub httpsgithubcomhuggingfacetransformers

sect54 Similarity Scores Methods 29

can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 51 For RQE partWe use CHQs data to fine-tune the pre-trained BERT model Then we use the samedata to evaluate the model by using k-fold cross-validation and record the accu-racy We use the fine-tuned model to generate labels for the test set by insertingthe unanswered questions and the potential similar answered questions After thatwe add the generated question pairs with labels into the original CHQs data set tore-evaluate the model and see if the accuracy changes or not One intuitive way tobuild the testing set is to pair one unanswered question with all answered questionsIf there are 10 unanswered questions and 100 answered questions the question pairswill be 1000 However not each of them is similar To reduce the running time aswell as time complexity and improve efficiency we use three similarity scoring meth-ods (TF-IDF BM25 and Word2Vec) to calculate the similarity between questions Weonly take the top-five questions pairs This step can reduce the number of pairs in thetest set by filtering some obviously dissimilar questions pairs The detailed informa-tion is introduced in Section 54 Besides one thing that needs to be noted is that theadded data must be balanced Otherwise the model will bias toward the majorityclass and the result will not be accuracy As for NLI task the steps are similar toRQE task Here is an assumption We assume if the unanswered question and theanswered question are similar then the answer of the answered question has someprobability that can be inferred from the unanswered question Otherwise it cannotbe inferred from the unanswered question Thus the test set in NLI task is formedby the unanswered questions and their similar answered questionsrsquo answers

When using the fine-tuned model to generate labels for the test sets not everypair is allocated to a correct label To reduce noisy data in the training data setwe set the thresholds in the output [Bishop 2006] Instead of setting labels for themaximum probabilities we set 06 and 04 as the thresholds for binary (RQE) andthree-label (NLI) classifications respectively This method can help us to solve someambiguous cases to reduce the misclassification rate If the maximum probability foreach pair is lower than the threshold this pair will be rejected Otherwise we willaccept this pair In general the rejected data will be classified by experts Howeverdue to the limited time of the project and no experts the rejected data are removedfrom the data set

54 Similarity Scores Methods

541 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency it is a statisticalmeasure that reflects how important a word is to a document in a collection of cor-pus [Rajaraman and Ullman 2011] It consists of two parts one is term frequencymeasures the number of times the word occurs in a particular document The other

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TFminus IDF(t d D) = TF(t d)times IDF(t D)

TF(t d) = log(1 + f req(t d))

IDF(t D) =N

count(d isin D t isin d)

where t is the word we want to calculate d is its document D is the sets of doc-uments and N is the total number of the documents TF-IDF can only be used tomeasure one word if we want to calculate the similarity between two sentences wecan use cosine similarity

similarity(A B) =A middot B|A| times |B|

where A and B are both vectors In this case each element in vector A and Brepresent one wordrsquos TF-IDF

542 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find the similardocuments for a particular given document or query Here is the formula

BM25(D Q) =n

sumi=1

IDF(qi D)f (qi D) middot (k1 + 1)

f (qi) + k1 middot (1minus b + b middot |D|davg)

where

bull D is the document and Q is the query qi is the term in the query Q

bull f (qi D) is the number of times qi occurs in the document D

bull |D| is the length of the document

bull davg is the average length of a document

bull k1 and b both are hyper-parameters In general k1 = 2 b = 075

543 Word2Vec

We introduce Word2Vec in section 231 which is a way of static word embeddingIn the medical domain there has been work to provide high-quality domain specificembeddings For here we use the pre-trained embedding which created by Chiuet al [2016] to convert the words into numeric representations In this module 22million 200-dimensional word vectors are trained on PubMed (289 tokens) [Tilbury

sect55 Evaluation Metrics 31

2018] For each sentence we add each word embedding vector together and thenuse cosine similarity to calculate the similarity between the sentences

55 Evaluation Metrics

For evaluation measure we use the accuracy recall and confusion matrix to mea-sure the performance of the models Confusion matrix is a table to describe theperformance of the classification model on the test data set which the true valuesare known Here we only show the binary confusion matrix since it is easier to un-derstand (Table 51) In the table the row represents the predicated class and thecolumn represents the actual class TP means the instances align to the positive classFP means the true class is 0 and the predicated class is 1 FN is opposite to FP whichthe true class is 1 and the predicated class is 0 TN means the instances align tothe negative class Accuracy is to measure how often is the classifier correct and itsformula is accuracy = TP + TNtotal While recall is to measure the percentage oftotal relevant results correctly classified Its formula is TPTP + FN As for multi-categories classification the confusion matrix is more complex but it has the sameidea

Actual Positive (1) Actual Negative (0)Predicated Positive (1) TP FPPredicated Negative (0) FN TN

Table 51 Table for the confusion matrix

32 Methodology

Figure 51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 use CHQs data tofine-tune BERT model Step 2 use CHQs data to evaluate the model by using k-foldcross-validation and record each accuracy Step 3 use fine-tuned BERT model togenerate label for unanswered question and potential similar question Step 4 pairquestion pairs with generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracy with the previousone As for NLI it has the same process except the input of testing is question-answer

pairs

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task the null hypothesis is that the generated datacannot improve the accuracy The alternative hypothesis is the generated data canimprove the accuracy Before we get a p-value the t-test needs to be calculated Herewe calculate the unpaired t-test since we use different data set to do testing in eachiteration Below is the formula of unpaired t-test

Unpaired t-test =x1 minus x2radic

s2(1n1

+1n2

)

where

s2 =sumn1

i=1(xi minus x1)2 + sumn2

j=1(xj minus x2)2

n1 + n2 minus 2

From Table 62 we can see only the results of TF-IDF and BM25 are less than 005In this case we can reject the null hypothesis and accept the alternative hypothesiswhere the generated data can increase the accuracy Since we mimic the way ofpseudo-relevance feedback to evaluate the generated data these p-values can showthat the generated question pairs have high accuracy However Word2Vec is higherthan 005 We can explain this from two perspectives

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec was trained on PubMed, whose sources comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the vocabulary is formal, with few typos. The questions on Health24, however, are asked by laypeople, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: owing to the inaccurate word embeddings, we find that the majority of the labels generated by the RQE model are 0; in other words, most of the question pairs are judged dissimilar. To keep the data balanced, only a small part of them can be selected, which is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data may not help fine-tune the model as well as a larger amount would.

On the other hand, recall measures the percentage of the total relevant results that are correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not introduce much noise.

             CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682   0.9709          0.9693        0.9710
Iteration 2  0.9655   0.9731          0.9708        0.9699
Iteration 3  0.9676   0.9742          0.9747        0.9710
Iteration 4  0.9682   0.9692          0.9709        0.9726
Iteration 5  0.9731   0.9775          0.9743        0.9715
Average      0.9685   0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task

           Added data count   p-value
TF-IDF     8084               0.0380
BM25       7390               0.0499
Word2Vec   1830               0.0514

Table 6.2: Number of added pairs and p-values for the RQE task

Table 6.4 shows the number of added pairs and the p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the main problem is the size of the added data, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs are added. Two conditions apply when we select added data for the NLI task (a sketch of the balancing step follows the list):

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. We therefore use the smallest per-class count in the test results as the base for selecting added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number of pairs per class when generating added data for the NLI task. Moreover, NLI is a three-label classification problem, so compared with a binary classification problem each class has less data.
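The balancing step can be sketched as follows (the function and variable names are illustrative, not taken from our code base):

import random

def balance_by_smallest_class(pairs, labels, seed=0):
    """Downsample every class to the size of the smallest class.

    `pairs` are candidate question-answer pairs and `labels` are the
    NLI labels (0, 1, or 2) generated for them by the fine-tuned model.
    """
    rng = random.Random(seed)
    by_label = {}
    for pair, label in zip(pairs, labels):
        by_label.setdefault(label, []).append(pair)
    base = min(len(group) for group in by_label.values())  # e.g. 5 in the example above
    balanced = []
    for label, group in by_label.items():
        balanced.extend((pair, label) for pair in rng.sample(group, base))
    return balanced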

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger this value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, whereas the original MedNLI data are sentence pairs, so the two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287 pairs). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model; the estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware we only split the data into 5 parts in k-fold cross-validation, so we obtain five accuracies for each method. Such a small sample size cannot represent the p-value very accurately, because more samples yield a more reliable p-value.
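For reference, the evaluation loop has the following shape. This is a self-contained sketch with placeholder data and a trivial majority-class model standing in for the fine-tuned BERT model:

import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 8))   # placeholder for encoded question pairs
labels = rng.integers(0, 2, size=100)  # placeholder RQE labels

accuracies = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
    # A trivial majority-class "model" keeps the sketch runnable;
    # the report fine-tunes BERT on the training fold instead.
    majority = int(labels[train_idx].mean() >= 0.5)
    accuracies.append(float((labels[test_idx] == majority).mean()))
print(accuracies, np.mean(accuracies))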

             MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527   0.7601            0.7562          0.7453
Iteration 2  0.7522   0.7426            0.7426          0.7515
Iteration 3  0.7504   0.7469            0.7554          0.7540
Iteration 4  0.7402   0.7593            0.7536          0.7430
Iteration 5  0.7640   0.7657            0.7586          0.7569
Average      0.7523   0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task

           Added data count   p-value
TF-IDF     1287               0.3340
BM25       1149               0.4191
Word2Vec   459                0.6661

Table 6.4: Number of added pairs and p-values for the NLI task


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: Standard deviation of the loss for the four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them to the biomedical domain, such as patient question answering, faces many challenges, including the limited amount of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data using RQE and NLI. The data set we use is crawled from Health24. It contains 23771 questions with answers and 1950 questions without answers, and our goal is to find answers for these 1950 questions among the existing answers. Basically, we use RQE to locate answered questions that are similar to the unanswered ones. Intuitively, to find whether two questions are similar, we could pair each unanswered question with all the answered questions; however, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task, which roughly removes some clearly dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise they do not. We also set thresholds (0.6 and 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small amount of added data, but it still plays a role.
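The rejection rule can be sketched in a few lines (a minimal illustration, assuming softmax outputs from the fine-tuned models):

import numpy as np

def assign_label(probs, threshold):
    """Return the argmax label if its probability clears the threshold, else None.

    `threshold` is 0.6 for the binary RQE task and 0.4 for the
    three-label NLI task; rejected pairs are dropped from the data.
    """
    best = int(np.argmax(probs))
    return best if probs[best] >= threshold else None

print(assign_label([0.55, 0.45], threshold=0.6))     # None: rejected
print(assign_label([0.2, 0.1, 0.7], threshold=0.4))  # 2: accepted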

7.2 Future Work

Because of the limited time of the project, there are several aspects that could be improved:

• For this project we only have 1950 questions without answers, and after the two conditions applied in the NLI task the amount of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on small data sets. So, as a next step, we could expand our data; validating our idea would then be more persuasive.

• In the previous section we argued that the added data for the NLI task differ from the original NLI data. As a next step, we could find a more suitable data set whose type is similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. As a next step, we could optimize the way Word2Vec is used.

• Fine-tuning and label generation are done in Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. As a next step, we could perform hyperparameter optimization; we could also increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl the data from the Health24 website, and result.csv stores the result. In this folder the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by the RQE and NLI test models.

Figure 1: The tree structure of the health24 folder.

For .py files:

bm25.py, tfidf.py and word_embedding.py are the three files that filter similar question pairs; the results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and the processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface [3].

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and performs k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

The .py files run in PyCharm; the .ipynb files run in Google Colab [4].

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
[4] Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating, and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to data frames (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.

51

52 Appendix 3 Readme File

2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second-last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

Page 40: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

sect54 Similarity Scores Methods 29

can be improved since some relevant documents missed in the first round can beretrieved in the second round [Cao et al 2008] In our project we mimic this way totest if the generated data have high accuracy

The detailed processes of the two tasks are shown in Figure 51 For RQE partWe use CHQs data to fine-tune the pre-trained BERT model Then we use the samedata to evaluate the model by using k-fold cross-validation and record the accu-racy We use the fine-tuned model to generate labels for the test set by insertingthe unanswered questions and the potential similar answered questions After thatwe add the generated question pairs with labels into the original CHQs data set tore-evaluate the model and see if the accuracy changes or not One intuitive way tobuild the testing set is to pair one unanswered question with all answered questionsIf there are 10 unanswered questions and 100 answered questions the question pairswill be 1000 However not each of them is similar To reduce the running time aswell as time complexity and improve efficiency we use three similarity scoring meth-ods (TF-IDF BM25 and Word2Vec) to calculate the similarity between questions Weonly take the top-five questions pairs This step can reduce the number of pairs in thetest set by filtering some obviously dissimilar questions pairs The detailed informa-tion is introduced in Section 54 Besides one thing that needs to be noted is that theadded data must be balanced Otherwise the model will bias toward the majorityclass and the result will not be accuracy As for NLI task the steps are similar toRQE task Here is an assumption We assume if the unanswered question and theanswered question are similar then the answer of the answered question has someprobability that can be inferred from the unanswered question Otherwise it cannotbe inferred from the unanswered question Thus the test set in NLI task is formedby the unanswered questions and their similar answered questionsrsquo answers

When using the fine-tuned model to generate labels for the test sets not everypair is allocated to a correct label To reduce noisy data in the training data setwe set the thresholds in the output [Bishop 2006] Instead of setting labels for themaximum probabilities we set 06 and 04 as the thresholds for binary (RQE) andthree-label (NLI) classifications respectively This method can help us to solve someambiguous cases to reduce the misclassification rate If the maximum probability foreach pair is lower than the threshold this pair will be rejected Otherwise we willaccept this pair In general the rejected data will be classified by experts Howeverdue to the limited time of the project and no experts the rejected data are removedfrom the data set

54 Similarity Scores Methods

541 Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF stands for Term Frequency-Inverse Document Frequency it is a statisticalmeasure that reflects how important a word is to a document in a collection of cor-pus [Rajaraman and Ullman 2011] It consists of two parts one is term frequencymeasures the number of times the word occurs in a particular document The other

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TFminus IDF(t d D) = TF(t d)times IDF(t D)

TF(t d) = log(1 + f req(t d))

IDF(t D) =N

count(d isin D t isin d)

where t is the word we want to calculate d is its document D is the sets of doc-uments and N is the total number of the documents TF-IDF can only be used tomeasure one word if we want to calculate the similarity between two sentences wecan use cosine similarity

similarity(A B) =A middot B|A| times |B|

where A and B are both vectors In this case each element in vector A and Brepresent one wordrsquos TF-IDF

542 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al 1995] is another way to find the similardocuments for a particular given document or query Here is the formula

BM25(D Q) =n

sumi=1

IDF(qi D)f (qi D) middot (k1 + 1)

f (qi) + k1 middot (1minus b + b middot |D|davg)

where

bull D is the document and Q is the query qi is the term in the query Q

bull f (qi D) is the number of times qi occurs in the document D

bull |D| is the length of the document

bull davg is the average length of a document

bull k1 and b both are hyper-parameters In general k1 = 2 b = 075

543 Word2Vec

We introduce Word2Vec in section 231 which is a way of static word embeddingIn the medical domain there has been work to provide high-quality domain specificembeddings For here we use the pre-trained embedding which created by Chiuet al [2016] to convert the words into numeric representations In this module 22million 200-dimensional word vectors are trained on PubMed (289 tokens) [Tilbury

sect55 Evaluation Metrics 31

2018] For each sentence we add each word embedding vector together and thenuse cosine similarity to calculate the similarity between the sentences

55 Evaluation Metrics

For evaluation measure we use the accuracy recall and confusion matrix to mea-sure the performance of the models Confusion matrix is a table to describe theperformance of the classification model on the test data set which the true valuesare known Here we only show the binary confusion matrix since it is easier to un-derstand (Table 51) In the table the row represents the predicated class and thecolumn represents the actual class TP means the instances align to the positive classFP means the true class is 0 and the predicated class is 1 FN is opposite to FP whichthe true class is 1 and the predicated class is 0 TN means the instances align tothe negative class Accuracy is to measure how often is the classifier correct and itsformula is accuracy = TP + TNtotal While recall is to measure the percentage oftotal relevant results correctly classified Its formula is TPTP + FN As for multi-categories classification the confusion matrix is more complex but it has the sameidea

Actual Positive (1) Actual Negative (0)Predicated Positive (1) TP FPPredicated Negative (0) FN TN

Table 51 Table for the confusion matrix

32 Methodology

Figure 51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 use CHQs data tofine-tune BERT model Step 2 use CHQs data to evaluate the model by using k-foldcross-validation and record each accuracy Step 3 use fine-tuned BERT model togenerate label for unanswered question and potential similar question Step 4 pairquestion pairs with generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracy with the previousone As for NLI it has the same process except the input of testing is question-answer

pairs

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task the null hypothesis is that the generated datacannot improve the accuracy The alternative hypothesis is the generated data canimprove the accuracy Before we get a p-value the t-test needs to be calculated Herewe calculate the unpaired t-test since we use different data set to do testing in eachiteration Below is the formula of unpaired t-test

Unpaired t-test =x1 minus x2radic

s2(1n1

+1n2

)

where

s2 =sumn1

i=1(xi minus x1)2 + sumn2

j=1(xj minus x2)2

n1 + n2 minus 2

From Table 62 we can see only the results of TF-IDF and BM25 are less than 005In this case we can reject the null hypothesis and accept the alternative hypothesiswhere the generated data can increase the accuracy Since we mimic the way ofpseudo-relevance feedback to evaluate the generated data these p-values can showthat the generated question pairs have high accuracy However Word2Vec is higherthan 005 We can explain this from two perspectives

bull From scoring methods perspective The pre-trained embedding model whichwe use in Word2Vec is created by training on PubMed The source in PubMedcomprises more than 30 million references and abstracts on biomedical topicsfrom MEDLINE These documents are written by experts so the words are all

33

34 Discussion on Results

formal with fewer typos However the questions in Health24 are asked bynormal people colloquial words and typos cannot be avoided There exists amismatch between the words in the pre-trained model and the words in the realquestions Thus some sentence vectors will lack important information due tothe reason that the words in the real questions cannot find their correspondentvectors in the pre-trained model

bull From data perspective due to the inaccurate word embedding method we findthe majority label of RQE model generated is 0 In other words most of thequestion pairs are not similar In order to make balanced data only a small partof them are selected So that is the reason why the added data for Word2Vec isthe smallest Furthermore the small size added data might not help the modelto fine-tune very well compared with the added data with large size

On the other hand recall is to measure the percentage of total relevant resultscorrectly classified We find the average recall of the original data (CHQs) is 09785After adding extra data extracted by Word2Vec method the average recall does notchange very much It is 09786 This value can reflect that the modelrsquos ability doesnot decrease because of the added data Namely the added data do not have muchnoise

CHQs CHQs +TF-IDF CHQs + BM25 CHQs + Word2VecIteration 1 09682 09709 09693 09710Iteration 2 09655 09731 09708 09699Iteration 3 09676 09742 09747 09710Iteration 4 09682 09692 09709 09726Iteration 5 09731 09775 09743 09715Average 09685 09720 09746 09712

Table 61 Shows the accuracy in different methods for RQE task

added data number p-vlaueTF-IDF 8084 00380BM25 7390 00499Word2Vec 1830 00514

Table 62 Shows the added data number and p-values for RQE task

Table 64 shows added data number and p-values of NLI task We find all valuesare larger than 005 so we cannot reject the null hypothesis We think the size of theadded data is the majority problem since the largest number is only 1287 Howeverin RQE task the added data is more than 8000 There are two condition as we selectadded data for NLI task

bull First we assume only the answers of answered questions in the true questionpairs of RQE task can have some probabilities that can be inferred from their

35

relevant unanswered questions If the two questions are not similar these prob-abilities are very low so we do not consider this case

bull Second the added data must be balanced Otherwise they will bias the modelIn this case we use the smallest number with the class in the result of testingpart as the base to select for added data For example the results show thereare 10 pairs for label 0 5 pairs for label 1 and 15 pairs for label 2 We take 5as the number of each class to generate added data in NLI task On the otherhand NLI task is three-label classification problem Compared with the binaryclassification problem each class for NLI does not have many data

Those are the reasons why the added data for NLI task are smallTable 65 demonstrates the standard deviations of the loss over four methods

When combining Table 64 with this we find the more added data are the largervalue is We think the reason is different kinds of data are combined for trainingand they bias the modelrsquos ability More precisely the added data are question-answerpairs but the original MedNLI data are sentence pairs These two kinds of data havedifferent formats Also the number of extra data is not large (the largest one is onlyabout 1287) So as we add the extra data into the original data it is equally to addnovel data or noises for the model The estimatorrsquos ability would be affected andcause high variance

On the other hand due to the limited hardware we only split the data into 5 partsin k-fold cross-validation so we get five accuracy for each method However thesmall sample size cannot accurately represent the p-value because the more samplesare the more accurate p-value is

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 63 Shows the accuracy in different methods for NLI task

added data number p-vlaueTF-IDF 1 287 03340BM25 1149 04191Word2Vec 459 06661

Table 64 Shows the added data number and p-values for NLI task

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 41: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

30 Methodology

one is inverse document frequency which calculates how much information the wordprovides If the result is close to 0 that means it is a common word and vice versaThese two concepts forms TF-IDF here is the formula

TFminus IDF(t d D) = TF(t d)times IDF(t D)

TF(t d) = log(1 + f req(t d))

IDF(t D) =N

count(d isin D t isin d)

where t is the word we want to calculate d is its document D is the sets of doc-uments and N is the total number of the documents TF-IDF can only be used tomeasure one word if we want to calculate the similarity between two sentences wecan use cosine similarity

similarity(A B) =A middot B|A| times |B|

where A and B are both vectors In this case each element in vector A and Brepresent one wordrsquos TF-IDF

542 Best Matching 25 (BM25)

Okapi BM25 (Best Match 25) [Robertson et al., 1995] is another way to find similar documents for a particular given document or query. Here is the formula:

$$\mathrm{BM25}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{d_{avg}}\right)}$$

where

• $D$ is the document and $Q$ is the query; $q_i$ is the $i$-th term in the query $Q$;

• $f(q_i, D)$ is the number of times $q_i$ occurs in the document $D$;

• $|D|$ is the length of the document;

• $d_{avg}$ is the average length of a document;

• $k_1$ and $b$ are both hyper-parameters; in general, $k_1 = 2$ and $b = 0.75$.
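To make the formula concrete, here is a minimal sketch of a BM25 scorer over tokenised documents. This is illustrative rather than the thesis's bm25.py, and the IDF used is the common Okapi variant, since the formula above does not pin one down:

import math

def bm25(query, doc, corpus, k1=2.0, b=0.75):
    """Score `doc` against `query`; all inputs are lists of tokens,
    and `corpus` is the list of all tokenised documents."""
    n_docs = len(corpus)
    d_avg = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for q in set(query):
        n_q = sum(1 for d in corpus if q in d)                 # documents containing q
        if n_q == 0:
            continue
        idf = math.log(1 + (n_docs - n_q + 0.5) / (n_q + 0.5))  # Okapi-style IDF
        f = doc.count(q)                                        # f(q_i, D)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / d_avg))
    return score

corpus = [["migraine", "headache", "relief"],
          ["blood", "pressure", "exercise"]]
print(bm25(["migraine", "treatment"], corpus[0], corpus))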

5.4.3 Word2Vec

We introduced Word2Vec in Section 2.3.1; it is a form of static word embedding. In the medical domain, there has been work to provide high-quality domain-specific embeddings. Here we use the pre-trained embeddings created by Chiu et al. [2016] to convert the words into numeric representations. In this module, 2.2 million 200-dimensional word vectors are trained on PubMed (2.89 billion tokens) [Tilbury, 2018]. For each sentence, we add the word embedding vectors together and then use cosine similarity to calculate the similarity between the sentences.
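A minimal sketch of this step with gensim is shown below; the embedding file name is a hypothetical placeholder for the Chiu et al. [2016] PubMed vectors:

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to the pre-trained 200-dimensional PubMed vectors.
vectors = KeyedVectors.load_word2vec_format("pubmed_w2v_200d.bin", binary=True)

def sentence_vector(sentence):
    # Sum the vectors of in-vocabulary words; words missing from the
    # pre-trained vocabulary (typos, colloquialisms) are skipped, which
    # is exactly the information loss discussed in Chapter 6.
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(vectors.vector_size)
    return np.sum([vectors[w] for w in words], axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

print(cosine(sentence_vector("chest pain after exercise"),
             sentence_vector("pain in my chest when I run")))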

5.5 Evaluation Metrics

For evaluation measures, we use accuracy, recall, and the confusion matrix to measure the performance of the models. A confusion matrix is a table describing the performance of a classification model on a test data set for which the true values are known. Here we only show the binary confusion matrix, since it is easier to understand (Table 5.1). In the table, the rows represent the predicted class and the columns represent the actual class. TP means the instances align to the positive class. FP means the true class is 0 and the predicted class is 1. FN is the opposite of FP: the true class is 1 and the predicted class is 0. TN means the instances align to the negative class. Accuracy measures how often the classifier is correct; its formula is accuracy = (TP + TN)/total. Recall measures the percentage of total relevant results correctly classified; its formula is TP/(TP + FN). For multi-class classification, the confusion matrix is more complex, but it follows the same idea.

                          Actual Positive (1)    Actual Negative (0)
Predicted Positive (1)    TP                     FP
Predicted Negative (0)    FN                     TN

Table 5.1: The binary confusion matrix.
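As a quick worked example of the two formulas (the counts below are made up):

def accuracy_and_recall(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # accuracy = (TP + TN) / total
    recall = tp / (tp + fn)                      # recall = TP / (TP + FN)
    return accuracy, recall

# 90 TP, 5 FP, 10 FN, 95 TN -> accuracy 0.925, recall 0.9
print(accuracy_and_recall(90, 5, 10, 95))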


Figure 5.1: The diagram shows how the RQE and NLI tasks work. The upper part is RQE and the bottom part is NLI. For the RQE task: Step 1, use the CHQs data to fine-tune the BERT model. Step 2, use the CHQs data to evaluate the model by k-fold cross-validation and record each accuracy. Step 3, use the fine-tuned BERT model to generate a label for each (unanswered question, potential similar question) pair. Step 4, pair the question pairs with the generated labels. Step 5, add the labelled question pairs to the CHQs data, evaluate the model again, and compare the accuracy with the previous one. NLI follows the same process, except that the test input is question-answer pairs.

Chapter 6

Discussion on Results

Tables 6.1 and 6.3 show the results for the RQE and NLI tasks respectively. For each task, we implement three different similarity score methods to compare their results. From these tables, we can see that the average accuracy does not change significantly after adding the extra data, compared with the result of the original data. In order to validate our approach, we calculate the p-value. This method helps us determine the significance of the results in relation to the null hypothesis. If the p-value is less than 0.05, the null hypothesis is rejected and the alternative hypothesis is accepted, since the value indicates strong evidence against the null hypothesis. The null hypothesis states that there is no relationship between two variables, and the alternative hypothesis states that the two variables have some relationship.

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we get a p-value, the t-statistic needs to be calculated. Here we use the unpaired t-test, since we use a different data set for testing in each iteration. Below is the formula of the unpaired t-test:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}$$
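For illustration, the same pooled-variance unpaired t-test is available as scipy.stats.ttest_ind. The sketch below feeds it the five per-iteration accuracies transcribed from Table 6.1; since the thesis does not state whether its p-values are one- or two-sided, both are shown:

from scipy import stats

chqs       = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]   # Table 6.1, CHQs column
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]   # Table 6.1, CHQs + TF-IDF column

# Pooled-variance (equal_var=True) independent t-test, as in the formula above.
t, p_two_sided = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
print(t, p_two_sided, p_one_sided)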

From Table 6.2, we can see that only the results of TF-IDF and BM25 are less than 0.05. In these cases, we can reject the null hypothesis and accept the alternative hypothesis that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values show that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring-method perspective: the pre-trained embedding model we use for Word2Vec is created by training on PubMed. The sources in PubMed comprise more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are formal, with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus, some sentence vectors will lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model.

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are not similar. In order to keep the data balanced, only a small part of them is selected. That is why the added data for Word2Vec is the smallest. Furthermore, a small amount of added data might not help the model fine-tune as well as a large amount would.

On the other hand, recall measures the percentage of total relevant results correctly classified. We find the average recall of the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This value reflects that the model's ability does not decrease because of the added data; namely, the added data do not carry much noise.

             CHQs      CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1  0.9682    0.9709          0.9693        0.9710
Iteration 2  0.9655    0.9731          0.9708        0.9699
Iteration 3  0.9676    0.9742          0.9747        0.9710
Iteration 4  0.9682    0.9692          0.9709        0.9726
Iteration 5  0.9731    0.9775          0.9743        0.9715
Average      0.9685    0.9720          0.9746        0.9712

Table 6.1: Accuracy of the different methods on the RQE task.

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added data and the p-values for the RQE task.

Table 6.4 shows the number of added data and the p-values for the NLI task. All values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1,287, whereas in the RQE task the added data number more than 8,000. There are two conditions when we select added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs of the RQE task have some probability of being inferred from their relevant unanswered questions. If the two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise, they will bias the model. In this case, we use the smallest class count in the test results as the base for selecting added data (a short sketch of this rule follows the list). For example, if the results show 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number of each class when generating added data for the NLI task. On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.
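A small pandas sketch of this balancing rule, using the hypothetical 10/5/15 label counts from the example above:

import pandas as pd

# Hypothetical generated NLI pairs with 10 / 5 / 15 examples of labels 0 / 1 / 2.
pairs = pd.DataFrame({"label": [0] * 10 + [1] * 5 + [2] * 15})

n_min = pairs["label"].value_counts().min()   # smallest class count: 5
balanced = (pairs.groupby("label", group_keys=False)
                 .apply(lambda g: g.sample(n=n_min, random_state=0)))
print(balanced["label"].value_counts())       # 5 pairs per label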

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss over the four methods. Combining it with Table 6.4, we find that the more data are added, the larger the value is. We think the reason is that different kinds of data are combined for training, and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1,287). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability would be affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we get five accuracies for each method. However, this small sample size cannot accurately support the p-value, because the more samples there are, the more accurate the p-value is.

             MedNLI    MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1  0.7527    0.7601            0.7562          0.7453
Iteration 2  0.7522    0.7426            0.7426          0.7515
Iteration 3  0.7504    0.7469            0.7554          0.7540
Iteration 4  0.7402    0.7593            0.7536          0.7430
Iteration 5  0.7640    0.7657            0.7586          0.7569
Average      0.7523    0.7550            0.7533          0.7501

Table 6.3: Accuracy of the different methods on the NLI task.

           Added data number   p-value
TF-IDF     1,287               0.3340
BM25       1,149               0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added data and the p-values for the NLI task.


                     Standard deviation of loss
MedNLI               0.01392
MedNLI + TF-IDF      0.02166
MedNLI + BM25        0.01726
MedNLI + Word2Vec    0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter, I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in open domains. Extending them into the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set that can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set, there are 23,771 questions with answers and 1,950 questions without answers. Our goal is to find answers from the existing answers for these 1,950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task; this step helps us roughly remove some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback: we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.
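One plausible reading of the (0.6, 0.4) rejection rule, sketched with numpy (the exact per-task implementation lives in the notebooks):

import numpy as np

def label_or_reject(probs, threshold=0.6):
    """Accept the argmax label only if the model's top probability
    reaches the threshold; otherwise reject the pair."""
    probs = np.asarray(probs)
    if probs.max() < threshold:
        return None                # rejected: not added to the constructed data
    return int(probs.argmax())

print(label_or_reject([0.70, 0.30]))   # 0 (confident enough)
print(label_or_reject([0.55, 0.45]))   # None (below 0.6, rejected)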

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions without answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it is easy to overfit with a small data set. So for the next step, we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section, we noted that the added data for the NLI task differ from the original NLI data. For the next step, we can find more suitable data of a type similar to ours.

• As mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done in Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. Also, we can increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Josht, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy[1].

raw_nli_data and raw_rqe_data store the source data; result stores the labels generated by the RQE and NLI test models.

For py files:

bm25.py, tfidf.py, and word_embedding.py are the three files that filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf[2].

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface[3].

load_file.py converts the raw source data to dataframes; pre_processed_data pre-processes the sentences.

For ipynb files:

[1] Scrapy: https://scrapy.org

[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py

[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab[4].

[4] Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data; the .ipynb files are used for fine-tuning, evaluating, and generating labels. The .ipynb files run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to data frames (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the models:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. Here we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the generated labels for the pairs; they can also be found in the result folder.

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 42: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

sect55 Evaluation Metrics 31

2018] For each sentence we add each word embedding vector together and thenuse cosine similarity to calculate the similarity between the sentences

55 Evaluation Metrics

For evaluation measure we use the accuracy recall and confusion matrix to mea-sure the performance of the models Confusion matrix is a table to describe theperformance of the classification model on the test data set which the true valuesare known Here we only show the binary confusion matrix since it is easier to un-derstand (Table 51) In the table the row represents the predicated class and thecolumn represents the actual class TP means the instances align to the positive classFP means the true class is 0 and the predicated class is 1 FN is opposite to FP whichthe true class is 1 and the predicated class is 0 TN means the instances align tothe negative class Accuracy is to measure how often is the classifier correct and itsformula is accuracy = TP + TNtotal While recall is to measure the percentage oftotal relevant results correctly classified Its formula is TPTP + FN As for multi-categories classification the confusion matrix is more complex but it has the sameidea

Actual Positive (1) Actual Negative (0)Predicated Positive (1) TP FPPredicated Negative (0) FN TN

Table 51 Table for the confusion matrix

32 Methodology

Figure 51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 use CHQs data tofine-tune BERT model Step 2 use CHQs data to evaluate the model by using k-foldcross-validation and record each accuracy Step 3 use fine-tuned BERT model togenerate label for unanswered question and potential similar question Step 4 pairquestion pairs with generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracy with the previousone As for NLI it has the same process except the input of testing is question-answer

pairs

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task the null hypothesis is that the generated datacannot improve the accuracy The alternative hypothesis is the generated data canimprove the accuracy Before we get a p-value the t-test needs to be calculated Herewe calculate the unpaired t-test since we use different data set to do testing in eachiteration Below is the formula of unpaired t-test

Unpaired t-test =x1 minus x2radic

s2(1n1

+1n2

)

where

s2 =sumn1

i=1(xi minus x1)2 + sumn2

j=1(xj minus x2)2

n1 + n2 minus 2

From Table 62 we can see only the results of TF-IDF and BM25 are less than 005In this case we can reject the null hypothesis and accept the alternative hypothesiswhere the generated data can increase the accuracy Since we mimic the way ofpseudo-relevance feedback to evaluate the generated data these p-values can showthat the generated question pairs have high accuracy However Word2Vec is higherthan 005 We can explain this from two perspectives

bull From scoring methods perspective The pre-trained embedding model whichwe use in Word2Vec is created by training on PubMed The source in PubMedcomprises more than 30 million references and abstracts on biomedical topicsfrom MEDLINE These documents are written by experts so the words are all

33

34 Discussion on Results

formal with fewer typos However the questions in Health24 are asked bynormal people colloquial words and typos cannot be avoided There exists amismatch between the words in the pre-trained model and the words in the realquestions Thus some sentence vectors will lack important information due tothe reason that the words in the real questions cannot find their correspondentvectors in the pre-trained model

bull From data perspective due to the inaccurate word embedding method we findthe majority label of RQE model generated is 0 In other words most of thequestion pairs are not similar In order to make balanced data only a small partof them are selected So that is the reason why the added data for Word2Vec isthe smallest Furthermore the small size added data might not help the modelto fine-tune very well compared with the added data with large size

On the other hand recall is to measure the percentage of total relevant resultscorrectly classified We find the average recall of the original data (CHQs) is 09785After adding extra data extracted by Word2Vec method the average recall does notchange very much It is 09786 This value can reflect that the modelrsquos ability doesnot decrease because of the added data Namely the added data do not have muchnoise

CHQs CHQs +TF-IDF CHQs + BM25 CHQs + Word2VecIteration 1 09682 09709 09693 09710Iteration 2 09655 09731 09708 09699Iteration 3 09676 09742 09747 09710Iteration 4 09682 09692 09709 09726Iteration 5 09731 09775 09743 09715Average 09685 09720 09746 09712

Table 61 Shows the accuracy in different methods for RQE task

added data number p-vlaueTF-IDF 8084 00380BM25 7390 00499Word2Vec 1830 00514

Table 62 Shows the added data number and p-values for RQE task

Table 64 shows added data number and p-values of NLI task We find all valuesare larger than 005 so we cannot reject the null hypothesis We think the size of theadded data is the majority problem since the largest number is only 1287 Howeverin RQE task the added data is more than 8000 There are two condition as we selectadded data for NLI task

bull First we assume only the answers of answered questions in the true questionpairs of RQE task can have some probabilities that can be inferred from their

35

relevant unanswered questions If the two questions are not similar these prob-abilities are very low so we do not consider this case

bull Second the added data must be balanced Otherwise they will bias the modelIn this case we use the smallest number with the class in the result of testingpart as the base to select for added data For example the results show thereare 10 pairs for label 0 5 pairs for label 1 and 15 pairs for label 2 We take 5as the number of each class to generate added data in NLI task On the otherhand NLI task is three-label classification problem Compared with the binaryclassification problem each class for NLI does not have many data

Those are the reasons why the added data for NLI task are smallTable 65 demonstrates the standard deviations of the loss over four methods

When combining Table 64 with this we find the more added data are the largervalue is We think the reason is different kinds of data are combined for trainingand they bias the modelrsquos ability More precisely the added data are question-answerpairs but the original MedNLI data are sentence pairs These two kinds of data havedifferent formats Also the number of extra data is not large (the largest one is onlyabout 1287) So as we add the extra data into the original data it is equally to addnovel data or noises for the model The estimatorrsquos ability would be affected andcause high variance

On the other hand due to the limited hardware we only split the data into 5 partsin k-fold cross-validation so we get five accuracy for each method However thesmall sample size cannot accurately represent the p-value because the more samplesare the more accurate p-value is

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 63 Shows the accuracy in different methods for NLI task

added data number p-vlaueTF-IDF 1 287 03340BM25 1149 04191Word2Vec 459 06661

Table 64 Shows the added data number and p-values for NLI task

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 43: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

32 Methodology

Figure 51 The diagram shows the process of how RQE and NLI tasks work Theupper one is RQE and the bottom one is NLI For RQE task step 1 use CHQs data tofine-tune BERT model Step 2 use CHQs data to evaluate the model by using k-foldcross-validation and record each accuracy Step 3 use fine-tuned BERT model togenerate label for unanswered question and potential similar question Step 4 pairquestion pairs with generated labels Step 5 add the question pairs with labels toCHQs and evaluate the model again Then comparing the accuracy with the previousone As for NLI it has the same process except the input of testing is question-answer

pairs

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task, the null hypothesis is that the generated data cannot improve the accuracy; the alternative hypothesis is that the generated data can improve the accuracy. Before we obtain a p-value, a t-statistic needs to be calculated. Here we use the unpaired t-test, since a different data set is used for testing in each iteration. The formula of the unpaired t-test is

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where

$$s^2 = \frac{\sum_{i=1}^{n_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_j - \bar{x}_2)^2}{n_1 + n_2 - 2}.$$
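For illustration, here is a minimal sketch of this test using scipy, with the five cross-validation accuracies of CHQs and CHQs + TF-IDF from Table 6.1. Note that scipy reports a two-sided p-value, so the exact numbers in Table 6.2 may have been computed with a slightly different variant of the test.

```python
# Sketch: unpaired (pooled-variance) t-test over the five per-fold
# accuracies of the original data versus the augmented data.
from scipy import stats

chqs = [0.9682, 0.9655, 0.9676, 0.9682, 0.9731]        # CHQs (Table 6.1)
chqs_tfidf = [0.9709, 0.9731, 0.9742, 0.9692, 0.9775]  # CHQs + TF-IDF

# equal_var=True selects the pooled-variance test shown above.
t_stat, p_value = stats.ttest_ind(chqs_tfidf, chqs, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```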

From Table 6.2 we can see that only the p-values of TF-IDF and BM25 are less than 0.05. In these cases we can reject the null hypothesis and accept the alternative hypothesis, namely that the generated data can increase the accuracy. Since we mimic pseudo-relevance feedback to evaluate the generated data, these p-values indicate that the generated question pairs have high accuracy. However, the p-value for Word2Vec is higher than 0.05. We can explain this from two perspectives:

• From the scoring method perspective: the pre-trained embedding model which we use for Word2Vec is trained on PubMed. PubMed comprises more than 30 million references and abstracts on biomedical topics from MEDLINE. These documents are written by experts, so the words are all formal with few typos. However, the questions on Health24 are asked by ordinary people, so colloquial words and typos cannot be avoided. There is a mismatch between the words in the pre-trained model and the words in the real questions. Thus some sentence vectors lack important information, because words in the real questions cannot find their corresponding vectors in the pre-trained model (a sketch of this behaviour follows this list).

• From the data perspective: due to the inaccurate word embeddings, we find that the majority label generated by the RQE model is 0; in other words, most of the question pairs are classified as not similar. In order to keep the data balanced, only a small part of them are selected. That is why the added data for Word2Vec are the smallest. Furthermore, a small amount of added data might not help fine-tune the model as well as a large amount would.
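To make the first point concrete, below is a minimal sketch of how such sentence vectors are typically built by averaging pre-trained word vectors; the model path is a placeholder and the project's word_embedding.py may differ. Any colloquial word or typo absent from the PubMed-trained vocabulary is silently skipped, so the sentence vector loses that information.

```python
# Sketch: average-of-word-vectors sentence embedding; out-of-vocabulary
# words (colloquialisms, typos) are dropped and contribute nothing.
import numpy as np
from gensim.models import KeyedVectors

# Placeholder path for a PubMed-trained Word2Vec model.
kv = KeyedVectors.load_word2vec_format("pubmed_w2v.bin", binary=True)

def sentence_vector(sentence):
    tokens = sentence.lower().split()
    vectors = [kv[w] for w in tokens if w in kv]  # skip OOV words
    if not vectors:                               # nothing matched at all
        return np.zeros(kv.vector_size)
    return np.mean(vectors, axis=0)

# A colloquial word like "tummy" or a misspelling like "diabetis"
# is likely OOV for a model trained on expert-written abstracts.
vec = sentence_vector("My tummy hurts after meals")
```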

On the other hand, recall measures the percentage of all relevant results that are correctly classified. We find the average recall on the original data (CHQs) is 0.9785. After adding the extra data extracted by the Word2Vec method, the average recall barely changes: it is 0.9786. This reflects that the model's ability does not decrease because of the added data; namely, the added data do not introduce much noise.
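For reference, such a recall figure can be computed as follows (a sketch with scikit-learn; the labels below are made up for illustration):

```python
# Sketch: recall = true positives / (true positives + false negatives).
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 1, 0, 1]  # gold RQE labels in a test fold
y_pred = [1, 1, 0, 1, 0, 0]  # predictions of the fine-tuned model
print(recall_score(y_true, y_pred))  # 3 of 4 positives found -> 0.75
```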

              CHQs     CHQs + TF-IDF   CHQs + BM25   CHQs + Word2Vec
Iteration 1   0.9682   0.9709          0.9693        0.9710
Iteration 2   0.9655   0.9731          0.9708        0.9699
Iteration 3   0.9676   0.9742          0.9747        0.9710
Iteration 4   0.9682   0.9692          0.9709        0.9726
Iteration 5   0.9731   0.9775          0.9743        0.9715
Average       0.9685   0.9720          0.9746        0.9712

Table 6.1: The accuracy of the different methods for the RQE task.

           Added data number   p-value
TF-IDF     8084                0.0380
BM25       7390                0.0499
Word2Vec   1830                0.0514

Table 6.2: The number of added data points and the p-values for the RQE task.

Table 6.4 shows the number of added data points and the p-values for the NLI task. We find all values are larger than 0.05, so we cannot reject the null hypothesis. We think the size of the added data is the main problem, since the largest number is only 1287, whereas in the RQE task more than 8000 pairs are added. There are two conditions on how we select the added data for the NLI task:

• First, we assume that only the answers of the answered questions in the true question pairs from the RQE task have some probability of being inferred from their corresponding unanswered questions. If two questions are not similar, this probability is very low, so we do not consider that case.

• Second, the added data must be balanced; otherwise they will bias the model. In this case we use the smallest per-class count in the test results as the base for selecting the added data. For example, if the results contain 10 pairs for label 0, 5 pairs for label 1, and 15 pairs for label 2, we take 5 as the number of pairs per class when generating the added data for the NLI task (a sketch of this balancing step follows the list). On the other hand, the NLI task is a three-label classification problem; compared with a binary classification problem, each class for NLI does not have much data.
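A minimal sketch of this balancing step, assuming the generated pairs are held in a pandas data frame (the column names are illustrative, not the project's exact code):

```python
# Sketch: downsample every class to the size of the smallest class,
# e.g. 10/5/15 pairs for labels 0/1/2 become 5/5/5.
import pandas as pd

pairs = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4", "q5", "q6"],
    "answer":   ["a1", "a2", "a3", "a4", "a5", "a6"],
    "label":    [0, 0, 0, 1, 2, 2],
})

smallest = pairs["label"].value_counts().min()
balanced = (pairs.groupby("label", group_keys=False)
                 .apply(lambda g: g.sample(n=smallest, random_state=0)))
print(balanced["label"].value_counts())  # every class now has `smallest`
```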

Those are the reasons why the added data for the NLI task are small.

Table 6.5 shows the standard deviations of the loss for the four methods. Combining it with Table 6.4, we find that the more data are added, the larger this value is. We think the reason is that different kinds of data are combined for training and they bias the model's ability. More precisely, the added data are question-answer pairs, but the original MedNLI data are sentence pairs; these two kinds of data have different formats. Also, the amount of extra data is not large (the largest is only about 1287 pairs). So when we add the extra data to the original data, it is equivalent to adding novel data, or noise, for the model. The estimator's ability is affected, causing high variance.

On the other hand, due to the limited hardware, we only split the data into 5 parts in k-fold cross-validation, so we obtain five accuracy values for each method. However, such a small sample size cannot accurately represent the p-value, because the more samples there are, the more accurate the p-value is.
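For reference, the 5-fold split that yields the five accuracies per method can be sketched as follows (using scikit-learn's KFold; the notebook's actual split may differ):

```python
# Sketch: 5-fold cross-validation indices; each fold contributes one
# of the five accuracy values reported in Tables 6.1 and 6.3.
import numpy as np
from sklearn.model_selection import KFold

examples = np.arange(100)  # stand-in for the question pairs
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(examples), start=1):
    # fine-tune on examples[train_idx], evaluate on examples[test_idx]
    print(fold, len(train_idx), len(test_idx))
```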

              MedNLI   MedNLI + TF-IDF   MedNLI + BM25   MedNLI + Word2Vec
Iteration 1   0.7527   0.7601            0.7562          0.7453
Iteration 2   0.7522   0.7426            0.7426          0.7515
Iteration 3   0.7504   0.7469            0.7554          0.7540
Iteration 4   0.7402   0.7593            0.7536          0.7430
Iteration 5   0.7640   0.7657            0.7586          0.7569
Average       0.7523   0.7550            0.7533          0.7501

Table 6.3: The accuracy of the different methods for the NLI task.

           Added data number   p-value
TF-IDF     1287                0.3340
BM25       1149                0.4191
Word2Vec   459                 0.6661

Table 6.4: The number of added data points and the p-values for the NLI task.


                    Standard deviation of loss
MedNLI              0.01392
MedNLI + TF-IDF     0.02166
MedNLI + BM25       0.01726
MedNLI + Word2Vec   0.01351

Table 6.5: The standard deviation of the loss for the four methods.

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise some ideas that future work could pursue to improve it.

7.1 Conclusion

Question answering techniques have been widely investigated in the open domain. Extending them to the biomedical domain, such as patient question answering, faces many challenges, such as the limited number of annotated data, the quality of the existing data, and the content of the data. To address some of these issues and construct a data set which can be used for data-driven approaches in this domain, we propose a novel way to crowdsource data by using RQE and NLI. The data set we use is crawled from Health24. Within this data set there are 23771 questions with answers and 1950 questions without answers. Our goal is to find answers from the existing answers for these 1950 questions. Basically, we use RQE to locate similar questions to the unanswered questions. Intuitively, to find whether two questions are similar, we can pair one unanswered question with all the answered questions. However, this generates a massive number of question pairs. To reduce the running time and increase efficiency, we apply three similarity score methods before the RQE task; this step roughly removes some dissimilar question pairs. Then we use NLI to validate whether the similar questions' answers can be inferred from their corresponding unanswered questions.
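As an illustration of the pre-filtering step, the sketch below keeps only the top-k answered questions for each unanswered question under BM25. It uses the rank_bm25 package for brevity; the project's own bm25.py may be implemented differently.

```python
# Sketch: BM25 pre-filter that pairs each unanswered question with its
# k most similar answered questions instead of all 23771 of them.
from rank_bm25 import BM25Okapi

answered = ["what causes high blood pressure",
            "how can i lower my blood sugar",
            "is paracetamol safe during pregnancy"]
unanswered = ["my blood pressure is high what should i do"]

bm25 = BM25Okapi([q.split() for q in answered])

k = 2
for uq in unanswered:
    scores = bm25.get_scores(uq.split())
    top_k = sorted(range(len(answered)), key=lambda i: -scores[i])[:k]
    candidate_pairs = [(uq, answered[i]) for i in top_k]  # fed to RQE
    print(candidate_pairs)
```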

As for the procedure, we first fine-tune the pre-trained BERT model using the official data sets (CHQs and MedNLI). Then we use the fine-tuned models to generate labels for the RQE and NLI tasks. Because we do not have ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback. More specifically, we insert the constructed data into the official data and train the model again: if the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the outputs of the fine-tuned RQE and NLI models to reduce the misclassification rate; if the maximal probability is lower than the threshold, the case is rejected. Through the experiments we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task the method is more limited because of the small size of the added data; however, it still plays a role.
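A minimal sketch of this rejection rule, assuming softmax class probabilities and reading the thresholds in the order given above (0.6 for the binary RQE model, 0.4 for the three-class NLI model):

```python
# Sketch: reject a generated label when the model's maximal class
# probability is below the task-specific threshold.
import numpy as np

def accept(probs, threshold):
    probs = np.asarray(probs)
    label = int(np.argmax(probs))
    return label if probs[label] >= threshold else None  # None = rejected

print(accept([0.55, 0.45], threshold=0.6))        # RQE pair -> rejected
print(accept([0.15, 0.25, 0.60], threshold=0.4))  # NLI pair -> label 2
```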

7.2 Future Work

Because of the limited time for this project, there are many places that need improvement:

• For this project we only have 1950 questions that do not have answers. Based on the two conditions applied in the NLI task, the amount of added data becomes even smaller. Besides, BERT is a complex model: it easily overfits when trained on a small data set. So for the next step we can expand our data; then it will be more persuasive to validate our idea.

• In the previous section we argued that the added data for the NLI task differ from the original NLI data. For the next step we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step we can do hyperparameter optimization. Also, we can increase k in the k-fold cross-validation.

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136, (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206, (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669, (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732, (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987, (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication SP, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752, (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681, (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652, (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1 Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2612 54509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2 Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:
data: includes the original data files and the data files generated by using RQE and NLI.
health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder the framework is provided by Scrapy.¹
raw_nli_data and raw_rqe_data: store the source data.
result: stores the labels generated by using the RQE and NLI test models.

For .py files:
bm25.py, tfidf.py and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.²

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes for fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py store the feature function and processors used when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.³

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.

For .ipynb files:

¹ Scrapy: https://scrapy.org
² ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
³ run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former is to fine-tune the BERT model and do k-fold cross-validation; the latter is to generate labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab.⁴

⁴ Colab: https://colab.research.google.com

Appendix

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels; they run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.
2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 44: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Chapter 6

Discussion on Results

Table 61 and 63 show the results for RQE and NLI tasks respectively For eachtask we implement three different similarity score methods to compare their resultsFrom these tables we can see that the average accuracy does not change significantlyafter adding extra data compared with the result of the original data In order tovalidate our approach we calculate the p-value This method can help us to deter-mine the significance of the results in relation to the null hypothesis If p-value is lessthan 005 the null hypothesis will be rejected and the alternative hypothesis will beaccepted since the value indicates strong evidence against the null hypothesis Thenull hypothesis states that there is no relationship between two variables and thealternative hypothesis states two variables have some relationships

In the context of the RQE task the null hypothesis is that the generated datacannot improve the accuracy The alternative hypothesis is the generated data canimprove the accuracy Before we get a p-value the t-test needs to be calculated Herewe calculate the unpaired t-test since we use different data set to do testing in eachiteration Below is the formula of unpaired t-test

Unpaired t-test =x1 minus x2radic

s2(1n1

+1n2

)

where

s2 =sumn1

i=1(xi minus x1)2 + sumn2

j=1(xj minus x2)2

n1 + n2 minus 2

From Table 62 we can see only the results of TF-IDF and BM25 are less than 005In this case we can reject the null hypothesis and accept the alternative hypothesiswhere the generated data can increase the accuracy Since we mimic the way ofpseudo-relevance feedback to evaluate the generated data these p-values can showthat the generated question pairs have high accuracy However Word2Vec is higherthan 005 We can explain this from two perspectives

bull From scoring methods perspective The pre-trained embedding model whichwe use in Word2Vec is created by training on PubMed The source in PubMedcomprises more than 30 million references and abstracts on biomedical topicsfrom MEDLINE These documents are written by experts so the words are all

33

34 Discussion on Results

formal with fewer typos However the questions in Health24 are asked bynormal people colloquial words and typos cannot be avoided There exists amismatch between the words in the pre-trained model and the words in the realquestions Thus some sentence vectors will lack important information due tothe reason that the words in the real questions cannot find their correspondentvectors in the pre-trained model

bull From data perspective due to the inaccurate word embedding method we findthe majority label of RQE model generated is 0 In other words most of thequestion pairs are not similar In order to make balanced data only a small partof them are selected So that is the reason why the added data for Word2Vec isthe smallest Furthermore the small size added data might not help the modelto fine-tune very well compared with the added data with large size

On the other hand recall is to measure the percentage of total relevant resultscorrectly classified We find the average recall of the original data (CHQs) is 09785After adding extra data extracted by Word2Vec method the average recall does notchange very much It is 09786 This value can reflect that the modelrsquos ability doesnot decrease because of the added data Namely the added data do not have muchnoise

CHQs CHQs +TF-IDF CHQs + BM25 CHQs + Word2VecIteration 1 09682 09709 09693 09710Iteration 2 09655 09731 09708 09699Iteration 3 09676 09742 09747 09710Iteration 4 09682 09692 09709 09726Iteration 5 09731 09775 09743 09715Average 09685 09720 09746 09712

Table 61 Shows the accuracy in different methods for RQE task

added data number p-vlaueTF-IDF 8084 00380BM25 7390 00499Word2Vec 1830 00514

Table 62 Shows the added data number and p-values for RQE task

Table 64 shows added data number and p-values of NLI task We find all valuesare larger than 005 so we cannot reject the null hypothesis We think the size of theadded data is the majority problem since the largest number is only 1287 Howeverin RQE task the added data is more than 8000 There are two condition as we selectadded data for NLI task

bull First we assume only the answers of answered questions in the true questionpairs of RQE task can have some probabilities that can be inferred from their

35

relevant unanswered questions If the two questions are not similar these prob-abilities are very low so we do not consider this case

bull Second the added data must be balanced Otherwise they will bias the modelIn this case we use the smallest number with the class in the result of testingpart as the base to select for added data For example the results show thereare 10 pairs for label 0 5 pairs for label 1 and 15 pairs for label 2 We take 5as the number of each class to generate added data in NLI task On the otherhand NLI task is three-label classification problem Compared with the binaryclassification problem each class for NLI does not have many data

Those are the reasons why the added data for NLI task are smallTable 65 demonstrates the standard deviations of the loss over four methods

When combining Table 64 with this we find the more added data are the largervalue is We think the reason is different kinds of data are combined for trainingand they bias the modelrsquos ability More precisely the added data are question-answerpairs but the original MedNLI data are sentence pairs These two kinds of data havedifferent formats Also the number of extra data is not large (the largest one is onlyabout 1287) So as we add the extra data into the original data it is equally to addnovel data or noises for the model The estimatorrsquos ability would be affected andcause high variance

On the other hand due to the limited hardware we only split the data into 5 partsin k-fold cross-validation so we get five accuracy for each method However thesmall sample size cannot accurately represent the p-value because the more samplesare the more accurate p-value is

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 63 Shows the accuracy in different methods for NLI task

added data number p-vlaueTF-IDF 1 287 03340BM25 1149 04191Word2Vec 459 06661

Table 64 Shows the added data number and p-values for NLI task

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 45: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

34 Discussion on Results

formal with fewer typos However the questions in Health24 are asked bynormal people colloquial words and typos cannot be avoided There exists amismatch between the words in the pre-trained model and the words in the realquestions Thus some sentence vectors will lack important information due tothe reason that the words in the real questions cannot find their correspondentvectors in the pre-trained model

bull From data perspective due to the inaccurate word embedding method we findthe majority label of RQE model generated is 0 In other words most of thequestion pairs are not similar In order to make balanced data only a small partof them are selected So that is the reason why the added data for Word2Vec isthe smallest Furthermore the small size added data might not help the modelto fine-tune very well compared with the added data with large size

On the other hand recall is to measure the percentage of total relevant resultscorrectly classified We find the average recall of the original data (CHQs) is 09785After adding extra data extracted by Word2Vec method the average recall does notchange very much It is 09786 This value can reflect that the modelrsquos ability doesnot decrease because of the added data Namely the added data do not have muchnoise

CHQs CHQs +TF-IDF CHQs + BM25 CHQs + Word2VecIteration 1 09682 09709 09693 09710Iteration 2 09655 09731 09708 09699Iteration 3 09676 09742 09747 09710Iteration 4 09682 09692 09709 09726Iteration 5 09731 09775 09743 09715Average 09685 09720 09746 09712

Table 61 Shows the accuracy in different methods for RQE task

added data number p-vlaueTF-IDF 8084 00380BM25 7390 00499Word2Vec 1830 00514

Table 62 Shows the added data number and p-values for RQE task

Table 64 shows added data number and p-values of NLI task We find all valuesare larger than 005 so we cannot reject the null hypothesis We think the size of theadded data is the majority problem since the largest number is only 1287 Howeverin RQE task the added data is more than 8000 There are two condition as we selectadded data for NLI task

bull First we assume only the answers of answered questions in the true questionpairs of RQE task can have some probabilities that can be inferred from their

35

relevant unanswered questions If the two questions are not similar these prob-abilities are very low so we do not consider this case

bull Second the added data must be balanced Otherwise they will bias the modelIn this case we use the smallest number with the class in the result of testingpart as the base to select for added data For example the results show thereare 10 pairs for label 0 5 pairs for label 1 and 15 pairs for label 2 We take 5as the number of each class to generate added data in NLI task On the otherhand NLI task is three-label classification problem Compared with the binaryclassification problem each class for NLI does not have many data

Those are the reasons why the added data for NLI task are smallTable 65 demonstrates the standard deviations of the loss over four methods

When combining Table 64 with this we find the more added data are the largervalue is We think the reason is different kinds of data are combined for trainingand they bias the modelrsquos ability More precisely the added data are question-answerpairs but the original MedNLI data are sentence pairs These two kinds of data havedifferent formats Also the number of extra data is not large (the largest one is onlyabout 1287) So as we add the extra data into the original data it is equally to addnovel data or noises for the model The estimatorrsquos ability would be affected andcause high variance

On the other hand due to the limited hardware we only split the data into 5 partsin k-fold cross-validation so we get five accuracy for each method However thesmall sample size cannot accurately represent the p-value because the more samplesare the more accurate p-value is

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 63 Shows the accuracy in different methods for NLI task

added data number p-vlaueTF-IDF 1 287 03340BM25 1149 04191Word2Vec 459 06661

Table 64 Shows the added data number and p-values for NLI task

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback instead. More specifically, we insert the constructed data into the official data and train the model again: if the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6, 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data; however, the added data still play a role.
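The rejection rule can be sketched as follows; the function name and the use of PyTorch tensors are assumptions for illustration, with 0.6 and 0.4 standing for the RQE and NLI thresholds respectively.

# A minimal sketch of threshold-based rejection on a classifier's output logits.
import torch

def label_or_reject(logits, threshold=0.6):
    probs = torch.softmax(logits, dim=-1)
    confidence, label = probs.max(dim=-1)
    if confidence.item() < threshold:
        return None  # rejected: the model is too uncertain, so the pair is discarded
    return int(label.item())

print(label_or_reject(torch.tensor([2.1, 0.3])))  # confident -> label 0
print(label_or_reject(torch.tensor([0.5, 0.6])))  # max probability < 0.6 -> None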

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, and it easily overfits on a small data set. So, for the next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task are different in kind from the original NLI data. For the next step, we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way we use Word2Vec.

• Fine-tuning and generating labels are done by using Colab. Due to the limited hardware and limited time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. We can also increase k in k-fold cross-validation, as in the sketch after this list.
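As a sketch of that last point, increasing k only means changing n_splits in the cross-validation loop; train_and_eval below is a placeholder for the fine-tuning routine and is assumed, not part of the artefact.

# A minimal sketch of k-fold cross-validation with a larger k (e.g. 10),
# assuming scikit-learn; each fold returns one accuracy sample.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(examples, train_and_eval, k=10):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    for train_idx, test_idx in kf.split(examples):
        train = [examples[i] for i in train_idx]
        test = [examples[i] for i in test_idx]
        accuracies.append(train_and_eval(train, test))
    # More folds give more accuracy samples, making significance tests more reliable.
    return float(np.mean(accuracies)), float(np.std(accuracies))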

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19 and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20 and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6 and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)


Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27 and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-R.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7 and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)


Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10 and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11 and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14 and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20 and 28)


Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8 and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix 1: Project Description and Contract

From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi, Please use this email as the confirmation for your project examiner. Thanks. Best, Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang, Thanks for your email. I am happy to be the examiner for these students. Cheers, Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu, Can I invite you as the project examiner for the two projects? Thanks.
- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data
Best, Zhenchang

--
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix 2: Artefact Description

The project consists of three parts: the folders which store the source data and the generated data, the .py files, and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 is the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py, and health.py are the files to crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy [1].

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py, and word_embedding.py are three files to filter the similar question pairs. The results are most_similar_bm25.csv, most_similar_tfidf.csv, and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf [2].

rqe_data_generator.py and nli_data_generator.py are the files to generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and the processors when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface [3].

load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.

For .ipynb files:

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py


Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

.py files run in PyCharm; .ipynb files run in Google Colab [4].

[4] Colab: https://colab.research.google.com

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating, and generating labels; they run in Colab.

Preparation:

pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files (this may take a long time):
python load_file.py

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune the model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py, and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files contain the generated labels for the pairs. They can also be found in the result folder.

Page 46: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

35

relevant unanswered questions If the two questions are not similar these prob-abilities are very low so we do not consider this case

bull Second the added data must be balanced Otherwise they will bias the modelIn this case we use the smallest number with the class in the result of testingpart as the base to select for added data For example the results show thereare 10 pairs for label 0 5 pairs for label 1 and 15 pairs for label 2 We take 5as the number of each class to generate added data in NLI task On the otherhand NLI task is three-label classification problem Compared with the binaryclassification problem each class for NLI does not have many data

Those are the reasons why the added data for NLI task are smallTable 65 demonstrates the standard deviations of the loss over four methods

When combining Table 64 with this we find the more added data are the largervalue is We think the reason is different kinds of data are combined for trainingand they bias the modelrsquos ability More precisely the added data are question-answerpairs but the original MedNLI data are sentence pairs These two kinds of data havedifferent formats Also the number of extra data is not large (the largest one is onlyabout 1287) So as we add the extra data into the original data it is equally to addnovel data or noises for the model The estimatorrsquos ability would be affected andcause high variance

On the other hand due to the limited hardware we only split the data into 5 partsin k-fold cross-validation so we get five accuracy for each method However thesmall sample size cannot accurately represent the p-value because the more samplesare the more accurate p-value is

MedNLI MedNLI +TF-IDF MedNLI + BM25 MedNLI + Word2VecIteration 1 07527 07601 07562 07453Iteration 2 07522 07426 07426 07515Iteration 3 07504 07469 07554 07540Iteration 4 07402 07593 07536 07430Iteration 5 07640 07657 07586 07569Average 07523 07550 07533 07501

Table 63 Shows the accuracy in different methods for NLI task

added data number p-vlaueTF-IDF 1 287 03340BM25 1149 04191Word2Vec 459 06661

Table 64 Shows the added data number and p-values for NLI task

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 47: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

36 Discussion on Results

Standard deviation of lossMedNLI 001392MedNLI+TF-IDF 002166MedNLI+BM25 001726MedNLI+word2vec 001351

Table 65 Shows the standard derivation of loss for four methods

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure we first fine-tune the pre-trained BERT model by usingthe official data sets (CHQs and MedNLI) Then we use the fine-tuned model togenerate labels for RQE and NLI tasks Due to the reason that we do not have theground truth to evaluate the accuracy of the constructed data we use the idea ofpseudo-relevance feedback to evaluate More specifically we insert the constructeddata into the official data to train the model again If the accuracy increases it canreflect that the constructed data have high accuracy Otherwise they do not have highaccuracy We also set thresholds (06 04) in the output of fine-tuned RQE and NLImodels to reduce the misclassification rate If the maximal probability is lower thanthe threshold this case will be rejected Through implementing the experiments wefind the proposed method can construct high-quality question pairs for BM25 and

37

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 48: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Chapter 7

Conclusion and Future Work

In this chapter I conclude my project and raise up some ideas which the future workcan do to improve this project

71 Conclusion

Question answering techniques have been widely investigated in the open domainsExtending these open domains into the biomedical domain such as patient questionanswering has many challenges such as the limited number of annotated data thequality of the existing data and the content of the data To address some of theseissues and construct data set which can be used in the data-driven approaches in thisdomain we propose a novel way to crowdsource data by using RQE and NLI Thedata set we use is crawled from Health 24 Within this data set there are 23771 ques-tions with answers and 1950 questions without answers Our goal is to find answersfrom existing answers for these 1950 questions Basically we use RQE to locate sim-ilar questions to the unanswered questions Intuitively to find if two questions aresimilar we can pair one unanswered question with all the answered questions How-ever this can generate a massive amount of question pairs To reduce the runningtime and increase efficiency we use three similarity score methods before applyingRQE task This step can help us to remove some dissimilar question pairs roughlyThen we use NLI to validate if the similar questionsrsquo answers can be inferred fromtheir corresponding no-answered questions

As for the procedure, we first fine-tune the pre-trained BERT model by using the official data sets (CHQs and MedNLI). Then we use the fine-tuned model to generate labels for the RQE and NLI tasks. Because we do not have the ground truth to evaluate the accuracy of the constructed data, we use the idea of pseudo-relevance feedback for evaluation. More specifically, we insert the constructed data into the official data and train the model again. If the accuracy increases, it reflects that the constructed data have high accuracy; otherwise, they do not. We also set thresholds (0.6 and 0.4) on the output of the fine-tuned RQE and NLI models to reduce the misclassification rate: if the maximal probability is lower than the threshold, the case is rejected. Through the experiments, we find that the proposed method can construct high-quality question pairs with the BM25 and TF-IDF methods in the RQE task. For the NLI task, the method is more limited because of the small size of the added data. However, the constructed data still play a role in the NLI task.
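To make the rejection rule concrete, a minimal sketch of thresholding the model's output is given below. It assumes the fine-tuned classifier returns one logit per class; the function names, example logits, and the assignment of 0.6 to RQE (and 0.4 to NLI) are illustrative assumptions based on the description above:

# A minimal sketch of threshold-based rejection; logits are illustrative.
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

RQE_THRESHOLD = 0.6   # 0.4 is assumed to be the NLI threshold

def accept_label(logits, threshold):
    probs = softmax(np.asarray(logits, dtype=float))
    label = int(np.argmax(probs))
    # Reject the pair when the model's confidence is below the threshold.
    return label, bool(probs[label] >= threshold)

print(accept_label([2.1, 0.3], RQE_THRESHOLD))   # high confidence: accepted
print(accept_label([0.6, 0.5], RQE_THRESHOLD))   # near-uniform output: rejected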

7.2 Future Work

Because of the limited time of the project, there are many places that need improvement:

• For this project, we only have 1,950 questions that do not have answers. Based on the two conditions applied in the NLI task, the number of added data becomes even smaller. Besides, BERT is a complex model, so it easily overfits on a small data set. For the next step, we can expand our data; it will then be more persuasive to validate our idea.

• In the previous section, we argued that the added data for the NLI task differ from the original NLI data. For the next step, we can find a more suitable data set whose type is similar to ours.

• As we mentioned in Chapter 6, the pre-trained Word2Vec embeddings have some limitations. For the next step, we can optimize the way Word2Vec is used.

• Fine-tuning and generating labels are done using Colab. Due to the limited hardware and time, we did not find the best hyperparameters for these models. For the next step, we can do hyperparameter optimization. We can also increase k in k-fold cross-validation, as sketched below.
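For that last item, a minimal sketch of a larger-k split is given below; the use of scikit-learn's KFold and the placeholder data are assumptions for illustration, not the project's own training loop:

# Illustrative k-fold split with a larger k; `pairs` and `labels`
# stand in for the real question-pair data.
from sklearn.model_selection import KFold

pairs = [f"pair-{i}" for i in range(20)]   # placeholder question pairs
labels = [i % 2 for i in range(20)]        # placeholder entailment labels

kf = KFold(n_splits=10, shuffle=True, random_state=42)  # larger k than before
for fold, (train_idx, val_idx) in enumerate(kf.split(pairs)):
    # In the real pipeline, fine-tune on train_idx and validate on val_idx.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} validation")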

Bibliography

Abacha, A. B. and Demner-Fushman, D., 2016. Recognizing question entailment for medical question answering. In AMIA Annual Symposium Proceedings, vol. 2016, 310. American Medical Informatics Association. (cited on pages 16, 19, and 28)

Abacha, A. B.; Shivade, C.; and Demner-Fushman, D., 2019. Overview of the MEDIQA 2019 shared task on textual inference, question entailment and question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task, 370–379. (cited on pages 19, 20, and 28)

Bach, S. H.; Rodriguez, D.; Liu, Y.; Luo, C.; Shao, H.; Xia, C.; Sen, S.; Ratner, A.; Hancock, B.; Alborzi, H.; et al., 2019. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In Proceedings of the 2019 International Conference on Management of Data, 362–375. (cited on page 20)

Beltagy, I.; Lo, K.; and Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3606–3611. (cited on page 28)

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C., 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3, Feb (2003), 1137–1155. (cited on pages ix, 5, 6, and 10)

Bhaskar, S. A.; Rungta, R.; Route, J.; Nyberg, E.; and Mitamura, T., 2019. Sieg at MEDIQA 2019: Multi-task neural ensemble for biomedical inference and entailment. In Proceedings of the 18th BioNLP Workshop and Shared Task, 462–470. (cited on pages 19 and 20)

Bishop, C. M., 2006. Pattern Recognition and Machine Learning. Springer. (cited on page 29)

Cao, G.; Nie, J.-Y.; Gao, J.; and Robertson, S., 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 243–250. (cited on page 29)

Chien, T. and Kalita, J., 2020. Adversarial analysis of natural language inference systems. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), 1–8. IEEE. (cited on page 2)

Chiu, B.; Crichton, G.; Korhonen, A.; and Pyysalo, S., 2016. How to train good word embeddings for biomedical NLP. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 166–174. (cited on page 30)

Cocco, A. M.; Zordan, R.; Taylor, D. M.; Weiland, T. J.; Dilley, S. J.; Kant, J.; Dombagolla, M.; Hendarto, A.; Lai, F.; and Hutton, J., 2018. Dr Google in the ED: searching for online health information by adult emergency department patients. Medical Journal of Australia, 209, 8 (2018), 342–347. (cited on page 1)

Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018). (cited on pages 2, 15, 27, and 28)

Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H.-W., 2019. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, 13042–13054. (cited on page 13)

Erhan, D.; Bengio, Y.; Courville, A.; Manzagol, P.-A.; Vincent, P.; and Bengio, S., 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, Feb (2010), 625–660. (cited on pages 2 and 13)

Graves, A.; Mohamed, A.-r.; and Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 6645–6649. IEEE. (cited on pages ix, 7, and 8)

Gururangan, S.; Swayamdipta, S.; Levy, O.; Schwartz, R.; Bowman, S. R.; and Smith, N. A., 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324 (2018). (cited on page 1)

Joshi, P. A step-by-step NLP guide to learn ELMo for extracting features from text. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text. (cited on pages ix and 14)

Kumar, V. B.; Srinivasan, A.; Chaudhary, A.; Route, J.; Mitamura, T.; and Nyberg, E., 2019. Dr.Quad at MEDIQA 2019: Towards textual inference and question entailment using contextualized representations. arXiv preprint arXiv:1907.10136 (2019). (cited on page 20)

Lende, S. P. and Raghuwanshi, M., 2016. Question answering system on education acts using NLP techniques. In 2016 World Conference on Futuristic Trends in Research and Innovation for Social Welfare (Startup Conclave), 1–6. IEEE. (cited on page 1)

Luong, M.-T.; Sutskever, I.; Le, Q. V.; Vinyals, O.; and Zaremba, W., 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206 (2014). (cited on page 2)

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013). (cited on pages ix, 10, and 11)

Neumann, M.; King, D.; Beltagy, I.; and Ammar, W., 2019. ScispaCy: Fast and robust models for biomedical natural language processing. arXiv preprint arXiv:1902.07669 (2019). (cited on page 27)

Nguyen, V., 2019. Question answering in the biomedical domain. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 54–63. (cited on pages 1 and 2)

Nguyen, V.; Karimi, S.; and Xing, Z., 2019. Investigating the effect of lexical segmentation in transformer-based models on medical datasets. In Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association, 165–171. (cited on page 2)

Pampari, A.; Raghavan, P.; Liang, J.; and Peng, J., 2018. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018). (cited on page 1)

Pennington, J.; Socher, R.; and Manning, C. D., 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. (cited on pages ix, 11, and 12)

Peters, M.; Ruder, S.; and Smith, N. A., 2019. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987 (2019). (cited on page 15)

Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L., 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018). (cited on pages 13, 14, and 15)

Rajaraman, A. and Ullman, J. D., 2011. Mining of Massive Datasets. Cambridge University Press. (cited on page 29)

Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016). (cited on page 1)

Robertson, S. E.; Walker, S.; Jones, S.; Hancock-Beaulieu, M. M.; Gatford, M.; et al., 1995. Okapi at TREC-3. NIST Special Publication Sp, 109 (1995), 109. (cited on page 30)

Romanov, A. and Shivade, C., 2018. Lessons from natural language inference in the clinical domain. arXiv preprint arXiv:1808.06752 (2018). (cited on pages 1, 17, 20, and 28)

Shen, S.; Li, Y.; Du, N.; Wu, X.; Xie, Y.; Ge, S.; Yang, T.; Wang, K.; Liang, X.; and Fan, W., 2018. On the generation of medical question-answer pairs. arXiv preprint arXiv:1811.00681 (2018). (cited on page 21)

Tilbury, K., 2018. Word embeddings for domain specific semantic relatedness. (2018). (cited on page 30)

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I., 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008. (cited on pages ix, 8, and 9)

Wang, B. and Kuo, C.-C. J., 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models. arXiv preprint arXiv:2002.06652 (2020). (cited on page 13)

Wu, Z.; Song, Y.; Huang, S.; Tian, Y.; and Xia, F., 2019. WTMED at MEDIQA 2019: A hybrid approach to biomedical natural language inference. In Proceedings of the 18th BioNLP Workshop and Shared Task, 415–426. (cited on page 20)

Zhu, W.; Zhou, X.; Wang, K.; Luo, X.; Li, X.; Ni, Y.; and Xie, G., 2019. PANLP at MEDIQA 2019: Pre-trained language models, transfer learning and knowledge distillation. In Proceedings of the 18th BioNLP Workshop and Shared Task, 380–388. (cited on page 19)

Appendices


Appendix

Appendix 1: Project Description and Contract


From: Zhenchang Xing <zhenchang.xing@anu.edu.au>
Subject: FW: project examiner
Date: 29 July 2019 at 9:23:06 am AEST
To: Xi Chen <u6013686@anu.edu.au>, Jingjing Shi <u5792801@anu.edu.au>, Yuchao Zhang <u6715243@anu.edu.au>

Hi,

Please use this email as the confirmation for your project examiner. Thanks.

Best,
Zhenchang

From: Yu Lin [mailto:yu.lin@anu.edu.au]
Sent: Sunday, 28 July 2019 1:40 PM
To: Zhenchang Xing
Subject: Re: project examiner

Hi Zhenchang,

Thanks for your email. I am happy to be the examiner for these students.

Cheers,
Yu

On Sun, Jul 28, 2019 at 10:09 AM Zhenchang Xing <zhenchang.xing@anu.edu.au> wrote:
Hi Yu,

Can I invite you as the project examiner for the two projects? Thanks.

- Xi Chen: Deep Learning Based User Interface Search
- Yuchao Zhang: Multi-Task Learning for Across Domain Question Answering
- Jingjing Shi: Question Entailment for Crowdsourcing More Data

Best,
Zhenchang

--
---------------------------------------------------------------
Yu Lin
Room 425, Building 145
Research School of Computer Science
The Australian National University
Office: +61 2 6125 4509
Email: yu.lin@anu.edu.au

Appendix

Appendix 2: Artefact Description

The project consists of three parts: the folders, which store the source data and the generated data; the .py files; and the .ipynb files.

For folders:

data: includes the original data files and the data files generated by using RQE and NLI.

health24: Figure 1 shows the tree structure of the health24 folder. item.py, middleware.py, pipelines.py, settings.py and health.py are the files that crawl data from the Health24 website; result.csv stores the result. In this folder, the framework is provided by Scrapy.[1]

raw_nli_data and raw_rqe_data: store the source data.

result: stores the labels generated by using the RQE and NLI test models.

For .py files:

bm25.py, tfidf.py and word_embedding.py are three files that filter the similar question pairs (see the sketch after this paragraph). The results are most_similar_bm25.csv, most_similar_tfidf.csv and most_similar_word_em.csv. tfidf.py is modified based on ChineseSimilarity-gensim-tfidf.[2]
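As a brief illustration of what this filtering step does (not the project's own implementation; the rank_bm25 package and the toy sentences are assumptions introduced here for the sketch), BM25 candidate selection could look like:

# Illustrative BM25 filtering sketch; assumes `pip install rank-bm25`,
# which is not among this project's listed dependencies.
from rank_bm25 import BM25Okapi

answered = [
    "What causes chronic headaches?",
    "How can I treat a sprained ankle at home?",
    "Is daily aspirin safe for adults?",
]
tokenized = [q.lower().split() for q in answered]
bm25 = BM25Okapi(tokenized)

query = "Why do I get headaches every day?".lower().split()
# Keep the highest-scoring answered questions as candidate pairs for RQE.
print(bm25.get_top_n(query, answered, n=2))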

rqe_data_generator.py and nli_data_generator.py are the files that generate suitable dataframes when fine-tuning and testing the BERT model.

convert_examples_to_features.py and tool.py are the files that store the feature function and the processors when fine-tuning and testing the BERT model. These codes are from run_classifier.py in huggingface.[3]

load_file.py converts the raw source data to dataframes. pre_processed_data pre-processes the sentences.

For .ipynb files:

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and does k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. These codes are modified based on run_classifier.py in huggingface.

Figure 1: The tree structure of the health24 folder. (figure not reproduced here)

.py files run in PyCharm; .ipynb files run in Google Colab.[4]

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py
[4] Colab: https://colab.research.google.com

Appendix

Appendix 3: Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used to fine-tune, evaluate and generate labels; they run in Colab.

Preparation:

pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz
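As a quick sanity check that the scispacy model above installed correctly (an illustrative snippet, not part of the project's own scripts; the example sentence is made up):

# Load the biomedical NER model installed by the last pip command above.
import spacy

nlp = spacy.load("en_ner_bionlp13cg_md")
doc = nlp("Patients with diabetes often report persistent headaches.")
print([(ent.text, ent.label_) for ent in doc.ents])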

Run health24:
cd health24
scrapy crawl health

Convert the raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use Word2Vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to data frames (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.


2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. If you want to fine-tune the RQE model, run the second last cell; if you want to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate the original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. The label.csv files are the results of the generated labels for the pairs. They can also be found in the result folder.

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 49: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

38 Conclusion and Future Work

TF-IDF methods in RQE task For NLI task the method is more limited in NLI taskbecause of small size of added data However they still play a role in NLI task

72 Future Word

Because of the limited time of the project there are many places need to improve

bull For this project we only have 1950 questions that do not have answers Basedon the two conditions applied in NLI task the number of the added data be-comes smaller Besides BERT is a complex model It is easy to cause overfittingby using the small size of data So for the next step we can expand our dataThen it will be more persuasive to validate our idea

bull In the previous section we think the added data for NLI task is different fromthe original NLI data For the next step we can find a more suitable data witha similar type of ours

bull As we mentioned in Chapter 6 the pre-trained Word2Vec embeddings havesome limitations For the next step we can optimize the way of using Word2Vec

bull Fine-tuning and generating labels are done by using CoLab Due to the limitedhardware and limited time we do not find the best hyperparameters for thesemodels For the next step we can do hyperparameter optimization Also wecan increase k in k-fold cross validation

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 50: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Bibliography

Abacha A B and Demner-Fushman D 2016 Recognizing question entailmentfor medical question answering In AMIA Annual Symposium Proceedings vol 2016310 American Medical Informatics Association (cited on pages 16 19 and 28)

Abacha A B Shivade C and Demner-Fushman D 2019 Overview of themediqa 2019 shared task on textual inference question entailment and questionanswering In Proceedings of the 18th BioNLP Workshop and Shared Task 370ndash379(cited on pages 19 20 and 28)

Bach S H Rodriguez D Liu Y Luo C Shao H Xia C Sen S RatnerA Hancock B Alborzi H et al 2019 Snorkel drybell A case study in de-ploying weak supervision at industrial scale In Proceedings of the 2019 InternationalConference on Management of Data 362ndash375 (cited on page 20)

Beltagy I Lo K and Cohan A 2019 Scibert A pretrained language model forscientific text In Proceedings of the 2019 Conference on Empirical Methods in NaturalLanguage Processing and the 9th International Joint Conference on Natural LanguageProcessing (EMNLP-IJCNLP) 3606ndash3611 (cited on page 28)

Bengio Y Ducharme R Vincent P and Jauvin C 2003 A neural probabilisticlanguage model Journal of machine learning research 3 Feb (2003) 1137ndash1155 (citedon pages ix 5 6 and 10)

Bhaskar S A Rungta R Route J Nyberg E and Mitamura T 2019 Sieg atmediqa 2019 Multi-task neural ensemble for biomedical inference and entailmentIn Proceedings of the 18th BioNLP Workshop and Shared Task 462ndash470 (cited on pages19 and 20)

Bishop C M 2006 Pattern recognition and machine learning springer (cited on page29)

Cao G Nie J-Y Gao J and Robertson S 2008 Selecting good expansion termsfor pseudo-relevance feedback In Proceedings of the 31st annual international ACMSIGIR conference on Research and development in information retrieval 243ndash250 (citedon page 29)

Chien T and Kalita J 2020 Adversarial analysis of natural language inferencesystems In 2020 IEEE 14th International Conference on Semantic Computing (ICSC)1ndash8 IEEE (cited on page 2)

39

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 51: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

40 BIBLIOGRAPHY

Chiu B Crichton G Korhonen A and Pyysalo S 2016 How to train goodword embeddings for biomedical nlp In Proceedings of the 15th workshop on biomed-ical natural language processing 166ndash174 (cited on page 30)

Cocco A M Zordan R Taylor D M Weiland T J Dilley S J Kant JDombagolla M Hendarto A Lai F and Hutton J 2018 Dr google inthe ed searching for online health information by adult emergency departmentpatients Medical Journal of Australia 209 8 (2018) 342ndash347 (cited on page 1)

Devlin J Chang M-W Lee K and Toutanova K 2018 Bert Pre-trainingof deep bidirectional transformers for language understanding arXiv preprintarXiv181004805 (2018) (cited on pages 2 15 27 and 28)

Dong L Yang N Wang W Wei F Liu X Wang Y Gao J Zhou M and

Hon H-W 2019 Unified language model pre-training for natural language un-derstanding and generation In Advances in Neural Information Processing Systems13042ndash13054 (cited on page 13)

Erhan D Bengio Y Courville A Manzagol P-A Vincent P and Bengio S2010 Why does unsupervised pre-training help deep learning Journal of MachineLearning Research 11 Feb (2010) 625ndash660 (cited on pages 2 and 13)

Graves A Mohamed A-r and Hinton G 2013 Speech recognition with deeprecurrent neural networks In 2013 IEEE international conference on acoustics speechand signal processing 6645ndash6649 IEEE (cited on pages ix 7 and 8)

Gururangan S Swayamdipta S Levy O Schwartz R Bowman S R and

Smith N A 2018 Annotation artifacts in natural language inference data arXivpreprint arXiv180302324 (2018) (cited on page 1)

Josht P A step-by-step nlp guide to learn elmo for extractingfeatures from text httpswwwanalyticsvidhyacomblog201903learn-to-use-elmo-to-extract-features-from-text (cited on pages ix and 14)

Kumar V B Srinivasan A Chaudhary A Route J Mitamura T and Ny-berg E 2019 Dr quad at mediqa 2019 Towards textual inference and questionentailment using contextualized representations arXiv preprint arXiv190710136(2019) (cited on page 20)

Lende S P and Raghuwanshi M 2016 Question answering system on educationacts using nlp techniques In 2016 World Conference on Futuristic Trends in Researchand Innovation for Social Welfare (Startup Conclave) 1ndash6 IEEE (cited on page 1)

Luong M-T Sutskever I Le Q V Vinyals O and Zaremba W 2014 Ad-dressing the rare word problem in neural machine translation arXiv preprintarXiv14108206 (2014) (cited on page 2)

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_features.py and tool.py are the files that store the feature function and processors used when fine-tuning and testing the BERT model. This code is from run_classifier.py in huggingface [3].
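As a rough illustration of what such a feature function produces, the sketch below tokenizes one sentence pair with the huggingface transformers library. The model name and the example pair are assumptions for this sketch; the real convert_examples_to_features.py differs in detail.

    # Hedged sketch of turning a sentence pair into BERT input features.
    from transformers import BertTokenizer

    # SciBERT tokenizer, matching the pre-trained model used in the project.
    tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

    premise = "I have had a headache for three days."       # placeholder pair
    hypothesis = "Patient reports a persistent headache."

    features = tokenizer(
        premise, hypothesis,
        max_length=128, padding="max_length", truncation=True,
        return_tensors="pt",
    )
    # input_ids, token_type_ids and attention_mask feed the BERT classifier.
    print(features["input_ids"].shape)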

load_file.py converts the raw source data to a dataframe. pre_processed_data.py pre-processes the sentences.
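For orientation, a small sketch of the kind of sentence pre-processing this step performs, using nltk (which the readme installs). The exact steps (lowercasing, tokenization, stopword removal) are assumptions about pre_processed_data.py, not its actual code.

    # Hedged pre-processing sketch: lowercase, tokenize, drop stopwords.
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)
    nltk.download("stopwords", quiet=True)

    def preprocess(sentence):
        tokens = word_tokenize(sentence.lower())
        stops = set(stopwords.words("english"))
        return [t for t in tokens if t.isalpha() and t not in stops]

    print(preprocess("I have had a terrible headache for 3 days."))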

For .ipynb files:

[1] Scrapy: https://scrapy.org
[2] ChineseSimilarity-gensim-tfidf: https://github.com/yip522364642/ChineseSimilarity-gensim-tfidf/blob/master/ChineseSimilarityCalculation.py
[3] run_classifier.py: https://github.com/huggingface/transformers/blob/7cc35c31040d8bdfcadc274c087d6a73c2036210/examples/run_classifier.py



Figure 1: The tree structure of the health24 folder.

train_bert.ipynb and test.ipynb: the former fine-tunes the BERT model and performs k-fold cross-validation; the latter generates labels for the test sets by using the fine-tuned model. This code is modified based on run_classifier.py in huggingface.
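For orientation only, the schematic below shows the shape of the k-fold loop that train_bert.ipynb performs, assuming a scikit-learn KFold split. The majority-class baseline is a deliberate placeholder for the notebook's actual BERT fine-tuning and evaluation, and all names are assumptions.

    # Schematic k-fold loop (assumed structure; the majority-class baseline
    # stands in for BERT fine-tuning and evaluation).
    import numpy as np
    from sklearn.model_selection import KFold

    labels = np.random.randint(0, 2, size=40)   # toy labels for 40 question pairs
    kfold = KFold(n_splits=5, shuffle=True, random_state=42)

    scores = []
    for fold, (train_idx, val_idx) in enumerate(kfold.split(labels)):
        # Placeholder "model": predict the majority class of the training split.
        majority = np.bincount(labels[train_idx]).argmax()
        acc = float((labels[val_idx] == majority).mean())
        scores.append(acc)
        print(f"fold {fold}: accuracy {acc:.3f}")
    print("mean accuracy:", np.mean(scores))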

.py files run in PyCharm; .ipynb files run in Google Colab [4].

[4] Colab: https://colab.research.google.com

Appendix 3 Readme File

This project contains two kinds of files: .py files and .ipynb files. The .py files are used to process the data. The .ipynb files are used for fine-tuning, evaluating and generating labels, and run in Colab.

Preparation:
pip install jsonlines
pip install scrapy
pip install scispacy
pip install tqdm
pip install gensim
pip install pandas
pip install numpy
pip install nltk
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_ner_bionlp13cg_md-0.2.4.tar.gz

Run health24:
cd health24
scrapy crawl health

Convert raw RQE data and NLI data to csv files:
python load_file.py (this may take a long time)

Use TF-IDF to find similar question pairs:
python tfidf.py

Use BM25 to find similar question pairs:
python bm25.py

Use word2vec to find similar question pairs:
python word_embedding.py

Convert the generated similar question pairs to a data frame (this part may take a long time):
python rqe_convert_to_df.py

Fine-tune model:
1. Unzip the pre-trained model scibert_scivocab_uncased.zip and upload the folder (scibert_scivocab_uncased) to Google Drive.



2. Upload fine-tune.ipynb to Colab.
3. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to fine-tune.ipynb.
4. Run each cell in fine-tune.ipynb.
5. To fine-tune the RQE model, run the second-last cell; to fine-tune the NLI model, run the last cell. At this time we run both. The fine-tuned models will be uploaded to Google Drive.

Evaluate original RQE and NLI data:
1. Upload evaluate_bert.ipynb to Colab.
2. Zip the data folder, then upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb.
3. Follow the instructions in Colab to evaluate the original RQE and NLI data.

To run evaluate_bert.ipynb and test.ipynb:
1. Upload evaluate_bert.ipynb and test.ipynb to Colab.
2. Zip the data folder; upload data.zip, convert_examples_to_features.py and tool.py to evaluate_bert.ipynb and test.ipynb.
3. Run all the cells.
4. label.csv files are the results of the generated labels for pairs. They can also be found in the result folder.

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 52: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

BIBLIOGRAPHY 41

Mikolov T Chen K Corrado G and Dean J 2013 Efficient estimation ofword representations in vector space arXiv preprint arXiv13013781 (2013) (citedon pages ix 10 and 11)

Neumann M King D Beltagy I and Ammar W 2019 Scispacy Fastand robust models for biomedical natural language processing arXiv preprintarXiv190207669 (2019) (cited on page 27)

Nguyen V 2019 Question answering in the biomedical domain In Proceedings of the57th Annual Meeting of the Association for Computational Linguistics Student ResearchWorkshop 54ndash63 (cited on pages 1 and 2)

Nguyen V Karimi S and Xing Z 2019 Investigating the effect of lexical segmen-tation in transformer-based models on medical datasets In Proceedings of the The17th Annual Workshop of the Australasian Language Technology Association 165ndash171(cited on page 2)

Pampari A Raghavan P Liang J and Peng J 2018 emrqa A large corpus forquestion answering on electronic medical records arXiv preprint arXiv180900732(2018) (cited on page 1)

Pennington J Socher R and Manning C D 2014 Glove Global vectors forword representation In Proceedings of the 2014 conference on empirical methods innatural language processing (EMNLP) 1532ndash1543 (cited on pages ix 11 and 12)

Peters M Ruder S and Smith N A 2019 To tune or not to tune adaptingpretrained representations to diverse tasks arXiv preprint arXiv190305987 (2019)(cited on page 15)

Peters M E Neumann M Iyyer M Gardner M Clark C Lee K and

Zettlemoyer L 2018 Deep contextualized word representations arXiv preprintarXiv180205365 (2018) (cited on pages 13 14 and 15)

Rajaraman A and Ullman J D 2011 Mining of massive datasets CambridgeUniversity Press (cited on page 29)

Rajpurkar P Zhang J Lopyrev K and Liang P 2016 Squad 100000+ ques-tions for machine comprehension of text arXiv preprint arXiv160605250 (2016)(cited on page 1)

Robertson S E Walker S Jones S Hancock-Beaulieu M M Gatford Met al 1995 Okapi at trec-3 Nist Special Publication Sp 109 (1995) 109 (cited onpage 30)

Romanov A and Shivade C 2018 Lessons from natural language inference in theclinical domain arXiv preprint arXiv180806752 (2018) (cited on pages 1 17 20and 28)

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 53: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

42 BIBLIOGRAPHY

Shen S Li Y Du N Wu X Xie Y Ge S Yang T Wang K Liang X and

Fan W 2018 On the generation of medical question-answer pairs arXiv preprintarXiv181100681 (2018) (cited on page 21)

Tilbury K 2018 Word embeddings for domain specific semantic relatedness (2018)(cited on page 30)

Vaswani A Shazeer N Parmar N Uszkoreit J Jones L Gomez A NKaiser Ł and Polosukhin I 2017 Attention is all you need In Advances inneural information processing systems 5998ndash6008 (cited on pages ix 8 and 9)

Wang B and Kuo C-C J 2020 Sbert-wk A sentence embedding method bydissecting bert-based word models arXiv preprint arXiv200206652 (2020) (citedon page 13)

Wu Z Song Y Huang S Tian Y and Xia F 2019 Wtmed at mediqa 2019A hybrid approach to biomedical natural language inference In Proceedings of the18th BioNLP Workshop and Shared Task 415ndash426 (cited on page 20)

Zhu W Zhou X Wang K Luo X Li X Ni Y and Xie G 2019 Panlpat mediqa 2019 Pre-trained language models transfer learning and knowledgedistillation In Proceedings of the 18th BioNLP Workshop and Shared Task 380ndash388(cited on page 19)

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 54: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Appendices

43

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 55: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Appendix

Appendix 1 Project Descriptionand Contract

45

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 56: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

From Zhenchang Xing ltzhenchangxinganueduaugtSubject FW project examinerDate 29 July 2019 at 92306 am AESTTo Xi Chen ltu6013686anueduaugt Jingjing Shi ltu5792801anueduaugt Yuchao Zhang ltu6715243anueduaugt

Hi Please use this email as the confirmation for your project examiner Thanks BestZhenchang From Yu Lin [mailtoyulinanueduau] Sent Sunday 28 July 2019 140 PMTo Zhenchang XingSubject Re project examiner Hi Zhenchang Thanks for your email I am happy to be the examiner for these students CheersYu On Sun Jul 28 2019 at 1009 AM Zhenchang Xing ltzhenchangxinganueduaugt wroteHi Yu Can I invite you as the project examiner for the two projects Thanks - Xi Chen Deep Learning Based User Interface Search- Yuchao Zhang Multi-Task Learning for Across Domain Question Answering- Jingjing Shi Question Entailment for Crowdsourcing More Data BestZhenchang

-- ---------------------------------------------------------------Yu LinRoom 425 Building 145Research School of Computer ScienceThe Australian National UniversityOffice +61 2612 54509Email yulinanueduau

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 57: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

Appendix

Appendix 2 Artefact Description

The project consists of three parts one is the folders which store source data and thegenerated data one is py files and the other one is ipynb files

For foldersdata it includes the original data files and the generated data files by usng RQE

and NLIhealth24 Figure 1 is the tree structure of health24 file itempy middlewarepy

pipelinespy settingspy and healthpy are the files to crawl data from Health 24Website resultcsv stores the result In this folder the framework is provided byScrapy 1

raw_nli_data and raw_rqe_data store the source dataresult store the generated labels by using RQE and NLI test model

For py filesbm25py tfidfpy and word_embeddingpy are three files to filter the similar

question pairs The results are most_similar_bm25csv most_similar_tfidfcsv andmost_similar_word_emcsv tfidfpy is modified based on ChineseSimilaritygensim-tfidf 2

rqe_data_generatorpy and nli_data_generatorpy are the files to generate suit-able dataframes when fine-tuning and testing Bert model

convert_examples_to_featurespy and toolpy are the files to store feature func-tion and processors when fine-tuning and testing Bert model These codes are fromrun_classifierpy in huggingface3

load_filepy convert the raw source data to dataframepre_processed_data pre-processing the sentences

For ipynb files

1Scrapy httpsscrapyorg2ChineseSimilarity-gensim-tfidf httpsgithubcomyip522364642

ChineseSimilarity-gensim-tfidfblobmasterChineseSimilarityCalculationpy3run_classifierpy httpsgithubcomhuggingfacetransformersblob

7cc35c31040d8bdfcadc274c087d6a73c2036210examplesrun_classifierpy

49

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file

  • Acknowledgments
  • Abstract
  • Contents
  • Introduction
    • Motivation
    • Main Contribution
    • Outline
      • Background
        • Background on NLP
          • Neural Network Language Model
          • RNN and LSTM
          • Transformer
            • Word Embedding
            • Static Word Embedding
              • Word2Vec
              • GloVe
                • Dynamic Word Embedding
                  • Pre-training
                  • ELMO
                  • BERT
                    • Tasks
                      • Recognizing question entailment (RQE)
                      • Natural Language Inference (NLI)
                          • Related Work
                            • RQE
                            • NLI
                            • Augment Data Size
                              • Construction of Health24 Data set
                                • Motivation
                                • Collecting Data set
                                • Analysis for Data set
                                  • Methodology
                                    • Pre-processing
                                    • Fine-Tuning
                                    • Method
                                    • Similarity Scores Methods
                                      • Term Frequency-Inverse Document Frequency (TF-IDF)
                                      • Best Matching 25 (BM25)
                                      • Word2Vec
                                        • Evaluation Metrics
                                          • Discussion on Results
                                          • Conclusion and Future Work
                                            • Conclusion
                                            • Future Word
                                              • Bibliography
                                              • Appendices
                                              • Appendix 1 Project Description and Contract
                                              • Appendix 2 Artefact Description
                                              • Appendix 3 Readme File
Page 58: Applying Natural Language Inference or Question Entailment ... · Inference or Question Entailment for Crowdsourcing More Data Jingjing Shi A report submitted for the course COMP8755

50 Appendix 2 Artefact Description

Figure 1 The tree structure of health24 file

train_bertipynb and testipynb the former one is to fine-tune Bert model anddo k-fold cross validation The latter is to generate label for the test sets by usingfine-tuned mode These codes are modified based on run_classifierpy in hugging-face

py files run in PyCharm ipynb files run in Google Colab4

4Colab httpscolabresearchgooglecom

Appendix

Appendix 3 Readme File

For this project it contains two kind of files one is py file and the other one isipynbfile For py files they are used to process the data For ipynb files they are used tofine-tuning evaluating and generating labels ipynb files run in CoLab

Preparationpip install jsonlinespip install scrappip install scispacypip install tqdmpip install genismpip install pandaspip install numpypip install nltkpip install httpss3-us-west-2amazonawscomai2-s2-scispacyreleasesv024en_ner_bionlp13cg_md-024targz

Run health24cd health24scrapy crawl healthCovert raw rqe data and nli data to csv filepython load_filepy (this may take long time)Use TF-IDF to find similar question pairspython tfidfpyUse Bm25 to find similar question pairspython bm25pyUse word2vec to find similar question pairspython word_embeddingpyConvert generated similarly question pairs to data frame (this part may take longtime)python rqe_convert_to_dfpy

Fine-tune model1 unzip pre-trained model scibert_scivocab_uncasedzip and upload the folder(scibert_scivocab_uncased) to google drive

51

52 Appendix 3 Readme File

2 Upload fine-tuneipynb to Colab3 Zip data folder upload datazip convert_examples_to_featurespy and toolpy tofine-tuneipynb4 Run each call in fine-tuneipynb5 If want to fine-tune the rqe model run the second last cell If want to find-tunethe nli model run the last cell At this time we run both The fine-tuned models willbe uploaded to google drive

Evaluate original rqe and nli data1 Upload evaluate_bertipynb to CoLab2 Zip data folder then upload datazip convert_examples_to_featurespy and toolpyto evaluate_bertipynb3 follow the instruction in Colab to evaluate original rqe and nli data

To run evaluate_bertipynb and testipynb1 Upload evaluate_bertipynb and testipynb to colab2 Zip data folder upload datazip convert_examples_to_featurespy and toolpy toevaluate_bertipynb and testipynb3 Run all the cells4 labelcsv files are the results of generated labels for pairs They can also be foundin result file
