
Proceedings of the First Workshop on Scholarly Document Processing, pages 127–137, Online, November 19, 2020. © 2020 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17


ERLKG: Entity Representation Learning and Knowledge Graph based association analysis of COVID-19 through mining of unstructured biomedical corpora

Sayantan Basu 1  Sinchani Chakraborty 2  Atif Hassan 3  Sana Siddique 4  Ashish Anand 5

1,5 Indian Institute of Technology Guwahati
2,3 Indian Institute of Technology Kharagpur
4 Eras Lucknow Medical College and Hospital
1 [email protected]  2 [email protected]

Abstract

We introduce a generic, human-out-of-the-loop pipeline, ERLKG, to perform rapid association analysis of any biomedical entity with other existing entities from a corpus of the same domain. Our pipeline consists of a Knowledge Graph (KG) created from the Open Source CORD-19 dataset by fully automating the procedure of information extraction using SciBERT. The best latent entity representations are then found by benchmarking different KG embedding techniques on the task of link prediction using a Graph Convolution Network Auto Encoder (GCN-AE). We demonstrate the utility of ERLKG with respect to COVID-19 through multiple qualitative evaluations. Due to the lack of a gold standard, we propose a relatively large intrinsic evaluation dataset for COVID-19 and use it for validating the top two performing KG embedding techniques. We find TransD to be the best performing KG embedding technique, with Pearson and Spearman correlation scores of 0.4348 and 0.4570 respectively. We demonstrate that a considerable number of ERLKG's top protein, chemical and disease predictions are currently in consideration for COVID-19 related research.

1 Introduction

COVID-19 is a global epidemic with a considerable fatality rate and a high transmission rate, affecting millions of people world-wide since its outbreak.1

The search for treatments and possible cures for the novel Coronavirus (Wang et al., 2020b) has led to an exponential increase in scientific publications, but the challenge lies in effectively processing, integrating and leveraging related sources of information.

1 https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200811-covid-19-sitrep-204.pdf?sfvrsn=1f4383dd

Rapid and effective utilization of literature during times of pandemic such as COVID-19 is of utmost importance in combating the disease. In this paper, we introduce a fully automated generic pipeline consisting of an Information Extraction (IE) system followed by Knowledge Graph construction. The IE module uses SciBERT (Beltagy et al., 2019) for performing Named Entity Recognition (NER) and Relationship Extraction (RE). The entire entity extraction procedure is fully automated and no human expertise is used. The major goal is to ensure rapid access to relevant data through a structured representation of free-text articles. Following this, we focus on the task of association analysis of essential biomedical entities, namely, proteins, diseases, and chemicals. Such entities are well explored in existing literature, and an analysis of their relatedness to COVID-19 is provided by leveraging the CORD-19 Open Research Dataset (Wang et al., 2020a). This can assist physicians in accelerating knowledge discovery and provide support for clinical decision making. The dataset and related resources of this paper are made public.2

Due to a lack of gold standard information, we perform extensive qualitative evaluations in order to show that our system does not suffer from redundancy or bias. These evaluations include performance on a link prediction task and intrinsic evaluation. For the former, KG embeddings along with the graph adjacency matrix are fed to a GCN-AE (Kipf and Welling, 2016) model to perform link prediction. Average Precision (AP) and ROC scores were used to benchmark different KG embeddings on the generated knowledge graph. For the intrinsic evaluation, we propose a new dataset that has been developed with the help of three physicians and benchmark our embeddings against it. Finally,

2 https://github.com/sayantanbasu05/ERKLG


based on the cosine similarity score, the best representation was used to predict the top chemicals, proteins and diseases related to COVID-19. The contributions of our approach are as follows:

1. We propose a fully automated, human-out-of-the-loop, end-to-end generic pipeline for rapidly determining the association of any biomedical entity of interest with other existing, well-explored entities.

2. We benchmark multiple KG embedding techniques on the task of link prediction and demonstrate that simple embedding methods provide comparable performance on straightforward structured KGs.

3. We introduce two human gold-standard entity lists, COV19 25 and COV19 729. The former consists of expert ratings for 25 entities predicted by ERLKG, while the latter consists of expert ratings for 729 entities sampled from the CORD-19 dataset. The ratings are based on every entity's relatedness with respect to COVID-19.

2 Related Work

We mostly focus on recent works centered around the CORD-19 dataset, discussing the techniques used for IE and KG generation.

2.1 Entity and Relation Extraction

Most recent NLP systems use language models pretrained on unannotated text, like ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and XLNet (Yang et al., 2019). In the biomedical and clinical domains, BERT-based architectures pretrained with domain-specific unlabelled text have been used for IE (Lee et al., 2020; Alsentzer et al., 2019). The CORD-19 dataset, curated for the COVID-19 pandemic, integrates related scientific articles for various information retrieval tasks (Roberts et al., 2020). Multiple NLP applications have been developed around CORD-19, like Question Answering (Das et al., 2020), Summarization (Park, 2020), NER (Wang et al., 2020c), etc.

2.2 Knowledge Graph

KGs have been used extensively in different fields like Life Science (Chen et al., 2009), Decision Support Systems (Russell and Norvig, 2010), etc. Using the CORD-19 dataset and many other textual sources, KGs have been built and used for performing different tasks that aid in knowledge discovery. Chen et al. (2020) perform NER using BioBERT on the CORD-19 and PubMed datasets (Dernoncourt and Lee, 2017) while developing a Coronavirus KG from the PubMed KG, based on two different methods, namely, cosine similarity and co-occurrence frequency, to predict plausible drugs. Wang et al. (2020b) construct a KG termed COVID-KG by extracting multimodal knowledge from existing scientific literature and ontology, followed by a QA system built on top of this information with an aim to answer questions related to drug repurposing. Comparatively smaller KGs have been constructed for COVID-19, like that of Domingo-Fernandez et al. (2020), which covers 145 articles consisting of 3945 nodes and 9484 relations covering 10 entity types. Previously built KGs have also been employed for COVID-19 drug discovery (Richardson et al., 2020). However, the scope of the network built by the last two methods is limited owing to the smaller dataset size. Also, to learn node representations and leverage the structural information of the graph, various techniques are used for Knowledge Graph embeddings. Rossi et al. (2020) conduct an extensive survey of 16 KG embedding techniques to perform a comparative analysis. They form a taxonomy of the embedding methods, grouping them into tensor decomposition models like DistMult (Yang et al., 2015), Geometric models like TransE (Bordes et al., 2013), TransD (Ji et al., 2015), ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019), and Deep Learning models like ConvE (Dettmers et al., 2018) and CapsE (Nguyen et al., 2019). Shifting from textual sources to construct a KG, Ray et al. (2020) use biological interaction networks like drug-protein and protein-protein networks to predict repurposable drugs for SARS-CoV-2 through link prediction, employing Variational Graph AutoEncoders with features from Node2Vec (Grover and Leskovec, 2016) for entity representation.

3 Dataset

3.1 CORD-19

The CORD-19 corpus (Wang et al., 2020a) was published by Allen AI in association with the White House and other organizations. It was made publicly available on the Kaggle 3 platform as part of an open research challenge. The data, containing scholarly articles, is collected from sources like PubMed Central (PMC), PubMed, the World Health Organization's COVID-19 Database, and various preprint servers like bioRxiv, medRxiv and arXiv.

3 https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge

The CORD-19 corpus (2020-05-12) contains a pool of 138,000 scholarly articles, with 69,000 full-text articles related to COVID-19, SARS-CoV-2, etc. Each paper is associated with bibliographic metadata such as Title, Author, etc., as well as unique identifiers such as a DOI, PubMed Central ID, etc. Various sub-tasks have been identified for effective information retrieval; however, the corpus lacks task-oriented ground truth data. We merge all the metadata with the corresponding full-text papers and retain the title, abstract and full text from the corpus.

3.2 Datasets for Fine-tuning SciBERT

For NER, we consider the following three datasets: the JNLPBA corpus (Collier and Kim, 2004), which consists of 5 distinct tags (Protein, DNA, RNA, Cell line and Cell type); the CHEMDNER corpus (Krallinger et al., 2015), which consists of the tags Abbreviation, Family, Formula, Identifiers, Multiple, Systematic and Trivial; and the NCBI Disease Corpus (Dogan et al., 2014), which is used to identify only disease mentions.

For RE, the following datasets are used: CHEMPROT (Kringelum et al., 2016), which consists of 13 different relationship types based on identified positive associations (Inhibitor, Substrate, Indirect-Down regulator, Indirect-Up regulator, Activator, Antagonist, Product-Of, Agonist, Down regulator, Up regulator, Agonist-Activator, Agonist-Inhibitor and Substrate-Product-Of), and BC5CDR (Li et al., 2016), which captures binary relations predicting positive or negative interactions for chemical-induced-disease pairs.

4 ERLKG

In this section, we discuss the entire pipeline and its various components. Figure 1 depicts the pipeline, which consists of the following modules: Preprocessing, Named Entity Recognition (NER), Relation Extraction (RE) and Knowledge Graph (KG) construction. The rest of Figure 1 depicts the evaluation strategies adopted for a reliable association analysis of various chemical, protein and drug entities from the CORD-19 corpus with respect to COVID-19.

4.1 Preprocessing

Each abstract or full text was split into sentences using the NLTK (Loper and Bird, 2002) sentence tokenizer, and the sentences, in turn, were tokenized using the Spacy (v2.0.10) tokenizer.4 Following this, we removed all the non-functional tokens and attached POS tags to the remaining tokens.
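The preprocessing stage above can be sketched roughly as follows. This is an illustrative stand-in only: the paper uses the NLTK sentence tokenizer and the spaCy word tokenizer, while this dependency-free sketch substitutes naive regex splitting and omits POS tagging.

```python
import re

def preprocess(text):
    """Split raw text into sentences, then into tokens, dropping
    punctuation-only "non-functional" tokens. Simplified stand-in for
    the NLTK + spaCy pipeline described in the paper."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = []
    for sent in sentences:
        # keep word-like tokens (letters/digits, internal hyphens allowed)
        tokens = re.findall(r"\w[\w\-]*", sent)
        if tokens:
            tokenized.append(tokens)
    return tokenized

print(preprocess("SciBERT was fine-tuned. It tags entities!"))
# [['SciBERT', 'was', 'fine-tuned'], ['It', 'tags', 'entities']]
```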

4.2 Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying domain-specific proper nouns in a sentence. In order to gain meaningful insights about the major classes of biomedical entities present in the dataset, it was necessary to tag the entities using an NER module fine-tuned on various biomedical datasets. Since the CORD-19 dataset is a collection of scientific articles, we use SciBERT for NER. SciBERT is a variant of the BERT (Devlin et al., 2019) model and is pretrained on a scientific corpus of 1.14M articles, where 82 percent of the literature comes from the biomedical domain and the rest from various computer science domains. In order to extract chemical, protein and disease entities, SciBERT is fine-tuned on different task-specific datasets one-by-one, namely, JNLPBA (Collier and Kim, 2004), CHEMDNER (Krallinger et al., 2015) and the NCBI Disease Corpus (Dogan et al., 2014), to obtain protein, chemical and disease annotations respectively.

We use the SciBERT-scivocab-uncased model for NER extraction. The input to the SciBERT model is the pre-processed dataset modified according to the tokenization of BERT. The output of the model consists of the input sentence along with labels according to the BIO scheme, where “B” stands for the Beginning of an entity tag, “I” stands for Inside of an entity tag and “O” means Outside the entity, as can be seen in the NER module of Figure 1.
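The BIO decoding described above can be illustrated with a short helper that converts token-level labels back into entity mentions. This is an illustrative sketch rather than the authors' code, and the entity names in the example are made up.

```python
def bio_to_spans(tokens, tags):
    """Collect entity mentions from BIO labels: "B-X" begins an entity
    of type X, "I-X" continues it, "O" is outside any entity.
    Returns (entity_text, entity_type) pairs."""
    spans, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # close any entity still open
                spans.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" (or a stray "I-") closes the open entity
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

print(bio_to_spans(
    ["spike", "glycoprotein", "binds", "ACE2"],
    ["B-PROTEIN", "I-PROTEIN", "O", "B-PROTEIN"],
))
# [('spike glycoprotein', 'PROTEIN'), ('ACE2', 'PROTEIN')]
```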

Due to the lack of a human gold standard dataset for NER on the CORD-19 data, we do not retain the obtained fine-grained entity annotations. Following the NER tagging, we therefore tag the Protein, DNA and RNA entities extracted upon fine-tuning on the JNLPBA dataset simply as PROTEIN, the CHEMDNER entities as CHEMICAL and the NCBI Disease Corpus entities as DISEASE. We drop all entities with the tags Cell line and Cell type, as they could not be merged into any existing categories.
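The coarse-grained merging rule above might be expressed as a small mapping (an illustrative sketch; the source dataset name strings are assumptions, not the authors' identifiers):

```python
# JNLPBA fine-grained tags collapsed per the rule above; Cell line and
# Cell type map to None, i.e. dropped.
JNLPBA_COARSE = {"Protein": "PROTEIN", "DNA": "PROTEIN", "RNA": "PROTEIN"}

def coarse_tag(fine_tag, source):
    """Map a fine-grained NER tag to PROTEIN / CHEMICAL / DISEASE,
    or None if the entity should be discarded."""
    if source == "JNLPBA":
        return JNLPBA_COARSE.get(fine_tag)  # None for Cell line / Cell type
    if source == "CHEMDNER":
        return "CHEMICAL"  # all CHEMDNER classes collapse to CHEMICAL
    if source == "NCBI":
        return "DISEASE"
    return None

print(coarse_tag("DNA", "JNLPBA"), coarse_tag("Cell line", "JNLPBA"),
      coarse_tag("Trivial", "CHEMDNER"))
# PROTEIN None CHEMICAL
```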

4 https://spacy.io/api/tokenizer


Figure 1: ERLKG Pipeline

4.3 Relation Extraction

From the NER module we obtain an annotated dataset. To further exploit the underlying information present in the running sentences, we perform intra-sentence Relationship Extraction (RE), which is the task of identifying relationships between any two named entities present within a sentence. Using this RE module, we try to identify the relationships that different pairs of entities have at the sentence level. The output from the NER module was further processed in order to discover sentences containing more than one entity. For a given set of entities, E, in a sentence, the sentence is split into C(|E|, 2) = |E|(|E| − 1)/2 instances, one per entity pair. So, a single sentence is represented as: X = {e1, e2, w1...wn}, where e1 and e2 are two tagged entities and wj is the jth word in the sentence.
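The instance construction above — one RE example per unordered entity pair, each keeping the full word sequence as context — can be sketched with itertools.combinations (the entity names in the example are hypothetical):

```python
from itertools import combinations

def re_instances(words, entities):
    """Expand one tagged sentence into C(|E|, 2) RE instances,
    one per unordered entity pair, mirroring X = {e1, e2, w1...wn}."""
    return [{"e1": e1, "e2": e2, "words": words}
            for e1, e2 in combinations(entities, 2)]

pairs = re_instances(["remdesivir", "inhibits", "RdRp", "in", "SARS-CoV-2"],
                     ["remdesivir", "RdRp", "SARS-CoV-2"])
print(len(pairs))  # 3 entities -> C(3, 2) = 3 instances
```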

An approach similar to that of the NER module is taken, employing SciBERT for identifying relations from sentences through contextual evidence. We fine-tune SciBERT on two datasets, CHEMPROT (Kringelum et al., 2016) and BC5CDR (Li et al., 2016), to capture relations between chemical-protein and chemical-disease pairs.

Following the RE task on the CORD-19 data, we combine the 13 different types of associations obtained upon fine-tuning on CHEMPROT into a single relation type called CHEMICAL-PROTEIN. Similarly, only the positive associations obtained upon fine-tuning on BC5CDR were retained as CHEMICAL-INDUCED-DISEASE. This ensures that less error is propagated in the absence of gold labels for RE. It also makes sure that the subsequent tasks of obtaining the KG and learning latent entity representations are not misguided during their training phase.

Task              Types                      # of instances   Total instances
NER Tagged        CHEMICAL                   6153             64593
Entities          PROTEIN                    42108
                  DISEASE                    16332
Relation Pairs    CHEMICAL-PROTEIN           110485           111916
                  CHEMICAL-INDUCED-DISEASE   1431

Table 1: Statistics of the processed CORD-19 dataset from the NER and RE Modules

4.4 Knowledge Graph Construction

Statistics of the consolidated set of entity mentions and relation pairs obtained as a result of NER and RE on the CORD-19 dataset can be seen in Table 1. To obtain an overview of the different entities and their associations with each other, we generate a KG, which is a good association representation of the entire unstructured CORD-19 dataset.

We construct a KG which is defined as KG = (E, R, G), where,

• E: a set of nodes representing disease/protein/drug entities

• R: a set of labels representing chemical-protein or chemical-disease relations

• G ⊆ E × R × E: a set of edges that represent facts connecting entity pairs.

Each fact is a triple 〈h, r, t〉, where h is the head, r is the relation, and t is the tail of the fact.
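A minimal sketch of this KG = (E, R, G) definition in code, assuming nothing about the authors' implementation (the example facts are hypothetical):

```python
class KnowledgeGraph:
    """Nodes E, relation labels R, and facts G as <h, r, t> triples."""

    def __init__(self):
        self.E, self.R, self.G = set(), set(), set()

    def add_fact(self, h, r, t):
        # adding a fact registers both endpoint entities and the label
        self.E.update([h, t])
        self.R.add(r)
        self.G.add((h, r, t))

kg = KnowledgeGraph()
kg.add_fact("remdesivir", "CHEMICAL-PROTEIN", "RdRp")
kg.add_fact("remdesivir", "CHEMICAL-INDUCED-DISEASE", "nausea")
print(len(kg.E), len(kg.R), len(kg.G))  # 3 2 2
```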

4.5 COV19 729

After generating the KG, a list of all entities was supplied to a physician, who grouped the terms into 3 categories based on their relatedness to COVID-19, i.e., NOT RELATED, PARTIALLY RELATED and HIGHLY RELATED. It was identified that the number of entities in the HIGHLY RELATED group is much smaller in comparison to the other two categories. Thus, in order to reduce bias, the physician sampled a nearly equal number of entities from each group, resulting in a final dataset comprising 729 entities, named COV19 729. This dataset was then shuffled and passed on to two independent physicians, who provided ratings for each sample indicating how related an entity is to COVID-19 on a scale of 0 (NOT RELATED) to 5 (HIGHLY RELATED).

The inter-rater agreement score (kappa score) is found to be 0.5116, which lies in the moderate agreement range. We therefore average the ratings and propose a relatively large intrinsic evaluation dataset called COV19 729 for benchmarking COVID-19 related embedding techniques. Table 4 shows a snapshot of COV19 729.
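The inter-rater agreement above is a kappa statistic. The paper does not state which variant was used, so the following is only a sketch of plain (unweighted) Cohen's kappa, which corrects observed agreement for chance agreement; the toy ratings are hypothetical.

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # chance agreement from each rater's marginal label distribution
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# hypothetical ratings on the 0-5 relatedness scale
print(round(cohen_kappa([0, 5, 3, 3, 1], [0, 5, 3, 2, 1]), 3))  # 0.75
```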

5 Experiment and Results

5.1 Implementation Details

To generate the intrinsic evaluation dataset, the total list of 78K entities present in our KG is reduced to 5K by removing all entities having an in-degree of less than 5. This is done in order to reduce noise; after experimenting with multiple values, a threshold of 5 provided the highest signal-to-noise ratio. For fine-tuning SciBERT, all hyper-parameters are left at their default values except the truncate-long-sequences parameter, which is set to false. For training the KG embeddings in OpenKE (Han et al., 2018), the dimension is set to 400 and the rest of the parameters are kept at their defaults. In the case of the GCN-AE (Kipf and Welling, 2016) for the link prediction task, the learning rate is set to 0.01, epochs to 200, and the hidden units in the first and second layers to 32 and 16, respectively.
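The in-degree thresholding described above might be sketched as follows (illustrative only, with hypothetical entity names; the paper's 78K-to-5K reduction is over the real KG):

```python
from collections import Counter

def filter_by_indegree(triples, min_in=5):
    """Keep only entities whose in-degree (appearances as a triple tail)
    is at least min_in, discarding low-connectivity, noisy nodes."""
    indeg = Counter(t for _, _, t in triples)
    return {e for e, d in indeg.items() if d >= min_in}

triples = [("c%d" % i, "CHEMICAL-PROTEIN", "ACE2") for i in range(6)] + \
          [("c0", "CHEMICAL-PROTEIN", "rare_protein")]
print(filter_by_indegree(triples))  # only ACE2 survives the threshold
```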

5.2 Link Prediction

Latent entity representation learning on the constructed KG is crucial so that one can effectively analyze associations of any given biomedical entity with respect to COVID-19. Rather than randomly choosing a method, we first evaluate popular KG embedding techniques on the downstream task of Link Prediction. We consider Node2Vec (Grover and Leskovec, 2016), tensor decomposition models like DistMult (Yang et al., 2015), and Geometric models, namely, TransE (Bordes et al., 2013), TransD (Ji et al., 2015), ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019).


The test and validation sets are created from the removed edges, with the addition of an equal number of randomly sampled false links (pairs of nodes that do not have connections in the graph). The test and validation sets contain 10 percent and 5 percent of the true links, respectively. We use OpenKE (Han et al., 2018), an open-source framework for knowledge embedding techniques. The results are reported based on each model's performance on the test set. The embeddings resulting from these methods are treated as features and, along with the graph adjacency matrix, are fed to the GCN-AE (Kipf and Welling, 2016). The Average Precision and ROC score of each setting are noted and used to benchmark these embedding types, as can be seen in Table 2.

Method                                  ROC     AP
RotatE (Sun et al., 2019)               0.858   0.887
TransD (Ji et al., 2015)                0.860   0.883
TransE (Bordes et al., 2013)            0.853   0.877
DistMult (Yang et al., 2015)            0.855   0.883
ComplEx (Trouillon et al., 2016)        0.852   0.881
Node2Vec (Grover and Leskovec, 2016)    0.821   0.849

Table 2: Link Prediction performance of different KG embedding techniques on the test set using GCN-AE
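The evaluation-set construction (held-out true links paired with an equal number of sampled false links) and the Average Precision metric behind Table 2 can be sketched as below. This is a common formulation of AP over a ranked edge list, assumed rather than taken from the authors' code.

```python
import random

def average_precision(labels, scores):
    """AP over a ranked edge list: mean of precision@k taken at the
    rank of each true link (labels are 1 for true links, 0 for false)."""
    ranked = [l for _, l in sorted(zip(scores, labels), reverse=True)]
    hits, total = 0, 0.0
    for k, label in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / hits

def make_eval_set(true_edges, all_nodes, rng):
    """Pair held-out true links with an equal number of randomly
    sampled node pairs that are not connected in the graph."""
    edge_set = set(true_edges)
    negatives = set()
    while len(negatives) < len(true_edges):
        u, v = rng.sample(all_nodes, 2)
        if (u, v) not in edge_set:
            negatives.add((u, v))
    return [(e, 1) for e in true_edges] + [(e, 0) for e in negatives]

print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 3))  # 0.833
```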

From Table 2, RotatE performs best among all KG embeddings in terms of Average Precision, while TransD outperforms the rest in terms of ROC score. Models like TransE capture inversion and composition patterns well, whereas models like DistMult capture symmetrical relationships; RotatE, however, captures all of these aspects: symmetry, anti-symmetry, inversion and composition. TransD also has a performance similar to RotatE. This is because, in our setting, every relationship pair has head and tail entities of different entity types (either chemical-protein or chemical-disease). The inherent property of TransD of separating the head and tail entity spaces is useful for modeling this graph structure, giving results comparable to RotatE.

Node2Vec performs relatively poorly, since it relies on the internal mechanism of grouping nodes with identical connection patterns, which could be less frequent in our KG, as it is not derived from an interaction network but is rather constructed from entities and relations obtained from free text.

5.3 Intrinsic Evaluation

We conduct an Intrinsic Evaluation, where Table 3 shows the performance of the TransD and RotatE embedding methods in terms of Pearson and Spearman correlation scores between the ratings and the cosine similarity scores of entities on the COV19 729 dataset. The cosine similarity scores for each entity were generated with respect to the COVID-19 embedding vector obtained from our proposed pipeline. However, most of the top entities generated by our two best methods, TransD and RotatE (selected on the basis of the link prediction task), were not present in COV19 729, since that dataset was randomly sampled. In our view, these entities require immediate attention and hence, we conduct another round of scoring to evaluate them and, in the process, propose COV19 25.

Entity List            Spearman Correlation   Pearson Correlation
COV19 729 (TransD)     0.2186                 0.2117
COV19 729 (RotatE)     0.1933                 0.1879
COV19 25 (TransD)      0.4570                 0.4348
COV19 25 (RotatE)      0.4240                 0.4105

Table 3: Pearson and Spearman Correlation values between the ratings and the cosine similarity scores of 729 randomly sampled entities and 25 pipeline-predicted entities with respect to the COVID-19 vector
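The intrinsic evaluation pairs physician ratings with each entity's cosine similarity to the COVID-19 vector and reports rank correlations. A dependency-free sketch of that computation, using toy 2-dimensional vectors and hypothetical ratings (the real embeddings are 400-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson computed on ranks (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, idx in enumerate(order, start=1):
            r[idx] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

covid = [1.0, 0.0]                               # toy COVID-19 vector
entities = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]  # toy entity vectors
sims = [cosine(v, covid) for v in entities]
ratings = [5, 0, 2]                              # hypothetical 0-5 ratings
print(round(spearman(sims, ratings), 3))         # perfect rank agreement: 1.0
```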

5.3.1 COV19 25

The top 100 predicted entities from TransD and RotatE were selected and an intersection of the generated entities was taken, which was then passed on to a physician. The physician recommended a list of 25 relevant entities out of the provided set. This list was then sent to another physician, who rated the entities based on their relatedness to COVID-19. This dataset was named COV19 25.

It is evident from Table 3 that TransD has the highest Pearson and Spearman scores on the COV19 729 and COV19 25 datasets. Hence, we use TransD as the final embedding generation method for ERLKG.

6 Discussion

We exploit the contextual evidence from the CORD-19 corpus in finding entities and relations. This is followed by KG construction for determining the relatedness between any biomedical entity with respect to COVID-19. A simple co-occurrence matrix based method is not sufficient to capture the different relationship association types. We therefore use the state-of-the-art SciBERT for the purpose of entity and relationship extraction. We construct a KG from entity pairs and the relationships among them. Our aim was to utilize this KG for effective association analysis, for which identifying the best entity representation was necessary. We therefore conduct a link prediction task and evaluate popular KG embedding techniques. Since our KG consists of a simple bare-bone structure, deep learning based KG embedding methods like ConvE were not explored in this work. This is because such methods lead to an increase in the number of hyperparameters while providing little to no explainability.

Entity                           Tag        Cosine with COVID-19   Rating by physician
retinoic acid inducible gene-1   protein    -0.079917936           0
hydroxyprolinol                  chemical   -0.018277158           2
acute asthma attacks             disease    0.05297136             1
pc18                             chemical   0.153728574            1
s1 domain                        protein    0.166142751            2
immunodominant epitopes          protein    0.189406748            2
nsp1                             protein    0.202800478            3
hcov whole genomes               protein    0.306899184            1
spike glycoprotein               protein    0.413827424            4
receptor binding domain          protein    0.464432383            5

Table 4: Scores given by raters on a few samples from COV19 729. The cosine similarity scores are generated from TransD embeddings

We face the challenge of an absence of ground truth data for the CORD-19 corpus. Thus, we conduct extensive qualitative evaluations and, in the process, introduce two gold-standard, annotated entity lists, COV19 25 and COV19 729. COV19 25 consists of 25 entities predicted by the top two embedding techniques, TransD and RotatE, while COV19 729 consists of 729 entities sampled from the processed CORD-19 dataset. The ratings were based on an entity's relatedness to COVID-19. From the correlation scores (Table 3) of our intrinsic evaluation, we observe that our model can provide considerable insight in predicting important associations with respect to COVID-19.

TransD has the highest Pearson and Spearman scores on the COV19 729 and COV19 25 datasets. Hence, we use TransD as the final embedding generation method for ERLKG. Figures 2, 3 and 4 show the top related proteins, chemicals and diseases that ERLKG, using TransD embeddings, predicted with respect to COVID-19. Without using any external knowledge resources, our pipeline predicts various chemicals, proteins and diseases that are highly related to COVID-19. These predicted entities could help the biomedical community get a better understanding of COVID-19. A few top chemicals like Mitoxantrone (Giovannoni et al., 2020), Carfilzomib (Iyer et al., 2020), Flutamide (Cava et al., 2020), Bortezomib (Al Saleh et al., 2020), and Lopinavir and Ritonavir (Cao et al., 2020) are being considered as potential cures for the virus. From the predicted protein list, entities like PARP1 (Kouhpayeh et al., 2020), Spike protein (Bosch et al., 2003), Lactate Dehydrogenase (Han et al., 2020) and NSP1 (Thoms et al., 2020) have direct relevance to COVID-19. Entities like Ventricular tachycardia (VT) (Wu et al., 2020), Myasthenic crisis (Delly et al., 2020), Acute Respiratory Syndrome (Lai et al., 2020), ARDS (Respiratory distress syndrome) (Marini and Gattinoni, 2020) and Thrombocytopenia (Lippi et al., 2020) are a few diseases that are very likely to occur in patients suffering from COVID-19.

Figure 2: COVID-19 related Proteins based on cosine similarity obtained from ERLKG. Entities color coded Green signify a higher cosine similarity value compared to entities color coded Yellow.

Figure 3: COVID-19 related Chemicals based on cosine similarity obtained from ERLKG. Entities color coded Green signify a higher cosine similarity value compared to entities color coded Yellow.

Figure 4: COVID-19 related Diseases based on cosine similarity obtained from ERLKG. Entities color coded Green signify a higher cosine similarity value compared to entities color coded Yellow.

7 Conclusion and Future Work

We propose ERLKG, a generic pipeline, for association analysis with respect to a given entity from an unstructured dataset. The part of the pipeline integrating IE and KG construction keeps humans out of the loop. In order to learn the latent representation of the formed KG, we first benchmark various types of KG embedding techniques on the task of Link Prediction. According to our experiments, we find TransD and RotatE to produce comparable performance.

In this work, our approach is evaluated only on the CORD-19 dataset, and no additional resources have been employed. However, due to the lack of gold standard data, we introduce COV19 729, which is a list of named entities extracted from our pipeline, selected randomly, and given to physicians for assigning association scores with respect to COVID-19. Owing to the random selection, most of the entities listed with greater association scores by TransD and RotatE were found to be missing from COV19 729; hence another set, drawn from the top entities, was given to physicians, which we call COV19 25. Finally, TransD is used as our best KG embedding technique to predict top entities that are closely associated with COVID-19 from the CORD-19 corpus. As future scope, we plan to implement a normalization and abbreviation expansion module after the detection of entities. The study of these top predicted entities by domain experts can help them understand the different types of associations and relationships they exhibit with respect to COVID-19.

Acknowledgments

The authors acknowledge the Department of Biotechnology, Govt. of India for the financial support for the project BT/COE/34/SP28408/2018. The authors would also like to thank Dr. Shahid Aslam, Department of General Medicine, AMRI Dhakuria Hospitals Kolkata and Dr. Khalid Iqbal, Assistant Professor, Eras Lucknow Medical College and Hospital, for helping with the Intrinsic Evaluation datasets, and Sanket Wakade, Department of Design, IIT Guwahati for helping with illustrations. Besides, the authors would like to thank the anonymous reviewers for their valuable comments and feedback.

References

Abdullah S Al Saleh, Taimur Sher, and Morie A Gertz. 2020. Multiple myeloma in the time of covid-19. Acta haematologica, pages 1–7.

Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDermott. 2019. Publicly available clinical bert embeddings. ArXiv, abs/1904.03323.


Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: A pretrained language model for scientific text. In EMNLP/IJCNLP.

Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In NIPS.

Berend Jan Bosch, Ruurd van der Zee, Cornelis A M de Haan, and Peter J. M. Rottier. 2003. The coronavirus spike protein is a class i virus fusion protein: Structural and functional characterization of the fusion core complex. Journal of Virology, 77:8801–8811.

Bin Cao, Yeming Wang, Danning Wen, Wen Liu, Jingli Wang, Guohui Fan, Lianguo Ruan, Bin Song, Yanping Cai, Ming Wei, et al. 2020. A trial of lopinavir–ritonavir in adults hospitalized with severe covid-19. New England Journal of Medicine.

Claudia Cava, Gloria Bertoli, and Isabella Castiglioni. 2020. In silico discovery of candidate drugs against covid-19. Viruses, 12.

Bin Chen, Xiao Dong, Dazhi Jiao, Huijun Wang, Qian Zhu, Ying Ding, and David J. Wild. 2009. Chem2bio2rdf: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics, 11:255.

Chongyan Chen, Islam Akef Ebeid, Yi Bu, and Ying Ding. 2020. Coronavirus knowledge graph: A case study. ArXiv, abs/2007.10287.

Nigel Collier and Jin-Dong Kim. 2004. Introduction to the bio-entity recognition task at jnlpba. In NLPBA/BioNLP.

Debsmita Das, Shashank Dubey, Aakash Deep Singh, Kushagra Agarwal, Sourojit Bhaduri, Rajesh Kumar Ranjan, Yatin Katyal, and Janu Verma. 2020. Information retrieval and extraction on covid-19 clinical articles using graph community detection and biobert embeddings.

Fadi Delly, Maryam Jamil Syed, Robert P. Lisak, and Deepti Zutshi. 2020. Myasthenic crisis in covid-19. Journal of the Neurological Sciences, 414:116888.

Franck Dernoncourt and J. Lee. 2017. Pubmed 200k rct: a dataset for sequential sentence classification in medical abstracts. In IJCNLP.

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2d knowledge graph embeddings. ArXiv, abs/1707.01476.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.

Rezarta Islamaj Dogan, Robert Leaman, and Zhiyong Lu. 2014. Ncbi disease corpus: A resource for disease name recognition and concept normalization. Journal of biomedical informatics, 47:1–10.

Daniel Domingo-Fernandez, Shounak Baksi, Bruce Schultz, Yojana Gadiya, Reagon Karki, Tamara Raschka, Christian Ebeling, Martin Hofmann-Apitius, and Alpha Tom Kodamullil. 2020. Covid-19 knowledge graph: a computable, multi-modal, cause-and-effect knowledge model of covid-19 pathophysiology. bioRxiv.

Gavin Giovannoni, Chris Hawkes, Jeannette Lechner-Scott, Michael Levy, Emmanuelle Waubant, and Julian Gold. 2020. The covid-19 pandemic and the use of ms disease-modifying therapies. Multiple Sclerosis and Related Disorders, 39:102073.

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Xu Han, Shulin Cao, Lv Xin, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. 2018. Openke: An open toolkit for knowledge embedding. In Proceedings of EMNLP.

Yi Han, Haidong Zhang, Sucheng Mu, Wei Wei, Chaoyuan Jin, Yuan Xue, Chaoyang Tong, Yunfei Zha, Zhenju Song, and Guorong Gu. 2020. Lactate dehydrogenase, a risk factor of severe covid-19 patients. medRxiv.

Mahalaxmi Iyer, Kaavya Jayaramayya, Mohana Devi Subramaniam, Soo Bin Lee, Ahmed Abdal Dayem, Ssang-Goo Cho, and Balachandar Vellingiri. 2020. Covid-19: an update on diagnostic and therapeutic approaches. BMB reports, 53(4):191.

Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In ACL.

Thomas Kipf and Max Welling. 2016. Variational graph auto-encoders. ArXiv, abs/1611.07308.

Shirin Kouhpayeh, Laleh Shariati, Maryam Boshtam, Ilnaz Rahimmanesh, Mina Mirian, Mehrdad Zeinalian, Azhar Salari-jazi, Negar Khanahmad, Mohammad Sadegh Damavandi, Parisa Sadeghi, et al. 2020. The molecular story of covid-19; nad+ depletion addresses all questions in this infection.

Martin Krallinger, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Dong-Hong Ji, Daniel M. Lowe, Roger A. Sayle, Riza Theresa Batista-Navarro, Rafal Rak, Torsten Huber, Tim Rocktaschel, Sergio Matos, David Campos, Buzhou Tang, Hua Xu, Tsendsuren Munkhdalai, Keun Ho Ryu, S. V. Ramanan, P. Senthil Nathan, Slavko Zitnik, Marko Bajec, Lutz Weber, Matthias Irmer, Saber A. Akhondi, Jan A. Kors, Shuo Xu, Xin An, Utpal Kumar Sikdar, Asif Ekbal, Masaharu Yoshioka, Thaer M. Dieb, Miji Choi, Karin M. Verspoor, Madian Khabsa, C. Lee Giles, Hongfang Liu, K. E. Ravikumar, Andre Lamurias, Francisco M. Couto, Hong-Jie Dai, Richard Tzong-Han Tsai, Caglar Ata, Tolga Can, Anabel Usie, Rui Alves, Isabel Segura-Bedmar, Paloma Martínez, Julen Oyarzabal, and Alfonso Valencia. 2015. The chemdner corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics, 7:S2.

Jens Kringelum, Sonny Kim Kjærulff, Søren Brunak, Ole Lund, Tudor I. Oprea, and Olivier Taboureau. 2016. Chemprot-3.0: a global chemical biology diseases mapping. Database: The Journal of Biological Databases and Curation, 2016.

Chih-Cheng Lai, Tzu-Ping Shih, Wen-Chien Ko, Hung-Jen Tang, and Po-Ren Hsueh. 2020. Severe acute respiratory syndrome coronavirus 2 (sars-cov-2) and corona virus disease-2019 (covid-19): the epidemic and the challenges. International journal of antimicrobial agents, page 105924.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, D. Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.

Jiao Li, Yueping Sun, Robin J. Johnson, Daniela Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. 2016. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database: The Journal of Biological Databases and Curation, 2016.

Giuseppe Lippi, Mario Plebani, and Brandon Michael Henry. 2020. Thrombocytopenia is associated with severe coronavirus disease 2019 (covid-19) infections: a meta-analysis. Clinica Chimica Acta.

Edward Loper and Steven Bird. 2002. Nltk: the natural language toolkit. arXiv preprint cs/0205028.

John J Marini and Luciano Gattinoni. 2020. Management of covid-19 respiratory distress. Jama.

Dai Quoc Nguyen, Thanh Vu, T. Nguyen, Dat Quoc Nguyen, and Dinh Q. Phung. 2019. A capsule network-based embedding model for knowledge graph completion and search personalization. ArXiv, abs/1808.04122.

Jong Won Park. 2020. Continual bert: Continual learning for adaptive extractive summarization of covid-19 literature. ArXiv, abs/2007.03405.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. ArXiv, abs/1802.05365.

Sumanta Ray, Snehalika Lall, Anirban Mukhopadhyay, Sanghamitra Bandyopadhyay, and Alexander Schonhuth. 2020. Predicting potential drug targets and repurposable drugs for covid-19 via a deep generative model for graphs.

Peter Richardson, Ivan Griffin, C. Tucker, D. Smith, Olly Oechsle, Anne Phelan, Michael Rawling, Edward Savory, and J. Stebbing. 2020. Baricitinib as potential treatment for 2019-ncov acute respiratory disease. Lancet (London, England), 395:e30–e31.

Kirk Roberts, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, Kyle Lo, Ian Soboroff, Ellen M. Voorhees, Lucy Lu Wang, and William R. Hersh. 2020. Trec-covid: Rationale and structure of an information retrieval shared task for covid-19. Journal of the American Medical Informatics Association: JAMIA.

Andrea Rossi, Donatella Firmani, Antonio Matinata, Paolo Merialdo, and Denilson Barbosa. 2020. Knowledge graph embedding for link prediction: A comparative analysis. ArXiv, abs/2002.00819.

Stuart J. Russell and Peter Norvig. 2010. Artificial intelligence: a modern approach, 3rd ed., global ed.

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. Rotate: Knowledge graph embedding by relational rotation in complex space. ArXiv, abs/1902.10197.

Matthias Thoms, Robert Buschauer, Michael Ameismeier, Lennart Koepke, Timo Denk, Maximilian Hirschenberger, H. Kratzat, Manuel Hayn, T. Mackens-Kiani, Jingdong Cheng, C. Sturzel, T. Frohlich, O. Berninghausen, T. Becker, F. Kirchhoff, K. Sparrer, and R. Beckmann. 2020. Structural basis for translational shutdown and immune evasion by the nsp1 protein of sars-cov-2. bioRxiv.

Theo Trouillon, Johannes Welbl, Sebastian Riedel, Eric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. ArXiv, abs/1606.06357.

Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Darrin Eide, Kathryn Funk, Rodney Michael Kinney, Ziyang Liu, William Merrill, Paul Mooney, Dewey A. Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex D. Wade, Kuansan Wang, Christopher Wilhelm, Boya Xie, Douglas M. Raymond, Daniel S. Weld, Oren Etzioni, and Sebastian Kohlmeier. 2020a. Cord-19: The covid-19 open research dataset. ArXiv.

Qingyun Wang, Manling Li, X. Wang, Nikolaus Nova Parulian, Guangxing Han, Jiawei Ma, Jingxuan Tu, Ying Lin, H. Zhang, Weili Liu, Aabhas Chauhan, Yingjun Guan, Bangzheng Li, Ruisong Li, Xiangchen Song, Huai zhong Ji, Jiawei Han, Shih-Fu Chang, J. Pustejovsky, D. Liem, A. El-Sayed, Martha Palmer, Jasmine Rah, C. Schneider, and B. Onyshkevych. 2020b. Covid-19 literature knowledge graph construction and drug repurposing report generation. ArXiv, abs/2007.00576.

Xuan Wang, Xiangchen Song, Yingjun Guan, Bangzheng Li, and Jiawei Han. 2020c. Comprehensive named entity recognition on cord-19 with distant or weak supervision. ArXiv, abs/2003.12218.

Cheng-I Wu, Pieter G Postema, Elena Arbelo, Elijah R Behr, Connie R Bezzina, Carlo Napolitano, Tomas Robyns, Vincent Probst, Eric Schulze-Bahr, Carol Ann Remme, et al. 2020. Sars-cov-2, covid-19 and inherited arrhythmia syndromes. Heart Rhythm.

Bishan Yang, Wen tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015. Embedding entities and relations for learning and inference in knowledge bases. CoRR, abs/1412.6575.

Z. Yang, Zihang Dai, Yiming Yang, J. Carbonell, R. Salakhutdinov, and Quoc V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS.