
arXiv:2110.07090v1 [cs.CL] 14 Oct 2021

MIMICause: Defining, identifying and predicting types of causal relationships between biomedical concepts from clinical notes

Vivek Khetan∗†♠  Md Imbesat Hassan Rizvi†♣  Jessica Huber‡♥  Paige Bartusiak‡♦  Bogdan Sacaleanu♠  Andrew Fano♠

♠Accenture Labs, SF  ♣Indian Institute of Science  ♥Arizona State University  ♦Duke University

{vivek.a.khetan, bogdan.e.sacaleanu, andrew.e.fano}@accenture.com
[email protected], [email protected], [email protected]

Abstract

Understanding of causal narratives communicated in clinical notes can help make strides towards personalized healthcare. In this work, MIMICause, we propose annotation guidelines, develop an annotated corpus and provide baseline scores to identify types and direction of causal relations between a pair of biomedical concepts in clinical notes, communicated implicitly or explicitly, identified either in a single sentence or across multiple sentences. We annotate a total of 2714 de-identified examples sampled from the 2018 n2c2 shared task dataset and train four different language model based architectures. Annotation based on our guidelines achieved a high inter-annotator agreement, i.e., a Fleiss’ kappa (κ) score of 0.72, and our model for identification of causal relations achieved a macro F1 score of 0.56 on test data. The high inter-annotator agreement for clinical text shows the quality of our annotation guidelines, while the provided baseline F1 score sets the direction for future research towards understanding narratives in clinical texts.

1 Introduction

Electronic Health Records (EHRs) have significant amounts of unstructured clinical notes containing rich descriptions of patients’ states as observed by healthcare professionals over time. Our ability to effectively parse and understand clinical narratives depends upon the quality of extracted biomedical concepts and semantic relations.

With the contemporary advancement in natural language processing (NLP), we have witnessed an increased interest in tasks such as extraction of biomedical concepts, patients’ data de-identification, medical question answering and relation extraction. While these tasks have improved our ability for clinical narrative understanding, identification of semantic causal relations between biomedical entities will further enhance it.

∗ Corresponding Author. † Equal contribution. ‡ Contributed during an internship at Accenture Labs, SF.

Identification of novel and interesting causal observations from clinical notes can be instrumental to a better understanding of patients’ health. It can also help us identify potential causes of diseases and determine their prevention and treatment. Despite the usefulness of identification and extraction of causal relation types, our capability to do so is limited and remains a challenge for specialized domains like healthcare.

The NLP community has been actively working on causality understanding from text and has proposed various methodologies to represent (Talmy, 1988; Wolff, 2007; Swartz, 2014; Hassanzadeh et al., 2019), as well as extract (O’Gorman et al., 2016; Mirza and Tonelli, 2014; Khetan et al., 2022), causal associations between the events expressed in natural language text. A large amount of work in causality representation focuses on decomposition of causal interactions communicated in text data, whereas the work in causality extraction mostly focuses on causal event extraction or identification of causal relations between already identified events.

There are various proposed representations for causality depending upon the domain and the task. In the healthcare domain, most of the related work can be grouped around the problem of adverse drug effect identification from biomedical scientific articles (Gurulingappa et al., 2012) or clinical notes (Johnson et al., 2016; Henry et al., 2020), and identification of cause, effect and their triggers (Mihaila et al., 2012). No work has yet tried to represent the different types of causal associations, along with their direction, between biomedical concepts communicated in clinical notes.

In this work, we fill the gap by defining types of semantic causal relations between biomedical entities, building detailed annotation guidelines and annotating a large dataset.


Table 1: MIMICause sample identification and annotations

Snippet from clinical notes:

  Admission Date: [**2161-11-22**]   Discharge Date: [**2161-11-27**]
  Date of Birth: [**2109-6-15**]   Sex: F
  Service: MEDICINE
  Allergies: Ciprofloxacin
  Attending: [**First Name3 (LF) 613**]
  Chief Complaint: Transfer from [**Hospital3 15174**] for treatment of altered mental status.
  Major Surgical or Invasive Procedure: None.
  History of Present Illness: The patient is a 52 year old woman with past medical history significant for chronic pain on narcotics and benzodiazepines, malabsorption syndrome due to complications of gastric bypass surgery, and severe osteoporosis. Three days prior to her admission to the outside hospital, the patient presented to her PCP’s office for evaluation of 20 pound weight loss that had occurred over the past 6-8 weeks. The patient was found to have a urinary tract infection. She was prescribed Ciprofloxacin and she took two doses of the antibiotic. The following day, the patient’s husband noted that his wife seemed very nervous and agitated...

Text with identified entities, and the causal relation identification based on the proposed MIMICause guidelines:

  1. “The patient is a 52 year old woman with past medical history significant for chronic pain on narcotics and benzodiazepines, malabsorption syndrome due to complications of gastric bypass surgery, and severe osteoporosis.”
     → complications of gastric bypass surgery enable malabsorption syndrome

  2. The same sentence, with a different entity pair:
     → severe osteoporosis enable malabsorption syndrome

  3. The same sentence, with the composite entity:
     → complications of gastric bypass surgery, and severe osteoporosis cause malabsorption syndrome

  4. “The patient was found to have a urinary tract infection. She was prescribed Ciprofloxacin and she took two doses of the antibiotic.”
     → Ciprofloxacin prevent urinary tract infection

Table 1 shows a snippet of a clinical note extracted from the n2c2 dataset (Henry et al., 2020), different sets of annotated biomedical entities and the causal relations based on the proposed MIMICause guidelines outlined in Section 3.1.

Even with the inherent complexities of clinical text data (e.g., required domain knowledge, shorthand used by doctors, etc.), following our proposed guidelines we achieved a high inter-annotator agreement, a Fleiss’ kappa (κ) score of 0.72, for the task of causal relation prediction.

2 Related Works

Identification of semantic causal relationships from text data is helpful in extraction of causal event chains and can also be leveraged to reason about and explain the state of events in a given narrative. Although there are various works defining causation in linguistics, philosophy and psychology, there still isn’t an agreed-upon definition for decomposition and representation of causation. While researchers from psychology and philosophy work on causality with the goal of understanding the world, most linguistic researchers focus on representing causality to understand interactions between events.

Talmy (1988) proposed force-dynamics to decompose the causal interaction between events as “letting”, “helping”, “hindering” etc. Wolff (2007) built upon force-dynamics by incorporating the theory of causal verbs and proposed the Dynamic Model of causation. Wolff categorised causation into three categories, “Cause”, “Enable” and “Prevent”, and provided a set of causal verbs to express these categories.

Dunietz et al. (2015; 2017) proposed the BECauSE Corpus to represent expressions of causation instead of real-world causation or its philosophical meaning. BECauSE 1.0 (Dunietz et al., 2015) consists of a cause span, an effect span, and a causal connective span, but this annotation schema struggles with the distinction between the cause event and the causing agent. For example, in the sentence “I prevented a fire”, based on BECauSE 1.0, the actor “I” is identified as the cause instead of an action. Later, BECauSE 2.0 (Dunietz et al., 2017) proposed the use of a means argument to capture the cause event and the causing agent without this struggle. They also proposed three different types of causation (Consequence, Motivation, and Purpose) and annotated overlapping semantic relations between cause and effect along with causal relations. In contrast, our work focuses on identification of the types of causal association between biomedical concepts as communicated in clinical notes.

More recently, Mostafazadeh et al. (2016b) built upon the work of Wolff and proposed the annotation framework CaTeRS to represent causal relations between events from a commonsense perspective. CaTeRS categorises semantic relations between events to capture causal and temporal relationships for narrative understanding on the crowd-sourced ROCStories dataset (Mostafazadeh et al., 2016a), but it has only 488 causal links. In comparison, our MIMICause dataset is built on actual clinical narratives, i.e., MIMIC-III clinical text data (Johnson et al., 2016), and has 1923 causal observations.

Another interesting decomposition of causation is proposed by Swartz (2014) in terms of necessary and sufficient conditions, but such detailed information is seldom communicated in clinical observational notes. There have been several other recent attempts at modeling and extracting causality from unstructured text. Bethard et al. (2008) created a causality dataset using the Wall Street Journal corpus and captured directionality of causal interaction with simple temporal relations (e.g., Before, After, No-Rel) but did not focus on the type of causality between the events. The work of O’Gorman et al. on Richer Event Description (RED) (Ikuta et al., 2014) describes causality types as cause and precondition and uses negative polarity to capture the context of hinder and prevent. This is in line with the annotation guidelines proposed in our current work, but we also define explicit Hinder and Prevent causality types along with directionality.

Mirza and Tonelli (2014) proposed the use of explicit linguistic markers, i.e., CLINKs (“due to”, “because of”, etc.), to extend TimeML TLINK-based temporal annotations (Pustejovsky et al., 2003) to capture causality between identified events. The resulting dataset has temporal as well as causal relations but still lacks causality types between events. Hassanzadeh et al. (2019) proposed the use of binary questions to extract causal knowledge from unstructured text data but did not focus on types and directionality of causal relations. More recently, Khetan et al. (2022) used language models combining event descriptions with the events’ context to predict causal relationships. Their network architecture wasn’t trained to predict the type or directionality of causal relations. Furthermore, they removed the directionality provided in the SemEval-2007 (Girju et al., 2007) and SemEval-2010 (Hendrickx et al., 2009) datasets to evaluate their model on a larger causal relation dataset. Our causality extraction network is built upon their methodology, i.e., Causal-BERT, but also focuses on directionality as well as the types of causality communicated in clinical narratives.

Although causality lies at the heart of biomedical knowledge, there are only a handful of works (mostly on Adverse Drug Effects (Gurulingappa et al., 2012)) extracting causality from biomedical or clinical text data. One interesting work is BioCause by Mihaila et al. (2012), which annotates existing bio-event corpora from biomedical scientific articles to capture biomedical causality. Instead of identifying the type (and direction) of causal interaction between already provided events of interest, they annotate two types of text spans, i.e., arguments and triggers. Arguments are text spans that can be represented as events with type Cause, Effect, and Evidence, while trigger spans (which can be empty) are connectives between the causal events.

Our work proposes comprehensive guidelines to represent the type and direction of causal associations between biomedical entities expressed explicitly or implicitly in the same or multiple sentences in clinical notes; this is not covered by any related work.

3 MIMICause Dataset creation

We used the publicly available 2018 n2c2 shared task dataset (Henry et al., 2020) on adverse drug events and medication extraction to build the MIMICause dataset. The n2c2 dataset was used because it is built upon the de-identified discharge summaries from the MIMIC-III clinical care database (Johnson et al., 2016) and has nine different annotations of biomedical entities, e.g., Drug, Dose, ADE, Reason, Route etc. The types of biomedical concepts/entities, with a few examples as defined in the n2c2 dataset, are presented in Table 2. The distinction between the ADE and Reason concepts is based on whether the drug was given to address the disease (Reason) or led to the disease (ADE).

However, the provided relationships in the n2c2 dataset are simply defined by identified concepts linked with related medications and hold no semantic meaning. To create the MIMICause dataset, we extracted[1] examples from each entity-pair available in the n2c2 dataset.

[1] We used the https://spacy.io/ library with the “en_core_web_sm” language model.

Table 2: Examples of biomedical concepts/entities in the 2018 n2c2 shared task dataset.

Concepts/Entities  Examples
Drug               morphine, ibuprofen, antibiotics (or “abx” as its abbreviation), chemotherapy etc.
ADE and Reason     nausea, seizures, Vitamin K deficiency, cardiac event during induction etc.
Strength           10 mg, 60 mg/0.6 mL, 250/50 (e.g. as in Advair 250/50), 20 mEq, 0.083% etc.
Form               Capsule, syringe, tablet, nebulizer, appl (abbreviation for apply topical) etc.
Dosage             Two (2) units, one (1) mL, max dose, bolus, stress dose, taper etc.
Frequency          Daily, twice a day, Q4H (every 4 Hrs), prn (pro re nata, i.e., as needed) etc.
Route              Transfusion, oral, gtt (guttae, i.e., by drops), inhalation, IV (i.e., intravenous) etc.
Duration           For 10 days, chronic, 2 cycles, over 6 hours, for a week etc.

Our final dataset has 1107 “ADE-Drug” and 1007 “Reason-Drug” examples, and 100 examples from each of the “Strength-Drug”, “Form-Drug”, “Dosage-Drug”, “Frequency-Drug”, “Route-Drug” and “Duration-Drug” entity-pairs.
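Footnote [1] names the tooling but the released extraction script is not included here; a minimal sketch of pulling the sentences that surround an entity pair with spaCy’s “en_core_web_sm” model might look as follows. The helper name and the character-offset interface are our assumptions.

```python
# A minimal sketch (not the authors' released script) of extracting the
# sentences that surround an entity pair, using spaCy per footnote [1].
import spacy

nlp = spacy.load("en_core_web_sm")

def surrounding_sentences(text: str, start: int, end: int) -> str:
    """Return the sentences of `text` overlapping the character span [start, end)."""
    doc = nlp(text)
    hits = [s for s in doc.sents if s.end_char > start and s.start_char < end]
    return " ".join(s.text for s in hits)

note = ("The patient was found to have a urinary tract infection. "
        "She was prescribed Ciprofloxacin and she took two doses of the antibiotic.")
span_start = note.find("urinary tract infection")
span_end = note.find("Ciprofloxacin") + len("Ciprofloxacin")
print(surrounding_sentences(note, span_start, span_end))  # both sentences
```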

3.1 Annotation guidelines

Our annotation guidelines are defined to represent nine semantic causal relationships between biomedical concepts/entities in clinical notes. Our guidelines have four types of causal associations, each with two directions, and a non-causal “Other” class. Based on our guidelines, causal relationships/associations exist when one or more entities affect another set of entities. The driving concept can be a single entity, such as a drug / procedure / therapy, or a composite entity, such as several drugs / procedures / therapies together.

3.1.1 Direction of causal association

The direction of causal association between entities is captured by the order of entity tags ((e1, e2) or (e2, e1)) in the defined causal relationships. Either entity can be referred to as e1 or e2. The entity that initiates or drives the causal interaction is placed first in the parentheses, followed by the resulting entity or effect.

1. Odynophagia: Was presumed due to <e2>mucositis</e2> from recent <e1>chemotherapy</e1>.

2. Odynophagia: Was presumed due to <e1>mucositis</e1> from recent <e2>chemotherapy</e2>.

Examples (1) and (2) are different because the entity references are reversed. Regardless of the entity tags, in the context of the example, “chemotherapy” is the driving entity that led to the emergence of “mucositis”. Therefore, example (1) is annotated with causal direction (e1, e2) while example (2) is annotated with (e2, e1). A small parsing sketch of this convention is given below.
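To make the tag-order convention concrete, an illustrative parser (our own sketch; `parse_tagged_example` is a hypothetical helper, not part of the released dataset tooling) could recover the driving and resulting entities from an annotated example:

```python
# Illustrative only: recover the (driver, effect) pair from the tag-order
# convention described above.
import re

def parse_tagged_example(text: str, relation: str):
    """Map a relation label such as 'Cause(e1, e2)' and a tagged text
    to an explicit (relation type, driving entity, resulting entity) triple."""
    entities = {tag: re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S).group(1)
                for tag in ("e1", "e2")}
    m = re.match(r"(\w+)\((e[12]), (e[12])\)", relation)
    rel_type, driver, effect = m.group(1), m.group(2), m.group(3)
    return rel_type, entities[driver], entities[effect]

example = ("Odynophagia: Was presumed due to <e2>mucositis</e2> "
           "from recent <e1>chemotherapy</e1>.")
print(parse_tagged_example(example, "Cause(e1, e2)"))
# -> ('Cause', 'chemotherapy', 'mucositis')
```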

3.2 Explicitness / Implicitness of the causal indication

Our guidelines also capture causality expressed both explicitly and implicitly. In example (1), the causality is expressed explicitly using the lexical causal connective “due to”, whereas in example (3), the causal association between “erythema” and “Dilantin” can only be understood based on the overall context of all the sentences.

3. patient’s wife noticed <e2>erythema on patient’s face</e2>. On [**3-27**] the visiting nurse [**First Name (Titles) 8706**] [**Last Name (Titles) 11282**] of a rash on his arms as well. The patient was noted to be febrile and was admitted to the [**Company 191**] Firm. In the EW, patient’s <e1>Dilantin</e1> was discontinued and he was given Tegretol instead.

3.2.1 (Un)-certainty of causal association

Establishing real-world causality, or the task of causal inference, is not in the scope of our current work. Based on our proposed guidelines, a causal association expressed either as speculation or with certainty is annotated similarly.

4. # <e1>Normocytic Anemia</e1> - Was 32.8 at OSH; after receiving fluids HCT has fallen further to 30. Baseline is 35 - 40. Not clinically bleeding. Perhaps due to <e2>chemotherapy</e2>.

In example (4), causality between the biomedical entities is speculated through “Perhaps”. While representing speculative causal associations can further enrich narrative understanding, it is not covered in our current work.

3.2.2 Types of causal association

This section provides detailed guidelines for the various types of causal relations (each with two directions) and one non-causal relation (“Other”), along with accompanying examples.


• Cause(e1, e2) or Cause(e2, e1) – Causal relations between biomedical entities are of these classes if the emergence, application or increase of a single or composite entity exclusively leads to the emergence or increase of one or a set of entities.

5. The patient is a 52 year old woman with past medical history significant for chronic pain on narcotics and benzodiazepines, <e2>malabsorption syndrome</e2> due to <e1>complications of gastric bypass surgery, and severe osteoporosis</e1>.

In example (5), “malabsorption syndrome” occurred due to two factors, viz. “complications of gastric bypass surgery” and “severe osteoporosis”. The entity span covers both of them, and they are considered together as a composite entity leading to “malabsorption syndrome”. Hence, example (5) is annotated as Cause(e1, e2). The annotation would have been different had these entities been considered individually.

Thus, the “Cause” category is assigned only if the driving entity is responsible in its entirety for the effect. If the specified entity is responsible for the effect only in part, then a different causal relation is defined to express this contrast.

• Enable(e1, e2) or Enable(e2, e1) – Causal relations between biomedical entities are of these classes if the emergence, application or increase of a single or composite entity leads to the emergence or increase of one or a set of entities in a setting where a number of factors are at play and the entity under consideration is one of the contributing factors.

6. The patient is a 52 year old woman with past medical history significant for chronic pain on narcotics and benzodiazepines, <e2>malabsorption syndrome</e2> due to complications of gastric bypass surgery, and <e1>severe osteoporosis</e1>.

Example (6) is the same as example (5) except for the entities under consideration. Both factors, viz. “complications of gastric bypass surgery” and “severe osteoporosis”, are contributing to the “malabsorption syndrome”. Since the example considers only “severe osteoporosis”, which is a contributing factor in part, it is annotated as Enable(e1, e2).

With the “Enable” relation type, it can easily be noted that addressing only the “complications of gastric bypass surgery” or “severe osteoporosis” will not lead to the treatment of “malabsorption syndrome”. Labelling these samples as “Cause” would have suppressed this detail and the actions taken.

• Prevent(e1, e2) or Prevent(e2, e1) – Causal relations between biomedical entities are of these classes if the emergence, application or increase of a single or composite entity exclusively leads to the eradication, prevention or decrease of one or a set of entities.

This class includes the scenario of preventing a disease or condition from occurring, as well as curing a disease or condition that has occurred.

7. with chest and <e1>abdominal pain</e1> and odynophagia who was found to have circumferential mural thrombus in the supra-renal aorta on cross sectional imaging. At that time the aortic pathology was though to be chronic. Ultimately, her pain resolved with the initiation of a <e2>PPI and GI cocktail</e2>, and was discharged home after a 3 day hospital stay.

In example (7), “PPI” and “GI cocktail” are two different entities used in conjunction to resolve the “abdominal pain”. Since the causal relation is to be identified by considering them as a composite entity, the example is labelled as Prevent(e2, e1). The annotation would have been different had these entities been considered individually.

• Hinder(e1, e2) or Hinder(e2, e1) – Causal relations between biomedical entities are of these classes if the emergence, application or increase of a single or composite entity leads to the eradication, prevention or decrease of one or a set of entities in a setting where a number of factors are at play and the entity under consideration is one of the contributing factors.

Similar to “Prevent”, this label also includes the scenario of hindering a disease or condition from occurring, as well as curing a disease or condition that has occurred.

8. with chest and <e1>abdominal pain</e1> and odynophagia who was found to have circumferential mural thrombus in the supra-renal aorta on cross sectional imaging. At that time the aortic pathology was though to be chronic. Ultimately, her pain resolved with the initiation of a <e2>PPI</e2> and GI cocktail, and was discharged home after a 3 day hospital stay.

Example (8) is the same as example (7) except for the entities under consideration. Both entities, i.e., “PPI” and “GI cocktail”, are contributing to the resolution of “abdominal pain”. Since the example considers only “PPI” individually, as a contributing factor in part, it is annotated as Hinder(e2, e1).

This distinction between “Prevent” and “Hinder” can be useful in scenarios such as identifying conditions that may require the use of multiple drugs for treatment.

• Other – We defined the “Other” class to annotate examples with non-causal interactions between biomedical entities. Examples of the “Other” class can either have no relationship between the biomedical entities of interest or some other semantic relationship that is not causal. Being non-causal in nature, the “Other” class doesn’t have a sense of direction associated with it.

Based on our guidelines, examples whose overall context was ambiguous for all the annotators, examples whose marked entities lack a direct causal association (an entity leading to a condition and that condition affecting the other entity), and samples from non-causal entity-pairs in the n2c2 dataset (i.e., Form-Drug, Route-Drug, etc.) are also labelled as “Other”.

9. Patient has tried and failed <e2>Nexium</e2>, reporting it has not helped his <e1>gastritis</e1> for 3 months.

10. Thus it was believed that the pt’s <e1>altered mental status</e1> was secondary to <e2>narcotics</e2> withdrawal.

11. Atenolol was held given patient was still on <e2>amiodarone</e2> <e1>taper</e1>.

In example (9), “Nexium” was taken to prevent/cure “gastritis”, but the expected effect is explicitly stated to be not observed. In example (10), the “altered mental status” is observed due to “narcotics withdrawal”; however, the entity span refers only to the “narcotics”. Example (11) is from the “Dosage-Drug” entity-pair of the n2c2 dataset and has no causal association between the entities.

Therefore, these examples are annotated as “Other”. Similarly, examples with entity-pairs from “Form-Drug”, “Strength-Drug”, “Frequency-Drug”, “Route-Drug” and “Duration-Drug” are also labelled as “Other”.

To summarize, we defined annotation guidelines for nine semantic relations (8 causal + “Other”) between biomedical entities expressed in clinical notes. Our annotated dataset has examples with both explicit and implicit causality, in which entities are in the same sentence or in different sentences. The final count of examples for each causal type and direction is given in Table 3.

Table 3: Causal types and their final counts

Annotation                          Count
Causal, e1 as agent, e2 as effect
  Cause(e1, e2)                     354
  Enable(e1, e2)                    174
  Prevent(e1, e2)                   261
  Hinder(e1, e2)                    154
Causal, e2 as agent, e1 as effect
  Cause(e2, e1)                     370
  Enable(e2, e1)                    176
  Prevent(e2, e1)                   249
  Hinder(e2, e1)                    185
Other                               791
Total                               2714

3.3 Inter-annotator agreement

It’s difficult to comprehend the narratives expressed in clinical notes due to the need for domain knowledge, the shorthand used by doctors, the use of abbreviations (Table 4), context spread over many sentences, as well as the explicit and implicit nature of communication.

Given the nature of our base data (MIMIC-III discharge summaries) and the critical importance of our task (causal relations between biomedical entities), three authors of this paper (all fluent in the English language and with computer science backgrounds) annotated the dataset. They followed the provided guidelines, referred to sources such as the websites of the Centers for Disease Control and Prevention (CDC[2]), the National Institutes of Health (NIH[3]) and WebMD[4] to understand domain-specific keywords or abbreviations, and had regular discussions about the annotation tasks.

Table 4: Clinical abbreviations in the dataset

Abbreviation   Expansion            Abbreviation   Expansion
b/o            because of           d/c’d          discontinued
HCV            Hepatitis C Virus    abx            antibiotics
DM             Diabetes Mellitus    c/b            complicated by
s/p            status post          h/o            history of

We performed three rounds of annotation, refining our guidelines after each round by discussing various complex examples and edge cases. We achieved an inter-annotator agreement (IAA) Fleiss’ kappa (κ) score of 0.717, which is indicative of substantial agreement and of the quality of our annotation guidelines.
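For reference, this agreement statistic can be computed with statsmodels; the following is a minimal sketch on toy data, assuming annotations are stored as one row per example and one column per annotator (the paper does not publish its exact script):

```python
# Fleiss' kappa over multi-annotator labels, sketched with statsmodels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

LABELS = ["Cause(e1,e2)", "Enable(e1,e2)", "Prevent(e1,e2)", "Hinder(e1,e2)",
          "Cause(e2,e1)", "Enable(e2,e1)", "Prevent(e2,e1)", "Hinder(e2,e1)",
          "Other"]

# toy data: 4 examples x 3 annotators, entries are indices into LABELS
ratings = np.array([[0, 0, 0],
                    [1, 1, 3],
                    [8, 8, 8],
                    [4, 4, 4]])

table, _ = aggregate_raters(ratings, n_cat=len(LABELS))  # per-category counts
print(fleiss_kappa(table))
```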

We did majority voting over the three available annotations to obtain the final gold annotations for our “MIMICause” dataset. In case of disagreements, another author of this paper acted as a master annotator, making the final decision on annotations after discussing it with the other three annotators.
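This gold-label construction amounts to a majority vote with a master-annotator fallback; a sketch, with `master_decision` as a hypothetical stand-in for that manual resolution:

```python
# Majority vote over annotator labels, falling back to the master annotator.
from collections import Counter

def gold_label(votes, master_decision=None):
    (label, count), = Counter(votes).most_common(1)
    if count > len(votes) / 2:
        return label
    return master_decision  # disagreement: resolved by the master annotator

print(gold_label(["Other", "Cause(e1,e2)", "Other"]))  # -> Other
```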

A direct comparison of our IAA score with other work is not possible due to differences in the number of annotators, annotation labels, guidelines, reported metrics, etc., for different datasets. However, for reference, we discuss IAA scores reported for the task of semantic link annotation, particularly those where κ scores were reported. Of note is the work by Mostafazadeh et al. (2016b) and their annotation framework CaTeRS for temporal and causal relations in the ROCStories corpus, where the final κ score achieved was 0.51 among four annotators. Similarly, Bethard et al. (2008) reported a κ score of 0.56 and an F-measure (F1 score) of 0.66 with two annotators labelling only two relations, viz. causal and no-rel. To compare IAA scores on clinical datasets: for temporal relations in the latest Clinical TempEval dataset (Task 12 of SemEval-2017) (Bethard et al., 2017), a final agreement (F1) score of 0.66 between two annotators was reported. However, the relation type in Clinical TempEval is temporal and not causal, making the agreement comparison harder to ascertain.

[2] https://www.cdc.gov/
[3] https://www.nih.gov/
[4] https://www.webmd.com/

4 Problem definition and Experiments

We defined our task of causality understanding as the identification of semantic causal relations between biomedical entities as expressed in clinical notes. We have a total of 2714 examples annotated with these 9 different classes (8 causal and 1 non-causal).

4.1 Problem Formalization

We pose the task of causal relation identification as a multi-class classification problem $f : (X, e_1, e_2) \mapsto y$, where $X$ is an input text sequence, $e_1$ and $e_2$ are the entities between which the relation is to be identified, and $y \in C$ is the label from the set of nine relations. These samples are taken from the MIMICause dataset $D = \{(X, e_1, e_2, y)_m\}_{m=1}^{N}$, where $N$ is the total number of samples in the dataset. The text and entities are mathematically denoted as:

$X = [x_1, x_2, \ldots, x_{n-1}, x_n]$  (1)
$e_1 = X[i:j] = [x_i, x_{i+1}, \ldots, x_j]$  (2)
$e_2 = X[k:l] = [x_k, x_{k+1}, \ldots, x_l]$  (3)

where $n$ is the sequence length; $i, j, k, l \in [1..n]$, $i \leq j$ and $k \leq l$, i.e., the entities are sub-sequences of continuous span within the text $X$. Additionally, $j < k$ or $l < i$ holds, i.e., the entities $e_1$ and $e_2$ are non-overlapping and either of them can occur first in the sequence $X$.
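As an illustration of this interface, one sample could be represented as follows; the dataclass layout and field names are our own sketch, not the released data schema:

```python
# Sketch of one MIMICause sample under the formalization above.
from dataclasses import dataclass

C = ["Cause(e1,e2)", "Enable(e1,e2)", "Prevent(e1,e2)", "Hinder(e1,e2)",
     "Cause(e2,e1)", "Enable(e2,e1)", "Prevent(e2,e1)", "Hinder(e2,e1)",
     "Other"]  # the nine relation labels

@dataclass
class Sample:
    text: str    # the sequence X = [x_1, ..., x_n], kept as raw text
    e1: tuple    # (i, j): continuous token span of e1, with i <= j
    e2: tuple    # (k, l): continuous token span of e2, with k <= l
    label: str   # y, one of the nine labels in C

s = Sample(
    text="Odynophagia: Was presumed due to mucositis from recent chemotherapy.",
    e1=(9, 9),   # "chemotherapy" (illustrative token indices)
    e2=(6, 6),   # "mucositis"
    label="Cause(e1,e2)",
)
assert s.label in C and s.e1[0] <= s.e1[1] and s.e2[0] <= s.e2[1]
```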

4.2 Models

As a baseline for this dataset, we built our causal relation classification models using two different language models[5] as text encoders (BERT-BASE and Clinical-BERT) and a fully connected feed-forward network (FFN) as the classifier head. The encoder output that captures the bi-directional context of the input text $X'$ through the [CLS] token is denoted by $H_0 \in \mathbb{R}^d$, where $d = 768$ is the dimension of the encoded outputs from BERT-BASE / Clinical-BERT.

[5] We use the implementation of all the encoders from the huggingface (Wolf et al., 2020) repository.

Figure 1: BERT/Clinical-BERT: FFN (architecture diagram).
Figure 2: BERT/Clinical-BERT: FFN with entity context (architecture diagram).

The formulations of the layers of the classifier head are given by:

$K_1 = \mathrm{dropout}(\mathrm{ReLU}(W_1 H_0 + b_1))$  (4)
$K_2 = W_2 K_1 + b_2$  (5)
$p = \mathrm{softmax}(K_2)$  (6)

where $W_1 \in \mathbb{R}^{d' \times d}$, $W_2 \in \mathbb{R}^{L \times d'}$, $d'$ was set to 256, and $L = 9$ is the number of labels.
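Equations (4)-(6) correspond to a small PyTorch module of roughly the following shape; this is a sketch in which the dropout probability is our assumption (the paper does not state it):

```python
# A PyTorch sketch of the classifier head in equations (4)-(6).
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    def __init__(self, d=768, d_hidden=256, n_labels=9, p_drop=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d, d_hidden)         # W1 in R^{d' x d}, b1
        self.dropout = nn.Dropout(p_drop)          # p_drop is an assumption
        self.fc2 = nn.Linear(d_hidden, n_labels)   # W2 in R^{L x d'}, b2

    def forward(self, h0):
        k1 = self.dropout(torch.relu(self.fc1(h0)))  # eq. (4)
        k2 = self.fc2(k1)                            # eq. (5)
        return torch.softmax(k2, dim=-1)             # eq. (6)

p = ClassifierHead()(torch.randn(2, 768))  # two [CLS] vectors -> (2, 9) probabilities
```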

Architectures with additional context, introduced between the encoder and the classifier head by concatenating the averaged representations of the two entities with the encoder output, were also tried, which led to improved results. The augmented context is denoted by:

$H_{e_1} = \frac{1}{j - i + 1} \sum_{t=i}^{j} H_t$  (7)
$H_{e_2} = \frac{1}{l - k + 1} \sum_{t=k}^{l} H_t$  (8)
$H' = \mathrm{concat}(H_0, H_{e_1}, H_{e_2})$  (9)
$H_0 = \mathrm{dropout}(\mathrm{ReLU}(W_0 H' + b_0))$  (10)

where $i$, $j$, $k$ and $l$ are the start and end indices of the entities, $H_t \in \mathbb{R}^d$, $H' \in \mathbb{R}^{3d}$, $W_0 \in \mathbb{R}^{d \times 3d}$, and the augmented context is assigned back to $H_0$ for feeding into the classifier head. The architecture details without and with the entity context augmentation are shown in Figures 1 and 2, respectively. An overview of the models is given below:

• Encoder (BERT-BASE / Clinical-BERT) with feed-forward network (FFN) – The overall architecture, as shown in Figure 1, is a simple feed-forward network built on top of a pre-trained encoder. The input sentence is fed as a sequence of tokens to the encoder, with encoder-based special tokens such as [CLS] and entity-tagging tokens such as <e1>, </e1>. The overall sentence context is passed through the fully connected feed-forward network to obtain class probabilities, as formulated in equations (4)-(6).

In addition to the BERT-BASE encoder, we also used the Clinical-BERT encoder to obtain contextualised representations of our input examples. While BERT is pre-trained on standard corpora such as Wikipedia, Clinical-BERT is pre-trained on clinical notes and provides more relevant representations for our dataset. Replacing BERT-BASE with Clinical-BERT showed a significant increase in the evaluation metrics.

• Encoder (BERT-BASE / Clinical-BERT) with entity context augmented feed-forward network (FFN) – The overall architecture is shown in Figure 2. While the input mechanism with special tokens, the encoding and the classifier head remain the same as discussed earlier, this architecture also enriches the sentence context with both entities’ contexts, as formulated in equations (7)-(10). The special tokens around the entities (<e1>, </e1>, <e2>, and </e2>) are used to identify the tokens belonging to each individual entity, which are then used to obtain the averaged context vector for each entity. These are concatenated with the overall sentence context and fed to a fully connected feed-forward network to predict the type of causal interaction expressed in the text.

Similar to our previous discussion, in addition to the BERT-BASE encoder, a pre-trained Clinical-BERT encoder was also used, which resulted in the highest evaluation metrics. A minimal sketch of the entity context augmentation is given after this list.
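As referenced in the second bullet, equations (7)-(10) amount to the following PyTorch sketch (single-example form, with known token spans; the dropout probability is again our assumption):

```python
# A PyTorch sketch of the entity context augmentation in equations (7)-(10):
# average the encoder states over each entity span, concatenate with the
# [CLS] state, and project back to d before the classifier head.
import torch
import torch.nn as nn

class EntityContext(nn.Module):
    def __init__(self, d=768, p_drop=0.1):
        super().__init__()
        self.proj = nn.Linear(3 * d, d)   # W0 in R^{d x 3d}, b0
        self.dropout = nn.Dropout(p_drop)

    def forward(self, hidden, e1_span, e2_span):
        # hidden: (seq_len, d) encoder states for one example; index 0 is [CLS]
        i, j = e1_span
        k, l = e2_span
        h_e1 = hidden[i:j + 1].mean(dim=0)                  # eq. (7)
        h_e2 = hidden[k:l + 1].mean(dim=0)                  # eq. (8)
        h_cat = torch.cat([hidden[0], h_e1, h_e2], dim=-1)  # eq. (9)
        return self.dropout(torch.relu(self.proj(h_cat)))   # eq. (10), new H0

h0 = EntityContext()(torch.randn(128, 768), (5, 7), (20, 21))  # shape (768,)
```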


Table 5: Macro F1 score on the test, val and train datasets

Model                               Test   Val    Train
BERT+FFN                            0.23   0.25   0.29
Clinical-BERT+FFN                   0.27   0.31   0.34
BERT+entity context+FFN             0.54   0.27   0.56
Clinical-BERT+entity context+FFN    0.56   0.30   0.70

4.3 Results and analysis

We trained all our models on a varied set of hyper-parameters and chose the best model from the training epochs based on the maximum F1 score on the validation set. For the BERT+FFN model, we achieved the best scores with a batch size of 128 and a learning rate of 5e-5. The other three models achieved the reported scores with a training batch size of 32 and a learning rate of 1e-3. All models were trained until convergence, with early stopping after 7 epochs with no decrease in validation loss. We used the AdamW optimizer with cross-entropy loss for all models.
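A runnable sketch of this training regime on stand-in data; the toy linear model, random batches and epoch cap are placeholders for the actual encoder and dataset:

```python
# Sketch of the training setup: AdamW, cross-entropy, early stopping
# on validation loss with patience 7.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(768, 9)  # stand-in for the full encoder+FFN network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # 5e-5 for BERT+FFN
criterion = nn.CrossEntropyLoss()

# toy batches standing in for the real (tokenized) train and val splits
train = [(torch.randn(32, 768), torch.randint(0, 9, (32,))) for _ in range(4)]
val = [(torch.randn(32, 768), torch.randint(0, 9, (32,))) for _ in range(2)]

def val_loss(m):
    with torch.no_grad():
        return sum(criterion(m(x), y).item() for x, y in val) / len(val)

best, bad_epochs, patience = float("inf"), 0, 7
for epoch in range(100):
    for x, y in train:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    loss = val_loss(model)
    if loss < best:
        best, bad_epochs = loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping after 7 flat epochs
            break
```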

Table 5 shows the performance measures of our various models on the train/val/test sets. Using only the BERT-BASE encoder for relation identification doesn’t yield high scores, but concatenating the entity context to BERT’s encoded sentence output resulted in significant improvement. Using Clinical-BERT as the base encoder resulted in additional improvements, and combining entity contexts with Clinical-BERT as the base encoder resulted in the highest F1 score. While Clinical-BERT was trained on the MIMIC dataset and might have seen input sequences from the test dataset, it has not seen the newly defined causal classes for those sequences.
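The reported metric is the macro-averaged F1 over the nine classes; with scikit-learn it reduces to a one-liner (the label vectors below are illustrative, not our actual predictions):

```python
# Macro F1 as used in Table 5, computed with scikit-learn on toy labels.
from sklearn.metrics import f1_score

y_true = [0, 8, 4, 2, 8, 1]   # gold class indices (toy values)
y_pred = [0, 8, 5, 2, 8, 1]   # predicted class indices (toy values)
print(f1_score(y_true, y_pred, average="macro"))
```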

5 Conclusion

In this work, we proposed annotation guidelines to capture the types and direction of causal associations, annotated a dataset of 2714 examples from de-identified clinical notes and built models to provide baseline scores for our dataset.

Even with the inherent complexities of clinical text data, following the meticulously defined annotation guidelines, we were able to achieve a high inter-annotator agreement, i.e., a Fleiss’ kappa (κ) score of 0.72. Building various network architectures on top of language models, we achieved an accuracy of 0.65 and a macro F1 score of 0.56.

In the future, we plan to extend our annotation guidelines to jointly represent temporal and causal associations in clinical notes. An end-to-end pipeline built with models for patients’ data de-identification, biomedical entity extraction, and detailed causal and temporal representation between them will help us understand the ordering of various causal associations and enhance our capability for understanding clinical narratives.

References

Steven Bethard and James H. Martin. 2008. Learning semantic links from a corpus of parallel temporal and causal relations. In ACL.

Steven Bethard, Guergana Savova, Martha Palmer, and James Pustejovsky. 2017. SemEval-2017 task 12: Clinical TempEval. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017).

Jesse Dunietz, Lori S. Levin, and J. Carbonell. 2015. Annotating causal language using corpus lexicography of constructions. In LAW@NAACL-HLT.

Jesse Dunietz, Lori S. Levin, and J. Carbonell. 2017. The BECauSE corpus 2.0: Annotating causality and overlapping relations. In LAW@ACL.

R. Girju, Preslav Nakov, Vivi Nastase, Stan Szpakowicz, Peter D. Turney, and Deniz Yuret. 2007. SemEval-2007 task 04: Classification of semantic relations between nominals. In SemEval@ACL.

Harsha Gurulingappa, A. Rajput, A. Roberts, J. Fluck, M. Hofmann-Apitius, and L. Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics, 45(5):885-892.

O. Hassanzadeh, D. Bhattacharjya, M. Feblowitz, Kavitha Srinivas, M. Perrone, Shirin Sohrabi, and Michael Katz. 2019. Answering binary causal questions through large-scale text mining: An evaluation using cause-effect pairs from human experts. In IJCAI.

Iris Hendrickx, S. Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Seaghdha, Sebastian Pado, M. Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In SemEval@ACL.

Sam Henry, K. Buchan, Michele Filannino, A. Stubbs, and Ozlem Uzuner. 2020. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. Journal of the American Medical Informatics Association: JAMIA.

Rei Ikuta, Will Styler, Mariah Hamang, Timothy J. O'Gorman, and Martha Palmer. 2014. Challenges of adding causation to richer event descriptions. In EVENTS@ACL.

Alistair E. W. Johnson, T. Pollard, Lu Shen, Li-wei H. Lehman, M. Feng, M. Ghassemi, Benjamin Moody, Peter Szolovits, L. Celi, and R. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific Data, 3.

Vivek Khetan, Roshni Ramnani, Mayuresh Anand, Subhashis Sengupta, and Andrew E. Fano. 2022. Causal BERT: Language models for causality detection between events expressed in text. In Intelligent Computing, pages 965-980, Cham. Springer International Publishing.

C. Mihaila, Tomoko Ohta, Sampo Pyysalo, and S. Ananiadou. 2012. BioCause: Annotating and analysing causality in the biomedical domain. BMC Bioinformatics, 14:2.

Paramita Mirza and Sara Tonelli. 2014. An analysis of causality between events and its relation to temporal information. In COLING.

N. Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, P. Kohli, and James F. Allen. 2016a. A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL.

N. Mostafazadeh, Alyson Grealish, Nathanael Chambers, James F. Allen, and Lucy Vanderwende. 2016b. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures. In EVENTS@HLT-NAACL.

Timothy J. O'Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation.

J. Pustejovsky, J. Castano, R. Ingria, R. Saurí, R. Gaizauskas, A. Setzer, G. Katz, and Dragomir R. Radev. 2003. TimeML: Robust specification of event and temporal expressions in text. In New Directions in Question Answering.

Norman Swartz. 2014. The concepts of necessary conditions and sufficient conditions.

Leonard Talmy. 1988. Force dynamics in language and cognition. Cognitive Science, 12(1):49-100.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

P. Wolff. 2007. Representing causation. Journal of Experimental Psychology: General, 136(1):82-111.