The Impact of Automatic Pre-annotation in Clinical Note Data Element Extraction
— the CLEAN Tool
Tsung-Ting Kuo, PhD1, Jina Huh, PhD1, Jihoon Kim, MS1, Robert El-Kareh, MD, MS, MPH1,
Siddharth Singh, MD1, Stephanie Feudjio Feupe, MSc1, Vincent Kuri, MS2, Gordon Lin, MS2,
Michele E. Day, PhD1, Lucila Ohno-Machado, MD, PhD1, and Chun-Nan Hsu, PhD1,*
1 UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
2 Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
* 9500 Gilman Dr, San Diego, CA 92093, USA; [email protected]; +1 (858) 822-4931.
Keywords: Natural Language Processing and Text Mining, Human-Computer Interaction, Usability Testing, Pre-Annotation, Clinical Note Data Element Extraction
Word count: 4,000
ABSTRACT
Objective. Annotation is expensive but essential for clinical note review and clinical natural language processing (cNLP). However, the extent to which computer-generated pre-annotation benefits human annotation is still an open question. Our study introduces CLEAN (CLinical note rEview and ANnotation), a pre-annotation-based cNLP annotation system designed to improve clinical note annotation of data elements, and comprehensively compares CLEAN with the widely used annotation system Brat Rapid Annotation Tool (BRAT).
Materials and Methods. CLEAN includes an ensemble pipeline (CLEAN-EP) with a newly developed annotation tool (CLEAN-AT). A domain expert and a novice user/annotator participated in a comparative usability test by tagging 87 data elements related to Congestive Heart Failure (CHF) and Kawasaki Disease (KD) cohorts in 84 public notes.
Results. CLEAN achieved a higher note-level F1-score (0.896) than BRAT (0.820), with a significant difference in correctness (P-value < 0.001); the most strongly associated factor was system/software (P-value < 0.001). No significant difference (P-value 0.188) in annotation time was observed between CLEAN (7.262 minutes/note) and BRAT (8.286 minutes/note). The time difference was mostly associated with note length (P-value < 0.001) and system/software (P-value 0.013). The expert reported CLEAN to be useful/satisfactory, while the novice reported slight improvements.
Discussion. CLEAN improves the correctness of annotation and increases usefulness/satisfaction at the same level of efficiency. Limitations include the untested impact of the pre-annotation correctness rate, the small sample size, the small number of users, and a gold standard with limited validation.
Conclusion. CLEAN with pre-annotation can be beneficial for an expert dealing with complex annotation tasks involving numerous and diverse target data elements.
1. BACKGROUND AND SIGNIFICANCE
Clinical notes with unstructured narrative, such as progress notes, radiology reports, and discharge summaries, are one of the most information-rich, under-utilized sources of healthcare data.[1] Critical aspects of clinical quality are often described in the free-text notes of electronic health record (EHR) systems. These important aspects can be used to improve healthcare delivery/management, clinical/translational research, and ultimately patient health.
Clinical Natural Language Processing (cNLP) is dedicated to developing tools and systems to extract such useful information from medical text. Widely used cNLP tools include cTAKES (clinical Text Analysis and Knowledge Extraction System),[2] MetaMap,[3] MedEx,[4] CLAMP (Clinical Language Annotation, Modeling, and Processing Toolkit),[5] Vitals,[6] EFEx,[7] KD-NLP,[8] and a few others.[9-11] These cNLP tools can extract various types of information such as conditions and medications.
Annotation in cNLP refers to the process of manually identifying the mentions of data elements of target signs, symptoms, events, etc. to be extracted in clinical notes. Annotation, although imperfect, is an important process that provides: (1) quality control for final cNLP output data, (2) gold standards to evaluate the performance of cNLP tools, and (3) training examples to develop and improve cNLP tools. Specifically, training examples are essential for cNLP tools based on supervised machine learning, as well as for the development of rule-based cNLP tools.[9-20]
Annotation, however, is also the bottleneck of the whole development process of cNLP tools,[21] especially when numerous and diverse data elements are to be extracted.[1] From our experience in the patient-centered SCAlable National Network for Effectiveness Research (pSCANNER) project,[22] an experienced clinical annotator required an average of 15 to 30 minutes to annotate a clinical note for an annotation task involving the tagging of 41 data elements.
Intuitively, pre-annotation by a cNLP tool before a manual review might help improve the correctness and efficiency of the annotation process.[12] In pre-annotation, the mentions of the target data elements are identified by a cNLP tool. These elements serve as suggestions to a human annotator, so that the annotator can review and revise the pre-annotated mentions ("pre-annotations") instead of starting the annotation process from scratch. However, previous cNLP studies of pre-annotation showed mixed and inconsistent results in terms of correctness and efficiency on tasks that included named entity recognition,[12,13] de-identification,[14,15] patient record chart review,[16,17] corpus creation,[18] and NLP output validation.[19,20] None of the studies considered the tailored design of a user interface (UI) to take advantage of pre-annotation. In the study of named entity recognition, some authors reported positive results of pre-annotation[12]: "Time savings [of pre-annotations] ranged from 13.85% to 21.5% per entity. Inter-annotator agreement (IAA) ranged from 93.4% to 95.5%. … The time savings were statistically significant. Moreover, the pre-annotation did not reduce the IAA or annotator performance." However, other authors provided contrasting outcomes[13]: "We found little benefit to [pre-annotating] the corpus with a third-party named entity recognizer … the annotators [who were] given the [MetaMap Transfer (MMTx) tool, now MetaMap][23] annotations, A1 and A3, annotated slower than the other two annotators, A2 and A4. … There was also no clear trend that the MMTx annotation improved pair-wise IAA between individuals." Most studies only focused on a limited number of target data elements to be extracted. For example, the authors of these studies[12,13] named 5 and 10 concepts, respectively.
2. OBJECTIVE
Our goal was to take advantage of pre-annotation to improve the annotation process of clinical notes when numerous and diverse data elements necessary to support phenotyping and cohort identification are involved. In other words, we sought to design, implement, and evaluate a software system that leverages pre-annotation to improve the annotation quality of clinical notes with a large number (>50) of target data elements. This situation commonly occurs in cNLP use cases for phenotyping and cohort identification.[24-26] To this end, we developed the CLinical note rEview and ANnotation (CLEAN) cNLP system and comprehensively compared CLEAN with the widely used annotation system Brat Rapid Annotation Tool (BRAT).[21]
3. MATERIALS AND METHODS
3.1 The CLEAN cNLP System
The overall architecture of the CLEAN cNLP system is shown in Figure 1. The input to CLEAN includes a study plan specifying the target disease/condition, categorized data elements, and criteria used to select a set of clinical notes from an institutional electronic health record (EHR) data warehouse. Since clinical notes contain protected health information (PHI), we deposit them in a secure, privacy-preserving computation environment that also stores data element definitions and categories.
The CLEAN ensemble pipeline (CLEAN-EP) pre-annotates the clinical notes using Union,[1] which extracts a data element if any of the cNLP tools identifies it. Union integrates four cNLP tools (cTAKES,[2] MetaMap,[3] MedEx,[4] and KD-NLP[8]), as well as our newly developed cNLP tool ROCK (Rules for Obesity, Congestive heart failure, and Kawasaki disease), described in Appendix A.1. We previously discovered that Union showed the highest recall/sensitivity among all of the ensemble methods in our pilot study.[1] Thus, when using Union as the ensemble method, annotators are provided with high-coverage pre-annotated data element mentions in the clinical note text, instead of searching through the full text to identify missing elements. That is, annotators can focus on correcting the pre-annotations instead of adding missing mentions. A benefit of applying an ensemble pipeline like CLEAN-EP is improved system flexibility to reuse and share cNLP tools.[27-32] Details about CLEAN-EP are described in Appendix A.1.
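The Union strategy above can be sketched in a few lines. This is an illustrative sketch, not the actual CLEAN-EP code: tool outputs are modeled as sets of (start, end, concept ID) tuples, and the tool names and example concept IDs are hypothetical. A mention is kept if any component tool identifies it, which maximizes recall at some cost to precision.

```python
# Illustrative sketch of the Union ensemble: keep a mention if ANY tool
# found it. Mentions are modeled as (start_offset, end_offset, concept_id).
def union_ensemble(tool_outputs):
    """Merge per-tool mention sets by set union."""
    merged = set()
    for mentions in tool_outputs:
        merged |= set(mentions)
    return merged

# Hypothetical outputs from the component tools for one note.
ctakes = {(10, 13, "C0042834")}
metamap = {(10, 13, "C0042834"), (40, 45, "C0018802")}
medex = set()
kd_nlp = {(40, 45, "C0018802")}

pre_annotations = union_ensemble([ctakes, metamap, medex, kd_nlp])
# Contains both mentions even though no single tool found them all.
```

Because the merge is a plain set union, adding or swapping a component tool requires no change to the ensemble logic, which is one reason such a pipeline is easy to reuse and share.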
Figure 1. The CLinical note rEview and ANnotation (CLEAN) clinical Natural Language Processing (cNLP) system. After the study plan was confirmed, the inputs were the selected data elements and the queried clinical notes from the electronic health record (EHR) data warehouse, both stored on a Virtual Machine (VM). Then, the built-in CLEAN ensemble pipeline (CLEAN-EP) pre-annotated the notes, followed by the user's review using the CLEAN annotation tool (CLEAN-AT). The outputs were the reviewed annotations of the data elements on the clinical notes, and CLEAN stored the final results back on the VM.
Finally, an annotator reviews and corrects the pre-annotations using the CLEAN annotation tool (CLEAN-AT). CLEAN-AT saves the final resulting annotations and allows annotators to re-check and refine the annotations if required. Details of CLEAN-AT are given in Section 3.2.
3.2 The CLEAN Annotation Tool (CLEAN-AT)
The UI design of CLEAN-AT is illustrated in Figure 2, with the following main features:
• Annotation editor (right panel). This main working space of CLEAN-AT shows the clinical note as well as all annotations on the note. A user can edit the annotations and review results on this panel. In this study, the terms user and annotator are used interchangeably.
• Data element browser (left panel). Clicking a data element or category in the left panel quickly identifies all annotated mentions of that data element or category in the clinical note shown in the annotation editor. In this example (Figure 2), "Natriuretic Peptides" was clicked in the data element browser to identify the synonym mention "BNP" in the clinical note text. Conversely, the mentions in the clinical note could be clicked in the annotation editor to highlight the corresponding data element in bold in the data element browser.
Figure 2. The CLEAN annotation tool (CLEAN-AT). The left panel (data element browser) shows all the data elements, while the right panel (annotation editor) is for annotation adding, deleting, and modifying via the pop-up menu (data element reminder). The bottom panel (function buttons) allows users to undo/redo editing, save progress, save the note as complete, and proceed to the next note. The clinical note shown here is from publicly available MT Sample notes.[33]
• Data element reminder (a popup menu while tagging the text). After right-clicking on a selected mention in the annotation editor, a popup menu appears with a list of all data elements grouped by category. A user can use the keyboard as an auto-completion shortcut to filter possible data elements and thus speed up the annotation editing process. For the example shown in Figure 2, if the user types "n," CLEAN-AT would display "Natriuretic Peptides" and "NT-proBNP," the data elements whose names start with "n." The user can then choose from the filtered data elements to annotate or re-annotate quickly.
• Undo/redo (function buttons in the bottom panel). CLEAN-AT supports an unlimited number of undo and redo operations, which allows users to edit and delete quickly, knowing that they can easily recover from mistakes. Note that deletions are also considered operations, and therefore users can undo/redo deletions. Supporting undo/redo operations follows an important principle of user interface design to reduce the distress of new users.[34-36] However, few existing annotation tools for cNLP support undo/redo, because the program must record every user interaction. Also, the program must maintain a stack data structure to save the entire history of previous user interactions, which must be accessible any time the user attempts to undo/redo the last operations. If the software did not include the undo/redo function in the initial design, it would be difficult to add this feature because the whole program might need to be thoroughly rewritten. CLEAN-AT implements this important feature based on the JavaScript library React.[37]
• Overview of the completion status of clinical notes. As shown in Figure A.2 in Appendix A, the user can start a new annotation process, continue an incomplete annotation process, or recheck a previously completed annotation process, for any clinical note in a target corpus. Although the overview page shows the overall information for all notes, CLEAN-AT brings up the next note for annotation after the user completes the annotation of a clinical note, instead of bringing up this overview page. This feature was designed to effectively keep a user engaged.
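The undo/redo mechanism described above can be sketched with two stacks. This is an illustrative Python analogue, not the actual CLEAN-AT implementation (which is a React application); the class and operation names are hypothetical. Every edit, including a deletion, is an operation pushed onto an undo stack, and redo replays operations popped by undo.

```python
# Minimal two-stack undo/redo over annotation operations. An operation is
# a ("add" | "remove", annotation) pair; undo applies the inverse operation.
class AnnotationHistory:
    def __init__(self):
        self.annotations = set()
        self._undo, self._redo = [], []

    def _apply(self, op):
        kind, ann = op
        (self.annotations.add if kind == "add" else self.annotations.discard)(ann)

    def do(self, op):
        self._apply(op)
        self._undo.append(op)
        self._redo.clear()  # a new edit invalidates the redo branch

    def undo(self):
        if self._undo:
            kind, ann = self._undo.pop()
            self._apply(("remove" if kind == "add" else "add", ann))
            self._redo.append((kind, ann))

    def redo(self):
        if self._redo:
            op = self._redo.pop()
            self._apply(op)
            self._undo.append(op)

h = AnnotationHistory()
h.do(("add", ("BNP", "Natriuretic Peptides")))
h.do(("remove", ("BNP", "Natriuretic Peptides")))  # deletions are operations too
h.undo()  # the deletion is undone, so the annotation is back
```

Because the full operation history lives on the stacks, this design supports unlimited undo/redo, but it also shows why retrofitting the feature is hard: every mutation must flow through `do()` rather than touching the annotation set directly.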
3.3 Study Material
We evaluated the performance of CLEAN using two cohorts: (1) Congestive Heart Failure (CHF), as an exemplar of a major health condition in the U.S., where the prevalence of heart disease is 5.7M;[38] and (2) Kawasaki Disease (KD), which represents a rare yet acute disease with an estimated 5,447 U.S. hospitalizations in 2009.[39] Both conditions are use-case conditions for the pSCANNER clinical data research network.[22] We used the clinical note pool and target data elements described in our previous work[1] with newly annotated gold standards, which are further described later in this subsection.
Our pool of clinical notes contained notes from public domain datasets,[1] including MT Samples,[33] i2b2 Challenges 2009–2012 and 2014,[40-51] and ShARe/CLEF eHealth 2013 Tasks 1 and 2 (which are now part of the MIMIC III clinical database[52]) and 2014 Task 1.[53-55] We collected the corpus of notes for CHF or KD by selecting notes that contained "congestive heart failure" or "Kawasaki OR (fever AND rash AND red AND child)," respectively. Because KD is a relatively rare disease,[8,39] we included MT Sample clinical notes containing "fever" to increase the corpus size for KD. The resulting corpus consisted of 635 notes for CHF and 33 notes for KD.
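The keyword criteria above can be expressed as simple boolean tests over the note text. This is a hedged sketch with hypothetical helper names; the actual selection ran as queries over the pooled public datasets rather than the toy list used here.

```python
# Sketch of the keyword-based corpus selection. The KD criterion
# "Kawasaki OR (fever AND rash AND red AND child)" becomes a boolean
# test over the lowercased note text.
def matches_chf(text):
    return "congestive heart failure" in text.lower()

def matches_kd(text):
    t = text.lower()
    return "kawasaki" in t or all(w in t for w in ("fever", "rash", "red", "child"))

# Hypothetical example notes standing in for the pooled datasets.
notes = [
    "Pt with congestive heart failure, EF 25%.",
    "5-day fever, polymorphous rash, red cracked lips in a young child.",
    "Routine follow-up, no acute complaints.",
]
chf_corpus = [n for n in notes if matches_chf(n)]
kd_corpus = [n for n in notes if matches_kd(n)]
```

In the study itself, notes matching these filters were further augmented for KD by a manual review of MT Samples notes containing "fever," since keyword filters alone left the KD corpus small.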
In our previous study,[1] CHF and KD subject matter experts had identified the target data elements, which were then mapped to standardized concept IDs defined in SNOMED-CT,[56] LOINC,[57] RxNorm,[58] or UMLS.[59] After mapping, the normalized output data elements of the cNLP tools were ready to serve as inputs to the CLEAN-EP ensemble pipeline.
For each clinical note, an experienced physician annotated the mentions of each data element as the gold standard. During the annotation process, the physician followed the cNLP annotation guidelines we developed. The physician used the CLEAN-EP pre-annotations and the CLEAN-AT annotation tool to annotate the gold standard data elements.
3.4 Study Design
The target users of CLEAN are clinical researchers, front-line clinicians, and their supporting staff members, who would like to identify a cohort of patients from an EHR system for their scientific and/or quality improvement projects. Therefore, our inclusion criteria for a test user were: (1) an employee or a student of UCSD who was ≥18 years old, and (2) who had worked or was soon going to work on cohort identification using annotated clinical notes. Based on these criteria, two approved test users with the required training certificates (HIPAA¹ and CITI²) were selected to participate in our study: a practicing physician with clinical training (a domain expert), and a graduate student in the Department of Biomedical Informatics with basic biomedical knowledge (a domain novice). Our annotation tasks involved keyboard, mouse, audio, and video recording to precisely log all interactions, with timestamps, between the test user and the software. The Institutional Review Board (IRB) at UCSD approved this study as Project Number 160410 on April 21, 2016.
A two-sample paired t-test was used to compare the mean difference in correctness and efficiency measured with the two systems on the same clinical notes. The assumed effect size was 0.5, computed from the means of the correctness metric (F1-score in our study; 0.8 and 0.7 for CLEAN and BRAT, respectively) and a standard deviation of 0.2 for both systems, based on the experimental results of our previous work.[1] The estimated statistical power was 78%, with effect size 0.5, significance level 0.05, and per-group size 32.[60] Our study therefore had the required sample size to evaluate CLEAN and BRAT with sufficient statistical power: at least 32 clinical notes for each cohort.
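The power figure above can be sanity-checked by hand. The sketch below uses a normal approximation (via `math.erf`) for the paired t-test power, which slightly overestimates the exact noncentral-t power of about 78% reported in the text; it is a back-of-the-envelope check, not the study's actual power calculation.

```python
# Approximate power of a paired t-test: effect size d = 0.5, two-sided
# alpha = 0.05, n = 32 notes per group, using a normal approximation.
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def paired_t_power_approx(d, n, alpha=0.05):
    z_crit = 1.959964          # two-sided 5% normal critical value
    ncp = d * math.sqrt(n)     # noncentrality parameter
    return normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)

power = paired_t_power_approx(0.5, 32)  # ~0.81 by this approximation
```

Using the t distribution with 31 degrees of freedom instead of the normal approximation brings the estimate down toward the 78% the authors report.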
Based on the estimation of sample size, Figure 3 illustrates our process to select the clinical notes in the test datasets from the corpus of notes, which contained 635 notes for CHF and 33 notes for KD. First, 32 KD notes were randomly selected, such that the notes could be divided equally between CLEAN and BRAT. Then, 52 clinical notes for the CHF cohort were randomly selected, for a total of 84 clinical notes as the test dataset for our comparative studies (Table 1).
¹ Health Insurance Portability and Accountability Act
² Collaborative Institutional Training Initiative
The subject matter experts had enumerated 87 data elements for the CHF (50) and KD (37) cohorts.[1] Within the 84 test notes, the experienced physician annotated 1,542 mentions of these 87 data elements as our gold standard. We split the notes in the test dataset into 6,702 sentences using the Stanford CoreNLP library.[61] The statistics of the test dataset are shown in the last row of Table 1.
Figure 3. The clinical note test dataset selection process for congestive heart failure (CHF) and Kawasaki disease (KD) conditions.
Table 1. Statistics of the total test dataset (the last row) and the two test sets (ID = 1 and 2) of clinical notes. One test set (ID = 1) was used to evaluate BRAT, while the other (ID = 2) was used to evaluate CLEAN. Concept frequency indicates the total number of gold standard annotations in the clinical notes in a test set.

Test Set ID | Evaluated Software | Notes | Words | Average Words per Note | Standard Deviation Words per Note | Concept Frequency
1 | BRAT | 42 | 44,219 | 1,053 | 676 | 724
2 | CLEAN | 42 | 48,998 | 1,167 | 706 | 818
Total (Test Dataset) | | 84 | 93,217 | 1,110 | 689 | 1,542
[Figure 3 flowchart, from clinical note pool to final test dataset: Public domain clinical notes collected from MT Samples, i2b2 Challenges 2009–2012 and 2014, and ShARe/CLEF eHealth 2013 Tasks 1 and 2 and 2014 Task 1 (N = 6,135). Clinical note corpus per condition: notes for the CHF cohort filtered using "congestive heart failure" (N = 635); notes for the KD cohort filtered using "Kawasaki OR (fever AND rash AND red AND child)," as well as manual review using "fever" for the MT Samples dataset (N = 33). Clinical note test dataset per condition: sampled randomly for a total of ≥32 clinical notes per cohort to achieve 78% statistical power with a significance level of 0.05 and effect size of 0.5 (CHF N = 52; KD N = 32). Clinical note test dataset: combined as the test dataset for the comparative studies (N = 84).]
3.5 Evaluation Protocol
The 84 clinical notes in the test dataset were randomly divided into two test sets, each with 42 notes (26 CHF and 16 KD). One test set was used to evaluate BRAT, while the other was used to evaluate CLEAN. Table 1 describes the statistics of the two test sets.
Both users evaluated each software package with the same test set and the same order of clinical notes, but the order in which the software packages were evaluated was randomly assigned. The randomization assigned user-1 to annotate the notes using CLEAN first and then BRAT for CHF (but BRAT first for KD), while user-2 annotated the notes using BRAT first and then CLEAN for CHF (but CLEAN first for KD).
Each evaluation session lasted about 1.5 to 2 hours. Before the annotation, both users attended a 20-minute training session on the cNLP annotation guidelines, the data element tables, and the usage of each annotation software package. The users could rest at any time during the annotation process or completely cease the session as needed.
3.6 Evaluation Measurements
The comparative study of CLEAN and BRAT used three types of measurements: correctness, efficiency, and usefulness/satisfaction, which are discussed in Sections 3.6.1, 3.6.2, and 3.6.3, respectively.
3.6.1 Correctness Measurements
The gold standards allowed us to assess the quality of the user-reviewed annotations. For this purpose, our evaluation metric was the F1-score, computed at the note level and the sentence level. The detailed definitions of the note-level and sentence-level F1-scores are described in Appendix A.2.
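As an illustration of one plausible reading of a note-level F1-score (the authoritative definitions are in Appendix A.2 of the paper), the sketch below compares, per note, the set of data elements annotated by the user against the gold standard, then averages the per-note F1 across notes. All note IDs and element names are hypothetical.

```python
# Hedged sketch of a note-level F1 computation: per-note set comparison
# of annotated data elements, averaged over notes.
def f1(gold, predicted):
    if not gold and not predicted:
        return 1.0
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def note_level_f1(gold_by_note, pred_by_note):
    scores = [f1(gold_by_note[n], pred_by_note.get(n, set()))
              for n in gold_by_note]
    return sum(scores) / len(scores)

gold = {"note1": {"BNP", "Edema"}, "note2": {"Fever"}}
pred = {"note1": {"BNP"}, "note2": {"Fever", "Rash"}}
score = note_level_f1(gold, pred)
```

A sentence-level variant would apply the same set comparison within each sentence rather than each note, which is why sentence-level scores are typically lower: a mention found in the right note but the wrong sentence no longer counts.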
Two mixed-effects models were considered, one for each response variable (the note- and sentence-level F1-scores), with computation of the intraclass correlation (ICC). A stepwise backward elimination method was applied to select variables from a list of candidates consisting of software (CLEAN versus BRAT), condition (CHF versus KD), length (word count of the clinical note), and concept frequency (number of gold standard annotations in the clinical note), with reviewer (graduate student versus physician) as a random effect and a significance level of 0.05.
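The ICC quantifies how much of the score variance is attributable to the annotator. As an illustration only (the paper's ICCs come from the fitted mixed-effects models, not this estimator), the sketch below computes a one-way ICC(1) via the classic ANOVA estimator on toy scores, with annotator as the grouping factor; all numbers are made up.

```python
# Illustrative one-way ICC(1): (MSB - MSW) / (MSB + (k - 1) * MSW),
# where MSB/MSW are between- and within-group mean squares and k is the
# (equal) group size. Groups = annotators here.
from statistics import mean

def icc1(groups):
    """groups: list of lists of scores, one list per annotator."""
    k = len(groups[0])  # assumes equal group sizes
    grand = mean(s for g in groups for s in g)
    msb = k * sum((mean(g) - grand) ** 2 for g in groups) / (len(groups) - 1)
    msw = sum((s - mean(g)) ** 2 for g in groups for s in g) / (
        len(groups) * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Toy per-note F1 scores for two hypothetical annotators.
scores_md = [0.90, 0.92, 0.88, 0.91]
scores_grad = [0.80, 0.83, 0.79, 0.82]
icc = icc1([scores_md, scores_grad])  # high: annotators differ systematically
```

A high ICC means scores cluster strongly within annotators, which is precisely why the reviewer belongs in the model as a random effect rather than being pooled away.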
3.6.2 Efficiency Measurements
The main metric of efficiency was the average annotation time, in minutes, to finish an annotation task for each clinical note. A mixed-effects model, similar to the models described in Section 3.6.1, was considered with annotation time as the response variable, along with computation of the ICC. In addition, we considered the total number of keyboard presses and mouse clicks to better understand the user activities required to operate the software. The keyboard/mouse interactions and the audio/video recordings were logged using the RUI tool[62,63] and the OBS Studio tool,[64] respectively. Based on our assumption that the keyboard presses and mouse clicks were proportional to the length of a clinical note, counts were normalized per word to serve as our metric for user activities.
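The per-word normalization amounts to a single ratio, sketched below with hypothetical counts; its reciprocal gives the "words reviewed per press/click" form used when these numbers are reported in the Results.

```python
# Per-word normalization of user activity. Counts here are hypothetical,
# chosen so the ratio matches the 0.094 actions/word reported for BRAT.
def activity_per_word(presses, clicks, words):
    return (presses + clicks) / words

rate = activity_per_word(presses=300, clicks=170, words=5000)  # 0.094
words_per_action = 1 / rate                                     # ~10.6
```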
3.6.3 Usefulness/Satisfaction Measurements
This measurement consisted of scale surveys and qualitative interviews. The usability scale survey was a 7-point (1–7 Likert scale, 7 being best) questionnaire,[65] with 6 questions related to Perceived Usefulness and 5 questions related to Perceived Ease of Use. The satisfaction scale survey was a 10-point (0–9 Likert scale, 9 being best) questionnaire,[66] with 6 questions for Terminology and System Information, 6 questions for Learning, 6 questions for Overall Reaction to the Software, 4 questions for Screen, and 5 questions for Terminology and System Information. After each annotation session, the users filled out the surveys on the system and expressed any comments they had about critical events, such as not being able to find a button, in an interview meeting.
4. RESULTS
4.1 Correctness Results
The results of the correctness comparison of the two systems are shown in Table 2. In general, CLEAN improved the precision, recall, and F1-score for both note- and sentence-level evaluations. The comparison results using boxplots (Figure 4 (a) and (b)) further indicated that CLEAN improved the correctness significantly, with resulting P-values < 0.001 and 0.004 at the note and sentence levels, respectively.
The F1-scores of the cNLP pre-annotation are given below. At the note level, the cNLP pre-annotation F1-scores were 0.695 for the clinical notes in BRAT and 0.791 for the notes in CLEAN. At the sentence level, the cNLP pre-annotation F1-scores were 0.529 for the clinical notes in BRAT and 0.500 for the notes in CLEAN. That is, after the user review via CLEAN-AT, the F1-score of CLEAN improved from 0.791 to 0.896 at the note level, and from 0.500 to 0.719 at the sentence level. Note that in our experiments CLEAN exploited the cNLP pre-annotated mentions while BRAT did not. It should also be noted that the KD clinical notes included an outlier data point, which generated zero precision, recall, and F1-score and was replaced using interpolation.
Figure 4. (a) Boxplot for the note-level F1-score. (b) Boxplot for the sentence-level F1-score. (c) Boxplot for annotation time.
Table 2. Comparison of the gold standard with annotation results from BRAT and CLEAN. The metrics were note-level and sentence-level averaged precision, recall, and F1-score, with 95% confidence intervals (95% CI).

Level | Software | Precision (95% CI) | Recall (95% CI) | F1-score (95% CI)
Note | BRAT | 0.855 (0.826 to 0.885) | 0.816 (0.780 to 0.852) | 0.820 (0.794 to 0.846)
Note | CLEAN | 0.895 (0.870 to 0.920) | 0.913 (0.890 to 0.936) | 0.896 (0.876 to 0.916)
Sentence | BRAT | 0.630 (0.582 to 0.677) | 0.614 (0.565 to 0.664) | 0.616 (0.568 to 0.665)
Sentence | CLEAN | 0.727 (0.689 to 0.764) | 0.723 (0.684 to 0.763) | 0.719 (0.681 to 0.757)
[Figure 4 panels: (a) note-level F1, P-value = 3.7e-06; (b) sentence-level F1, P-value = 0.003566; (c) annotation time in minutes, P-value = 0.1875; n = 84 notes per software (BRAT, CLEAN) in each panel.]
Finally, the intraclass correlation (ICC) within an annotator was 0.263 and 0.272 for the note- and sentence-level F1-scores, respectively. The final mixed-effects models showed that both levels of F1-score were most strongly related to software differences, with a P-value < 0.001. The note-level F1-score was also related to concept frequency and condition. The details of the mixed-effects modeling results are shown in Table 3. The model plots of the note- and sentence-level F1-scores are shown in Figure 5 and Figure A.3 in Appendix A, respectively. At the note level, the F1-score improvement of CLEAN over BRAT was more salient and the P-value was also much lower.
Figure 5. The results of the linear mixed-effects model using the note-level F1-score as the response and annotators as a random effect. The software, concept frequency, and condition effects are included according to the highest relevancy shown in Table 3.
[Figure 5 plot: note-level F1-score versus concept frequency, faceted by annotator (graduate student, MD) and condition (CHF, KD), with points colored by software (BRAT, CLEAN).]
Table 3. Final linear mixed-effects model using the note- and sentence-level F1-scores as responses, and annotators as a random effect. The most relevant factors are shown in bold text.

Level | Effect | Estimate | Standard Error | P-value
Note | (Intercept) | 0.725989 | 0.043538 | 0.000001
Note | Software | 0.070426 | 0.014759 | 0.000004
Note | Concept Frequency | 0.020996 | 0.007027 | 0.003235
Note | Condition | 0.037190 | 0.015439 | 0.017095
Sentence | (Intercept) | 0.615911 | 0.061387 | 0.006664
Sentence | Software | 0.102887 | 0.028570 | 0.000418
4.2 Efficiency Results
The average annotation time in minutes per note for BRAT was 8.286 (95% confidence interval 6.788 to 9.783), while the time for CLEAN was 7.262 (95% confidence interval 5.859 to 8.665). The comparison results using boxplots (Figure 4 (c)) showed that CLEAN improved the annotation time, but not significantly (P = 0.188).
Also, the ICC within an annotator was 0.418. The final mixed-effects model indicated that annotation time was most strongly related to clinical note length, with P-value < 0.001. The second most significant effect was the software difference, with P-value = 0.013. Furthermore, the annotation time was related to condition and concept frequency. The details of the model are shown in Table 4 and Figure 6, while the detailed time analysis results are shown in Table A.1 in Appendix A.
Figure 6. The results of the linear mixed-effects model using annotation time as the response and annotators as a random effect. The length, software, and condition effects are included according to the highest relevancy shown in Table 4.
[Figure 6 plot: annotation time (minutes) versus note length, faceted by annotator (graduate student, MD) and condition (CHF, KD), with points colored by software (BRAT, CLEAN).]
Table 4. Final linear mixed-effects model using annotation time as the response and annotators as a random effect. The top relevant factors are shown in bold text.

Effect | Estimate | Standard Error | P-value
(Intercept) | -26.888250 | 4.491174 | 0.000007
Software | -1.716550 | 0.685926 | 0.013298
Length | 3.346265 | 0.482188 | 0.000000
Concept Frequency | 0.851998 | 0.429080 | 0.048720
Condition | -1.812721 | 0.832729 | 0.030905
The statistics of user activities, including keyboard presses and mouse clicks, are shown in Table A.2 in Appendix A. The averaged and normalized user activity for BRAT was 0.094 (or 10.6 words reviewed per press/click), and for CLEAN it was 0.076 (or 13.2 words reviewed per press/click). Therefore, CLEAN reduced both normalized mouse clicks and normalized keyboard presses. Also, Figure A.4 in Appendix A illustrates the results per annotator, showing that CLEAN decreased the normalized user activities compared to BRAT for both annotators.
4.3 Perceived Usefulness and User Satisfaction Results
The summary of the survey results is shown in Table A.3 in Appendix A, while the full survey results are shown in Figure A.5 in Appendix A. The interview results, including critical events and overall feedback on BRAT and CLEAN, are described in Appendix A.3. Our findings based on the questionnaire results and interviews are as follows:
• Perceived Usefulness and Ease of Use. The physician annotator found CLEAN substantially more useful than BRAT, with an average difference of 3.600 in satisfaction score values over the two categories, while the graduate student annotator reported a negligible difference, with an average difference of 0.215.
• User Interface Satisfaction. While the physician annotator perceived a higher level of satisfaction with CLEAN compared to BRAT, with an average difference of 4.204 in satisfaction score values over the five categories, the graduate student annotator showed a modest difference (0.364).
The advantages of BRAT reported by the annotators included the gentle learning curve of the tool, the ability to easily navigate to the next notes to annotate, and the auto-save feature for work progress. However, the BRAT interface showed inefficiency as a system: the feedback speed to user interaction was slow, and finding a data element for annotation required many clicks. The participants had to click multiple times to select a text span, often producing unwanted text selections that made them cancel the operation and restart the task. The text position often shifted after annotations, making it difficult for the participants to reorient themselves to where they were working in the annotation process.
The reported benefits of CLEAN included a quick learning curve and ease of use, similar to BRAT. In addition, the participants found pre-annotation, a feature unique to CLEAN, helpful in facilitating their annotation process. The tool allowed easy addition or deletion of annotations and provided immediate feedback to each user interaction, leading to efficient annotation tasks. The participants also saw potential for CLEAN to be used as a predictive analysis tool. The participants additionally suggested improvements for CLEAN: allowing single-click word selection, providing more keyboard-assisted data element selection, enabling a font size adjustment feature, and including more medication synonyms.
5. DISCUSSION
Based on the results from our comparative study, we found that, for the two testers, CLEAN could improve the correctness (precision, recall, and F1-score) of annotation and increase annotation quality. In particular, CLEAN improved the note-level F1-score according to the corresponding mixed-effects model. CLEAN also maintained the same level of annotation efficiency. There was no significant difference in note-level precision, as the confidence intervals of BRAT and CLEAN overlapped. However, at the sentence level, the difference in precision was significant. Also, the perceived efficiency of the tool was higher for CLEAN than for BRAT according to the interview results. Therefore, for high-granularity annotation tasks that look into each sentence, CLEAN can help annotators provide more precise annotation results.
CLEAN could also lower the number of required user activities and decrease the annotation time for a physician annotator, but the time decrement was not significant in general. From the surveys and interviews, the physician annotator reported more usefulness/satisfaction for CLEAN compared to the graduate student. We hypothesize that this may be because the student annotator needed to spend a substantial amount of time looking up potential keyword categories and confirming many of the cNLP pre-annotated mentions.
One limitation of our study is that we did not test how the pre-annotation correctness rate affects a user's experience. However, we predict that with more accurate cNLP pre-annotation results, the user experience and annotation quality will improve. In addition to the small number of notes, our experiment included only one user of each type (domain expert or novice), and thus these results may not generalize well to all users within a type, such as all graduate students. Finally, our current gold standard annotations were created by one annotator and may require further validation.
6. CONCLUSION
This study evaluated the CLEAN cNLP system, which included CLEAN-EP to automatically generate pre-annotations and CLEAN-AT for annotators to review the machine-generated pre-annotations. The study compared CLEAN with BRAT and found that CLEAN demonstrated improved correctness and better usefulness/satisfaction, especially for a physician annotator, while retaining the same level of efficiency. CLEAN could help address the bottleneck of annotation in the cNLP concept extraction pipeline to support phenotyping and cohort identification.
ACKNOWLEDGEMENT
Part of the de-identified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2, and SUNY. The computational infrastructure was provided by the iDASH National Center for Biomedical Computing funded by U54HL108460 and managed by the Clinical Translational Research Institute CTSA Informatics team led by Antonios Koures, PhD, funded in part by UL1TR001442.
FUNDING STATEMENT
Tsung-Ting Kuo, Jina Huh, Robert El-Kareh, Gordon Lin, Michele E. Day, Lucila Ohno-Machado,
and Chun-Nan Hsu are partially funded through a Patient-Centered Outcomes Research Institute
(PCORI) Award (CDRN-1306-04819). The statements in this article are solely the responsibility of
the authors and do not necessarily represent the views of PCORI, its Board of Governors or
Methodology Committee. Part of the de-identified clinical records used in this research were
provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and
were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by
Dr. Ozlem Uzuner, i2b2 and SUNY. The computational infrastructure was provided by the iDASH
National Center for Biomedical Computing funded by U54HL108460.
COMPETING INTERESTS STATEMENT
The authors have no competing interests to declare.
CONTRIBUTORSHIP STATEMENT
T-TK designed and implemented the system, conducted the literature review, collected the data,
developed the annotation guideline, provided training sessions, performed experiments,
analyzed the results, and drafted the manuscript. JH provided feedback on the study and system
design, suggested critical directions for the efficiency, usability and satisfaction evaluations,
performed experiments, and edited the manuscript. JK provided feedback on the study design,
conducted sample size estimation, performed the mixed-effects model analysis, provided
insights to present the results, and edited the manuscript. RE-K provided feedback on the study
and system design, developed the annotation guideline, annotated the gold standards, provided
critical discussion points, and edited the manuscript. SS and SFF provided feedback on the study
and system design, developed the annotation guideline, evaluated the software systems, made
suggestions for system efficiency, usability and satisfaction improvement, and edited the
manuscript. VK and GL implemented the system and edited the manuscript. MED provided
feedback on the idea and edited the manuscript. LO-M was principal investigator of the project;
provided overall supervision of the project and critical editing of the manuscript. C-NH provided
the original idea, annotation guideline suggestions, critical discussion points, overall supervision
of the study, and critical editing of the manuscript.
APPENDIX A
A.1 The CLEAN Ensemble Pipeline (CLEAN-EP)
The CLEAN clinical natural language processing (cNLP) system consists of an ensemble pipeline
(CLEAN-EP) as shown in Figure A.1. CLEAN-EP was described in our previous work.[1] The inputs
of the pipeline were data elements and clinical notes determined by the study plan, while the
outputs were the pre-annotated clinical notes stored on a virtual machine (VM) ready for review.
The pipeline consists of four steps:
• Preprocessor. CLEAN-EP converts the clinical notes to plain text, transforms them to UTF-8
encoding, and splits them into sentences by using the Stanford CoreNLP library.[61]
• Toolkit. CLEAN-EP included three general-purpose cNLP tools: (1) cTAKES (clinical Text
Analysis and Knowledge Extraction System),[2] a cNLP tool for information extraction from
free-text clinical notes in EHRs; (2) MetaMap,[3] a tool for recognizing UMLS concepts in text;
and (3) MedEx,[4] a tool specialized in extracting mentions of medications. These tools cover
a wide range of applications of cNLP for clinical note information extraction and serve the
basic needs of clinical note processing. CLEAN-EP also integrated two specialized cNLP tools
for pSCANNER[22] conditions: KD-NLP,[8] a tool specialized in identifying clinical signs of KD;
and ROCK (Rules for Obesity, Congestive heart failure, and Kawasaki disease), a newly
developed cNLP tool specialized in extracting common data elements for WM/O,
CHF, and KD. The pipeline of ROCK consisted of sentence splitting, rule-based tagging using
regular expressions, negation detection with NegEx,[67] and synonym identification using
SNOMED-CT[56] synonyms, LOINC[57] component names, RxNorm[58] trade names, as well
as known names for KD found on four public-domain medical websites: WebMD,[68]
MedScape,[69] RxList,[70] and Drugs.com.[71] Note that cTAKES exploits the Dictionary
Lookup Fast Pipeline[72] and the built-in concept dictionary, which is a subset of UMLS[59]
including SNOMED-CT,[56] RxNorm,[58] and all synonyms. It should also be noted that our
previous study only included cTAKES and MetaMap,[1] while CLEAN-EP currently integrates
five cNLP tools: cTAKES, MetaMap, MedEx, KD-NLP, and ROCK.
• Ensembler. CLEAN-EP adopted the Union ensemble method, reported to consistently
outperform a single cNLP tool.[1] The Union method focuses on improving the coverage of
the cNLP results by extracting a data element if at least one of the cNLP tools identifies that
data element.[1]
• Postprocessor. CLEAN-EP generates the pre-annotations in the format of BRAT[21] for the
evaluation processes.
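As a rough illustration of the Union ensemble step (a minimal sketch, not the actual CLEAN-EP code; the per-tool output format and the example data elements are assumptions), a note's pre-annotations can be formed by taking the union of the data elements each tool extracts:

```python
# Minimal sketch of the Union ensemble: a data element is kept if at
# least one cNLP tool extracted it from the note. Tool outputs are
# modeled as sets of data-element names (an assumed, simplified format).

def union_ensemble(tool_outputs):
    """Union of the data elements extracted by each tool for one note."""
    merged = set()
    for elements in tool_outputs.values():
        merged |= elements
    return merged

# Hypothetical per-tool extractions for a single clinical note:
outputs = {
    "cTAKES":  {"beta-blockers", "past medical history"},
    "MetaMap": {"beta-blockers", "ejection fraction"},
    "MedEx":   {"Coumadin"},
}

pre_annotations = union_ensemble(outputs)
# A data element found by any single tool (e.g. "Coumadin") survives,
# which is why the Union method improves coverage (recall).
```

This favors recall over precision by design: any false positive from any tool also survives, which is acceptable here because annotators review and delete wrong pre-annotations in CLEAN-AT.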
Figure A.1. The flowchart of the CLEAN ensemble pipeline (CLEAN-EP).[1] The inputs were data
elements and clinical notes, while the outputs were the pre-annotated clinical notes. The original
ensemble pipeline[1] contained two cNLP tools: cTAKES[2] (clinical Text Analysis and Knowledge
Extraction System) and MetaMap.[3] To increase the extraction coverage, CLEAN-EP integrated
three additional cNLP tools: MedEx,[4] KD-NLP,[8] and a newly developed tool, ROCK (Rules for
Obesity, Congestive heart failure, and Kawasaki disease). Also, CLEAN adopted the Union
ensemble method.
[Figure A.1 diagram: Data Elements and Clinical Notes feed a Preprocessor (Format Converter, Encoding Converter, Sentence Splitter), followed by a Toolkit (cTAKES, MetaMap, MedEx, KD-NLP, ROCK), an Ensembler (Union), and a Postprocessor (Annotation Tags), producing Pre-Annotated Clinical Notes.]
Figure A.2. An example showing the completion status of clinical notes in the CLEAN annotation
tool (CLEAN-AT). A user could recheck completed notes (in this example, the upper three notes)
by clicking the orange "Recheck" buttons, and could review uncompleted notes (in this example,
the lower five notes) by clicking the blue "Review" buttons.
[Figure A.2 screenshot: lists of Completed Notes and Uncompleted Notes.]
A.2 The Note-Level and Sentence-Level F1-Scores
At note level, an annotated mention of a data element was regarded as a true positive if the data
element appeared in the gold standard annotations for the same note. At sentence level, an
annotated data element mention was a true positive if it appeared in the same sentence as any
gold standard annotation for the data element. In our experiment, data element annotations
were considered binary at both levels, and therefore multiple annotations of a data element
would only be counted as a single true positive. For example, consider the following sentence: "The
patient is on beta-blockers and Coumadin, continue beta-blockers." Although both mentions of
the medication concept "beta-blockers" are correctly annotated, the true positive count is still
one while computing the sentence-level F1-score. The same rule also holds for the computation
of the note-level F1-score. The note-level F1-score of a clinical note was defined as the harmonic
mean of precision P and recall R, formulated as 2 * P * R / (P + R), where P = (# of true positive
annotated data elements) / (# of unique data elements annotated in the note), and R = (# of true
positive annotated data elements) / (# of unique gold standard data elements in the note). The
sentence-level F1-score of a clinical note was defined as Σ(Fi) / N, where N = # of sentences in
the clinical note, Fi = 2 * Pi * Ri / (Pi + Ri) for each sentence i, Pi = (# of true positive annotated
data elements in sentence i) / (# of unique data elements annotated in sentence i), and Ri = (# of
true positive annotated data elements in sentence i) / (# of unique gold standard data elements
in sentence i).
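The definitions above can be sketched in a few lines (a simplified illustration, not CLEAN's evaluation code; annotated and gold annotations are modeled as sets of unique data elements, which matches the binary counting rule):

```python
# Note-level and sentence-level F1 as defined above, with annotations
# treated as binary sets of unique data elements (duplicate mentions of
# the same data element collapse into one true positive).

def f1(annotated, gold):
    """Harmonic mean of precision and recall over sets of data elements."""
    if not annotated or not gold:
        return 0.0
    tp = len(annotated & gold)   # true positive data elements
    p = tp / len(annotated)      # precision
    r = tp / len(gold)           # recall
    return 2 * p * r / (p + r) if p + r else 0.0

def sentence_level_f1(note):
    """Mean of the per-sentence F1 scores over the N sentences of a note."""
    return sum(f1(a, g) for a, g in note) / len(note)

# Hypothetical note with two sentences, each as (annotated, gold) sets:
note = [
    ({"beta-blockers", "Coumadin"}, {"beta-blockers", "Coumadin"}),  # F1 = 1.0
    ({"past medical history"}, {"ejection fraction"}),               # F1 = 0.0
]
# sentence_level_f1(note) averages the per-sentence scores: 0.5 here.
```

The note-level score is computed the same way via `f1`, but over the sets of unique data elements pooled across the whole note rather than per sentence.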
Figure A.3. The results of the linear mixed-effects model using sentence-level F1-score as
response and annotators as random effect.
[Figure A.3 plot: scatter panels of sentence-level F1 (y-axis, 0.0 to 0.9) against concept frequency (ConceptFreq, x-axis, 2 to 6), with columns for the GradStud and MD annotators, rows for the CHF and KD conditions, and point colors distinguishing BRAT from CLEAN.]
A.3 The Interview Results
The critical events and overall feedback of BRAT and CLEAN are as follows:
• Interview for critical events of BRAT. The reported/observed issues are: (1) no pre-
annotations as "red flags" existed to assist the annotation process; (2) users tended to
highlight a term (or even a large paragraph of the text) accidentally, and thus required an
additional cancel operation; and (3) if users zoomed the browser window in for larger text,
then the browser cut off the popup window at the bottom.
• Interview for critical events of CLEAN. The reported/observed issues are: (1) it constantly pre-
annotated some terms as the wrong data element, for example all "history" mentions were
pre-annotated as "past medical history," thus the users were required to repeatedly
delete/fix these data elements; (2) the coverage of medication pre-annotation could be
improved; and (3) based on the current design, copy-and-paste did not work well while trying
to query the concepts via search engines like Google.
• Interview for overall feedback of BRAT. The annotators reported the following advantages of
BRAT: (1) it was easy to learn; (2) it was simple and straightforward to understand, without
too many bells and whistles; (3) it provided navigation to next notes with simple clicking; and
(4) it provided an auto-saving feature. The annotators reported the following burdens of BRAT:
(1) the processing was slow; (2) the users needed to click many times to select annotation
text; (3) after annotation, the screen did not remain at the site of annotation, but rather
shifted back; and (4) searching for data elements, especially medication ones, slowed down
the annotation process.
• Interview for overall feedback of CLEAN. The annotators reported the following benefits of
CLEAN: (1) it was very quick and easy to learn; (2) it did a good job of pre-annotation; (3)
adding and deleting an annotation were easy; (4) the user felt more interaction while
annotating; (5) it was easy to use, assuming the accurate learning aspect of the software; (6)
it was a great tool for annotating clinical notes, especially on a condition basis; and (7) it
provided great potential for predictive analysis. The users reported the following obstacles:
(1) it would be helpful if, with a single click on a word (instead of clicking-and-selecting the
word followed by a right click), that word got highlighted with the annotation data element
reminder opened automatically; (2) in selecting the annotation, it might be helpful to allow
keyboard use to select an option once the spelling has started; (3) users could not adjust font
size easily; and (4) it might be helpful to add more medication definitions and synonyms to
the dictionaries of the cNLP tools to increase the pre-annotation coverage.
Figure A.4. The results of normalized user activities, including keyboard presses and mouse clicks,
for physician (MD) and graduate student (GradStud) annotators. The activity count was
normalized by the length of the clinical note (per word).
[Figure A.4 bar chart: length-normalized user activities (activities per word, 0.00 to 0.14), with mouse and keyboard counts for BRAT (GradStud), CLEAN (GradStud), BRAT (MD), and CLEAN (MD).]
Figure A.5. Full results of the usefulness/satisfaction surveys, completed by the graduate student
(GradStud) and M.D. (MD) annotators. Scores are listed in the order: BRAT (GradStud), CLEAN (GradStud), BRAT (MD), CLEAN (MD).

Usefulness (1-7), Perceived Usefulness:
• Using the system in my job would enable me to accomplish tasks more quickly: 6, 7, 2, 7
• Using the system would improve my job performance: 7, 7, 2, 7
• Using the system in my job would increase my productivity: 7, 7, 2, 7
• Using the system would enhance my effectiveness on the job: 5, 7, 2, 7
• Using the system would make it easier to do my job: 6, 7, 2, 7
• I would find the system useful in my job: 6, 7, 2, 7

Usefulness (1-7), Perceived Ease of Use:
• Learning to operate the system would be easy for me: 7, 7, 5, 7
• I would find it easy to get the system to do what I want it to do: 7, 6, 4, 6
• My interaction with the system would be clear and understandable: 7, 6, 4, 6
• I would find the system to be flexible to interact with: 7, 7, 3, 6
• I would find the system easy to use: 7, 7, 4, 6

Satisfaction (0-9), Terminology and System Information:
• Use of terms throughout system: inconsistent (0) - consistent (9): 9, 9, 4, 8
• Terminology related to task: never (0) - always (9): 7, 5, 3, 8
• Position of messages on screen: inconsistent (0) - consistent (9): 9, 9, 4, 8
• Prompts for input: confusing (0) - clear (9): 9, 9, 3, 8
• Computer informs about its progress: never (0) - always (9): N/A, 0, 1, 7
• Error messages: unhelpful (0) - helpful (9): N/A, 9, 2, 8

Satisfaction (0-9), Learning:
• Learning to operate the system: difficult (0) - easy (9): 9, 9, 5, 8
• Exploring new features by trial and error: difficult (0) - easy (9): 9, 9, 3, 8
• Remembering names and use of commands: difficult (0) - easy (9): 5, N/A, 4, 8
• Performing tasks is straightforward: never (0) - always (9): 9, 9, 4, 8
• Help messages on the screen: unhelpful (0) - helpful (9): N/A, 9, 3, 8
• Supplemental reference materials: confusing (0) - clear (9): 8, 9, 3, 8

Satisfaction (0-9), Overall Reaction to the Software:
• terrible (0) - wonderful (9): 8, 8, 3, 8
• difficult (0) - easy (9): 8, 9, 7, 8
• frustrating (0) - satisfying (9): 9, 9, 4, 8
• inadequate power (0) - adequate power (9): 8, 7, 4, 8
• dull (0) - stimulating (9): 8, 9, 3, 8
• rigid (0) - flexible (9): 9, 9, 2, 8

Satisfaction (0-9), Screen:
• Reading characters on the screen: hard (0) - easy (9): 9, 9, 3, 8
• Highlighting simplifies task: not at all (0) - very much (9): 5, 9, 3, 8
• Organization of information: confusing (0) - very clear (9): 9, 7, 3, 8
• Sequence of screens: confusing (0) - very clear (9): 9, 9, 5, 8

Satisfaction (0-9), System Capabilities:
• System speed: too slow (0) - fast enough (9): 7, 9, 2, 7
• System reliability: unreliable (0) - reliable (9): 8, 9, 3, 7
• System tends to be: noisy (0) - quiet (9): 5, 9, 7, 9
• Correcting your mistakes: difficult (0) - easy (9): 9, 9, 3, 6
• Designed for all levels of users: never (0) - always (9): 9, 9, 3, 7
Table A.1. Time analysis results (GradStud = Graduate Student, MD = Medical Doctor).

Condition | Annotator | Software | Average Time per Note (minutes) | Standard Deviation of Time per Note (minutes)
Congestive Heart Failure (CHF) | GradStud | BRAT | 9.962 | 5.611
Congestive Heart Failure (CHF) | GradStud | CLEAN | 11.808 | 8.266
Congestive Heart Failure (CHF) | MD | BRAT | 6.154 | 4.315
Congestive Heart Failure (CHF) | MD | CLEAN | 3.500 | 2.045
Kawasaki Disease (KD) | GradStud | BRAT | 13.813 | 10.647
Kawasaki Disease (KD) | GradStud | CLEAN | 10.063 | 5.360
Kawasaki Disease (KD) | MD | BRAT | 3.500 | 2.191
Kawasaki Disease (KD) | MD | CLEAN | 3.188 | 1.471
Table A.2. User activities for BRAT and CLEAN, averaged for both graduate student and physician
users and normalized by the length of the clinical note (per word).

Software | Mouse Count | Keyboard Count | Total Count | Word Count | Mouse Normalized Count | Keyboard Normalized Count | Total Normalized Count
BRAT | 2,629.5 | 1,553.0 | 4,182.5 | 44,219 | 0.059 | 0.035 | 0.094
CLEAN | 2,057.0 | 1,667.5 | 3,724.5 | 48,998 | 0.042 | 0.034 | 0.076
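The normalized counts in Table A.2 are simply the raw activity counts divided by the total word count. A minimal sketch (values taken from the table; the rounding convention is an assumption, so the published totals may differ slightly in the last digit):

```python
# Length-normalize user activity counts (per word), as in Table A.2.
def normalize(mouse, keyboard, words):
    """Return (mouse, keyboard, total) activity counts per word."""
    return (round(mouse / words, 3),
            round(keyboard / words, 3),
            round((mouse + keyboard) / words, 3))

brat = normalize(2629.5, 1553.0, 44219)    # per-word rates for the BRAT row
clean = normalize(2057.0, 1667.5, 48998)   # per-word rates for the CLEAN row
```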
Table A.3. Summary of the usefulness/satisfaction survey results for physician (MD) and graduate
student (GradStud) annotators. The upper part summarizes the survey results for perceived
usefulness and ease of use, while the lower part summarizes the results for user interface satisfaction. Our
study excluded four satisfaction survey questions with N/A answers when computing the average
scores for each category of questions.

Survey (Scale) | Category | BRAT (GradStud) | CLEAN (GradStud) | BRAT (MD) | CLEAN (MD)
Usefulness (1-7) | Perceived Usefulness | 6.17 | 7.00 | 2.00 | 7.00
Usefulness (1-7) | Perceived Ease of Use | 7.00 | 6.60 | 4.00 | 6.20
Satisfaction (0-9) | Overall Reaction to the Software | 8.33 | 8.50 | 3.83 | 8.00
Satisfaction (0-9) | Screen | 8.00 | 8.50 | 3.50 | 8.00
Satisfaction (0-9) | Terminology and System Information | 8.50 | 8.00 | 3.50 | 8.00
Satisfaction (0-9) | Learning | 8.75 | 9.00 | 3.75 | 8.00
Satisfaction (0-9) | System Capabilities | 7.60 | 9.00 | 3.60 | 7.20
REFERENCES
1. Kuo T-T, Rao P, Maehara C, et al. Ensembles of NLP Tools for Data Element Extraction from Clinical Notes. AMIA Annual Symposium, 2016.
2. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association: JAMIA 2010;17(5):507-13 doi:10.1136/jamia.2009.001560
3. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 2010;17(3):229-36
4. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association 2010;17(1):19-24
5. Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP). http://clamp.uth.edu/ (accessed February 6, 2017).
6. Patterson OV, Jones M, Yao Y, et al. Extraction of Vital Signs from Clinical Notes. Studies in health technology and informatics 2014;216:1035-35
7. Garvin JH, DuVall SL, South BR, et al. Automated extraction of ejection fraction for quality measurement using regular expressions in Unstructured Information Management Architecture (UIMA) for heart failure. Journal of the American Medical Informatics Association: JAMIA 2012;19(5):859-66 doi:10.1136/amiajnl-2011-000535
8. Doan S, Maehara CK, Chaparro JD, et al. Building a Natural Language Processing Tool to Identify Patients With High Clinical Suspicion for Kawasaki Disease from Emergency Department Notes. Academic Emergency Medicine 2016;23(5):628-36
9. South BR, Shen S, Leng J, Forbush TB, DuVall SL, Chapman WW. A prototype tool set to support machine-assisted annotation. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing: Association for Computational Linguistics, 2012:130-39.
10. Byrd RJ, Steinhubl SR, Sun J, Ebadollahi S, Stewart WF. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. International Journal of Medical Informatics 2014;83(12):983-92 doi:10.1016/j.ijmedinf.2012.12.005
11. Chen Y, Lasko TA, Mei Q, Denny JC, Xu H. A study of active learning methods for named entity recognition in clinical text. Journal of biomedical informatics 2015;58:11-18
12. Lingren T, Deleger L, Molnar K, et al. Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements. Journal of the American Medical Informatics Association 2013;21(3):406-13 doi:10.1136/amiajnl-2013-001837
13. Ogren PV, Savova GK, Chute CG. Constructing evaluation corpora for automated clinical named entity recognition. Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems: IOS Press, 2007:2325.
14. South BR, Mowery D, Suo Y, et al. Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text. Journal of biomedical informatics 2014;50:162-72
15. Grouin C, Névéol A. De-identification of clinical notes in French: towards a protocol for reference corpus development. Journal of Biomedical Informatics 2014;50:151-61 doi:10.1016/j.jbi.2013.12.014
16. Escudié J-B, Jannot A-S, Zapletal E, et al. Reviewing 741 patients records in two hours with FASTVISU. AMIA, 2015.
17. Duvall SL, Forbush TB, Cornia RC, et al. Reducing the manual burden of medical record review through informatics. Pharmacoepidemiology and Drug Safety 2014;23:415
18. Savova GK, Chapman WW, Zheng J, Crowley RS. Anaphoric relations in the clinical narrative: corpus creation. Journal of the American Medical Informatics Association 2011;18(4):459-65
19. SL D, RC C, TB F, CH H, V. PO. Check it with Chex: A Validation Tool for Iterative NLP Development. AMIA Annual Symposium proceedings / AMIA Symposium, 2014.
20. Chex. http://department-of-veterans-affairs.github.io/Leo/components.html - /Chex (accessed February 6, 2017).
21. Stenetorp P, Pyysalo S, Topic G, Ohta T, Ananiadou S, Tsujii Ji. BRAT: a web-based tool for NLP-assisted text annotation. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: Association for Computational Linguistics, 2012:102-07.
22. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered Scalable National Network for Effectiveness Research. Journal of the American Medical Informatics Association 2014;21(4):621-26
23. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings / AMIA Annual Symposium 2001:17-21
24. McCray AT, Trevvett P, Frost HR. Modeling the autism spectrum disorder phenotype. Neuroinformatics 2014;12(2):291-305
25. Two approaches to integrating phenotype and clinical information. AMIA; 2009.
26. Feupe SF, Lin K, Kuo T-T, et al. Review and Evaluation of the State of Standardization of Computable Phenotype. AMIA Annual Symposium, 2016.
27. Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association 2011;18(5):580-87
28. Esuli A, Marcheggiani D, Sebastiani F. An enhanced CRFs-based system for information extraction from radiology reports. Journal of biomedical informatics 2013;46(3):425-35
29. Chen Q, Li H, Tang B, et al. An automatic system to identify heart disease risk factors in clinical texts over time. Journal of biomedical informatics 2015;58:S158-S63
30. Doan S, Collier N, Xu H, Duy PH, Phuong TM. Recognition of medication information from discharge summaries using ensembles of classifiers. BMC medical informatics and decision making 2012;12(1):1
31. Kang N, Afzal Z, Singh B, Van Mulligen EM, Kors JA. Using an ensemble system to improve concept extraction from clinical records. Journal of biomedical informatics 2012;45(3):423-28
32. Kim Y, Riloff E. A Stacked Ensemble for Medical Concept Extraction from Clinical Notes. AMIA Joint Summits on Translational Science Proceedings 2015;2015
33. MTSamples. http://www.mtsamples.com/ (accessed August 19, 2015).
34. Galitz WO. The essential guide to user interface design: an introduction to GUI design principles and techniques: John Wiley & Sons, 2007.
35. Smith SL, Mosier JN. Guidelines for designing user interface software: Mitre Corporation Bedford, MA, 1986.
36. Holzinger A. Usability engineering methods for software developers. Communications of the ACM 2005;48(1):71-74
37. A javascript library for building user interfaces - React. https://facebook.github.io/react/ (accessed October 17, 2016).
38. Heart Failure Fact Sheet. https://www.cdc.gov/dhdsp/data_statistics/fact_sheets/fs_heart_failure.htm (accessed June 12, 2017).
39. About Kawasaki Disease. https://www.cdc.gov/kawasaki/about.html (accessed June 12, 2017).
40. Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 2007;14(5):550-63
41. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association 2008;15(1):14-24
42. Uzuner Ö. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association 2009;16(4):561-70
43. Uzuner Ö, Solti I, Xia F, Cadag E. Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association 2010;17(5):519-23
44. Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. Journal of the American Medical Informatics Association 2010;17(5):514-18
45. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 2011;18(5):552-56
46. Uzuner O, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association 2012;19(5):786-91
47. Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association 2013;20(5):806-13
48. Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics 2015;58:S20-S29
49. Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of biomedical informatics 2015;58:S11-S19
50. Stubbs A, Uzuner Ö. Annotating risk factors for heart disease in clinical narratives for diabetic patients. Journal of biomedical informatics 2015;58:S78-S91
51. Stubbs A, Kotfila C, Xu H, Uzuner Ö. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2. Journal of biomedical informatics 2015;58:S67-S77
52. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Scientific data 2016;3
53. Suominen H, Salanterä S, Velupillai S, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2013. International Conference of the Cross-Language Evaluation Forum for European Languages: Springer, 2013:212-31.
54. Kelly L, Goeuriot L, Suominen H, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2014. International Conference of the Cross-Language Evaluation Forum for European Languages: Springer, 2014:172-91.
55. Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000;101(23):e215-e20
56. Stearns MQ, Price C, Spackman KA, Wang AY. SNOMED clinical terms: overview of the development process and project status. Proceedings of the AMIA Symposium: American Medical Informatics Association, 2001:662.
57. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clinical chemistry 2003;49(4):624-33
58. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT professional 2005;7(5):17-23
59. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 2004;32(suppl 1):D267-D70
60. Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, NJ 1988:20-26
61. Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D. The Stanford CoreNLP Natural Language Processing Toolkit. ACL (System Demonstrations), 2014:55-60.
62. Kukreja U, Stevenson WE, Ritter FE. RUI: Recording user input from interfaces under Windows and Mac OS X. Behavior Research Methods 2006;38(4):656-59
63. Morgan JH, Cheng C-Y, Pike C, Ritter FE. A design, tests and considerations for improving keystroke and mouse loggers. Interacting with Computers 2013:iws014
64. Open Broadcaster Software (OBS) Studio. https://obsproject.com/ (accessed June 8, 2016).
65. Davis FD. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS quarterly 1989:319-40
66. Chin JP, Diehl VA, Norman KL. Development of an instrument measuring user satisfaction of the human-computer interface. Proceedings of the SIGCHI conference on Human factors in computing systems: ACM, 1988:213-18.
67. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics 2001;34(5):301-10
68. WebMD. http://www.webmd.com/ (accessed August 3, 2015).
69. MedScape. http://www.medscape.com/ (accessed August 3, 2015).
70. RxList. http://www.rxlist.com/ (accessed August 3, 2015).
71. Drugs.com. http://www.drugs.com/ (accessed August 3, 2015).
72. HealthNLP. http://healthnlp.github.io/examples/ (accessed August 19, 2015).