The Impact of Automatic Pre-annotation in Clinical Note Data Element Extraction
— the CLEAN Tool
Tsung-Ting Kuo, PhD1, Jina Huh, PhD1, Jihoon Kim, MS1, Robert El-Kareh, MD, MS, MPH1,
Siddharth Singh, MD1, Stephanie Feudjio Feupe, MSc1, Vincent Kuri, MS2, Gordon Lin, MS2,
Michele E. Day, PhD1, Lucila Ohno-Machado, MD, PhD1, and Chun-Nan Hsu, PhD1,*
1 UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
2 Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
* 9500 Gilman Dr, San Diego, CA 92093, USA; [email protected]; +1 (858) 822-4931.
Keywords: Natural Language Processing and Text Mining, Human-Computer Interaction, Usability Testing, Pre-Annotation, Clinical Note Data Element Extraction
Word count: 4,000
ABSTRACT
Objective. Annotation is expensive but essential for clinical note review and clinical natural language processing (cNLP). However, the extent to which computer-generated pre-annotation benefits human annotation is still an open question. Our study introduces CLEAN (CLinical note rEview and ANnotation), a pre-annotation-based cNLP annotation system designed to improve clinical note annotation of data elements, and comprehensively compares CLEAN with the widely used annotation system Brat Rapid Annotation Tool (BRAT).
Materials and Methods. CLEAN includes an ensemble pipeline (CLEAN-EP) with a newly developed annotation tool (CLEAN-AT). A domain expert and a novice user/annotator participated in a comparative usability test by tagging 87 data elements related to Congestive Heart Failure (CHF) and Kawasaki Disease (KD) cohorts in 84 public notes.
Results. CLEAN achieved a higher note-level F1-score (0.896) than BRAT (0.820), with a significant difference in correctness (P-value < 0.001); the most strongly associated factor was system/software (P-value < 0.001). No significant difference (P-value 0.188) in annotation time was observed between CLEAN (7.262 minutes/note) and BRAT (8.286 minutes/note). The time difference was mostly associated with note length (P-value < 0.001) and system/software (P-value 0.013). The expert reported CLEAN to be useful/satisfactory, while the novice reported slight improvements.
Discussion. CLEAN improves the correctness of annotation and increases usefulness/satisfaction at the same level of efficiency. Limitations include the untested impact of the pre-annotation correctness rate, the small sample size, the small number of users, and a gold standard with limited validation.
Conclusion. CLEAN with pre-annotation can be beneficial for an expert dealing with complex annotation tasks involving numerous and diverse target data elements.
1. BACKGROUND AND SIGNIFICANCE
Clinical notes with unstructured narrative, such as progress notes, radiology reports, and discharge summaries, are one of the most information-rich, under-utilized sources of healthcare data.[1] Critical aspects of clinical quality are often described in the free-text notes of electronic health record (EHR) systems. These important aspects can be used to improve healthcare delivery/management, clinical/translational research, and ultimately patient health.
Clinical Natural Language Processing (cNLP) is dedicated to developing tools and systems to extract such useful information from medical text. Widely used cNLP tools include cTAKES (clinical Text Analysis and Knowledge Extraction System),[2] MetaMap,[3] MedEx,[4] CLAMP (Clinical Language Annotation, Modeling, and Processing Toolkit),[5] Vitals,[6] EFEx,[7] KD-NLP,[8] and a few others.[9-11] These cNLP tools can extract various types of information such as conditions and medications.
Annotation in cNLP refers to the process of manually identifying the mentions of data elements of target signs, symptoms, events, etc. to be extracted in clinical notes. Annotation, although imperfect, is an important process that provides: (1) quality control for final cNLP output data, (2) gold standards to evaluate the performance of cNLP tools, and (3) training examples to develop and improve cNLP tools. Specifically, training examples are essential for cNLP tools based on supervised machine learning, as well as for the development of rule-based cNLP tools.[9-20]
Annotation, however, is also the bottleneck of the whole development process of cNLP tools,[21] especially when numerous and diverse data elements are to be extracted.[1] From our experience in the patient-centered SCAlable National Network for Effectiveness Research (pSCANNER) project,[22] an experienced clinical annotator required an average of 15 to 30 minutes to annotate a clinical note for an annotation task involving the tagging of 41 data elements.
Intuitively, pre-annotation by a cNLP tool before a manual review might help improve the correctness and efficiency of the annotation process.[12] In pre-annotation, the mentions of the target data elements are identified by a cNLP tool. These elements serve as suggestions to a human annotator, so that the annotator can review and revise the pre-annotated mentions ("pre-annotations") instead of starting the annotation process from scratch. However, previous cNLP studies of pre-annotation showed mixed and inconsistent results in terms of correctness and efficiency on tasks that included named entity recognition,[12,13] de-identification,[14,15] patient record chart review,[16,17] corpus creation,[18] and NLP output validation.[19,20] None of the studies considered the tailored design of a user interface (UI) to take advantage of pre-annotation. In the study of named entity recognition, some authors reported positive results of pre-annotation[12]: "Time savings [of pre-annotations] ranged from 13.85% to 21.5% per entity. Inter-annotator agreement (IAA) ranged from 93.4% to 95.5%. … The time savings were statistically significant. Moreover, the pre-annotation did not reduce the IAA or annotator performance." However, other authors provided contrasting outcomes[13]: "We found little benefit to [pre-annotating] the corpus with a third-party named entity recognizer … the annotators [who were] given the [MetaMap Transfer (MMTx) tool, now MetaMap][23] annotations, A1 and A3, annotated slower than the other two annotators, A2 and A4. … There was also no clear trend that the MMTx annotation improved pair-wise IAA between individuals." Most studies only focused on a limited number of target data elements to be extracted. For example, the authors of these studies[12,13] named 5 and 10 concepts, respectively.
2. OBJECTIVE
Our goal was to take advantage of pre-annotation to improve the annotation process of clinical notes when numerous and diverse data elements necessary to support phenotyping and cohort identification are involved. In other words, we sought to design, implement, and evaluate a software system that leverages pre-annotation to improve the annotation quality of clinical notes with a large number (>50) of target data elements. This situation commonly occurs in cNLP use cases for phenotyping and cohort identification.[24-26] To this end, we developed the CLinical note rEview and ANnotation (CLEAN) cNLP system and comprehensively compared CLEAN with the widely used annotation system Brat Rapid Annotation Tool (BRAT).[21]
3. MATERIALS AND METHODS
3.1 The CLEAN cNLP System
The overall architecture of the CLEAN cNLP system is shown in Figure 1. The input to CLEAN includes a study plan specifying the target disease/condition, categorized data elements, and criteria used to select a set of clinical notes from an institutional electronic health record (EHR) data warehouse. Since clinical notes contain protected health information (PHI), we deposit them in a secure, privacy-preserving computation environment that also stores data element definitions and categories.
The CLEAN ensemble pipeline (CLEAN-EP) pre-annotates the clinical notes using Union,[1] which extracts a data element if any of the cNLP tools identifies it. Union integrates four cNLP tools (cTAKES,[2] MetaMap,[3] MedEx,[4] and KD-NLP[8]), as well as our newly developed cNLP tool ROCK (Rules for Obesity, Congestive heart failure, and Kawasaki disease), described in Appendix A.1. We previously discovered that Union showed the highest recall/sensitivity among all of the ensemble methods in our pilot study.[1] Thus, when using Union as the ensemble method, annotators are provided with high-coverage pre-annotated data element mentions in the clinical note text, instead of searching through the full text to identify missing elements. That is, annotators can focus on correcting the pre-annotations instead of adding missing mentions. A benefit of applying an ensemble pipeline like CLEAN-EP is improved system flexibility to reuse and share cNLP tools.[27-32] Details about CLEAN-EP are described in Appendix A.1.
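The Union strategy above can be sketched in a few lines. This is an illustrative sketch, not the actual CLEAN-EP code: tool outputs are modeled as sets of (start, end, concept ID) tuples, and the tool names and example concept IDs are hypothetical. A mention is kept if any component tool identifies it, which maximizes recall at some cost to precision.

```python
# Illustrative sketch of the Union ensemble: keep a mention if ANY tool
# found it. Mentions are modeled as (start_offset, end_offset, concept_id).
def union_ensemble(tool_outputs):
    """Merge per-tool mention sets by set union."""
    merged = set()
    for mentions in tool_outputs:
        merged |= set(mentions)
    return merged

# Hypothetical outputs from the component tools for one note.
ctakes = {(10, 13, "C0042834")}
metamap = {(10, 13, "C0042834"), (40, 45, "C0018802")}
medex = set()
kd_nlp = {(40, 45, "C0018802")}

pre_annotations = union_ensemble([ctakes, metamap, medex, kd_nlp])
# Contains both mentions even though no single tool found them all.
```

Because the merge is a plain set union, adding or swapping a component tool requires no change to the ensemble logic, which is one reason such a pipeline is easy to reuse and share.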
Figure 1. The CLinical note rEview and ANnotation (CLEAN) clinical Natural Language Processing (cNLP) system. After the study plan was confirmed, the inputs were the selected data elements and the queried clinical notes from the electronic health record (EHR) data warehouse, both stored on a Virtual Machine (VM). Then, the built-in CLEAN ensemble pipeline (CLEAN-EP) pre-annotated the notes, followed by the user's review using the CLEAN annotation tool (CLEAN-AT). The outputs were the reviewed annotations of the data elements on the clinical notes, and CLEAN stored the final results back on the VM.
Finally, an annotator reviews and corrects the pre-annotations using the CLEAN annotation tool (CLEAN-AT). CLEAN-AT saves the final resulting annotations and allows annotators to re-check and refine the annotations if required. Details of CLEAN-AT are given in Section 3.2.
3.2 The CLEAN Annotation Tool (CLEAN-AT)
The UI design of CLEAN-AT is illustrated in Figure 2, with the following main features:
• Annotation editor (right panel). This main working space of CLEAN-AT shows the clinical note as well as all annotations on the note. A user can edit the annotations and review results on this panel. In this study, the terms user and annotator are used interchangeably.
• Data element browser (left panel). Clicking a data element or category in the left panel quickly identifies all annotated mentions of that data element or category in the clinical note shown in the annotation editor. In this example (Figure 2), "Natriuretic Peptides" was clicked in the data element browser to identify the synonym mention "BNP" in the clinical note text. Conversely, the mentions in the clinical note could be clicked in the annotation editor to highlight the corresponding data element in bold in the data element browser.
Figure 2. The CLEAN annotation tool (CLEAN-AT). The left panel (data element browser) shows all the data elements, while the right panel (annotation editor) is for annotation adding, deleting, and modifying via the pop-up menu (data element reminder). The bottom panel (function buttons) allows users to undo/redo editing, save progress, save the note as complete, and proceed to the next note. The clinical note shown here is from publicly available MT Sample notes.[33]
• Data element reminder (a popup menu while tagging the text). After right-clicking on a selected mention in the annotation editor, a popup menu appears with a list of all data elements grouped by category. A user can use the keyboard as an auto-completion shortcut to filter possible data elements and thus speed up the annotation editing process. For the example shown in Figure 2, if the user types "n," CLEAN-AT would display "Natriuretic Peptides" and "NT-proBNP," the data elements whose names start with "n." The user can then choose from the filtered data elements to annotate or re-annotate quickly.
• Undo/redo (function buttons in the bottom panel). CLEAN-AT supports an unlimited number of undo and redo operations, which allows users to edit and delete quickly, knowing that they can easily recover from mistakes. Note that deletions are also considered operations, and therefore users can undo/redo deletions. Supporting undo/redo operations follows an important principle of user interface design to reduce the distress of new users.[34-36] However, few existing annotation tools for cNLP support undo/redo, because the program must record every user interaction. Also, the program must maintain a stack data structure to save the entire history of previous user interactions, which must be accessible any time the user attempts to undo/redo the last operations. If the software did not include the undo/redo function in the initial design, it would be difficult to add this feature because the whole program might need to be thoroughly rewritten. CLEAN-AT implements this important feature based on the JavaScript library React.[37]
• Overview of the completion status of clinical notes. As shown in Figure A.2 in Appendix A, the user can start a new annotation process, continue an incomplete annotation process, or recheck a previously completed annotation process, for any clinical note in a target corpus. Although the overview page shows the overall information for all notes, CLEAN-AT brings up the next note for annotation after the user completes the annotation of a clinical note, instead of bringing up this overview page. This feature was designed to effectively keep a user engaged.
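The undo/redo mechanism described above can be sketched with two stacks. This is an illustrative Python analogue, not the actual CLEAN-AT implementation (which is a React application); the class and operation names are hypothetical. Every edit, including a deletion, is an operation pushed onto an undo stack, and redo replays operations popped by undo.

```python
# Minimal two-stack undo/redo over annotation operations. An operation is
# a ("add" | "remove", annotation) pair; undo applies the inverse operation.
class AnnotationHistory:
    def __init__(self):
        self.annotations = set()
        self._undo, self._redo = [], []

    def _apply(self, op):
        kind, ann = op
        (self.annotations.add if kind == "add" else self.annotations.discard)(ann)

    def do(self, op):
        self._apply(op)
        self._undo.append(op)
        self._redo.clear()  # a new edit invalidates the redo branch

    def undo(self):
        if self._undo:
            kind, ann = self._undo.pop()
            self._apply(("remove" if kind == "add" else "add", ann))
            self._redo.append((kind, ann))

    def redo(self):
        if self._redo:
            op = self._redo.pop()
            self._apply(op)
            self._undo.append(op)

h = AnnotationHistory()
h.do(("add", ("BNP", "Natriuretic Peptides")))
h.do(("remove", ("BNP", "Natriuretic Peptides")))  # deletions are operations too
h.undo()  # the deletion is undone, so the annotation is back
```

Because the full operation history lives on the stacks, this design supports unlimited undo/redo, but it also shows why retrofitting the feature is hard: every mutation must flow through `do()` rather than touching the annotation set directly.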
3.3 Study Material
We evaluated the performance of CLEAN using two cohorts: (1) Congestive Heart Failure (CHF), as an exemplar of a major health condition in the U.S., where the prevalence of heart disease is 5.7M;[38] and (2) Kawasaki Disease (KD), which represents a rare yet acute disease with an estimated 5,447 U.S. hospitalizations in 2009.[39] Both conditions are use-case conditions for the pSCANNER clinical data research network.[22] We used the clinical note pool and target data elements described in our previous work[1] with newly annotated gold standards, which are further described later in this subsection.
Our pool of clinical notes contained notes from public domain datasets,[1] including MT Samples,[33] i2b2 Challenges 2009–2012 and 2014,[40-51] and ShARe/CLEF eHealth 2013 Tasks 1 and 2 (which are now part of the MIMIC III clinical database[52]) and 2014 Task 1.[53-55] We collected the corpus of notes for CHF or KD by selecting notes that contained "congestive heart failure" or "Kawasaki OR (fever AND rash AND red AND child)," respectively. Because KD is a relatively rare disease,[8,39] we included MT Sample clinical notes containing "fever" to increase the corpus size for KD. The resulting corpus consisted of 635 notes for CHF and 33 notes for KD.
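The keyword criteria above can be expressed as simple boolean tests over the note text. This is a hedged sketch with hypothetical helper names; the actual selection ran as queries over the pooled public datasets rather than the toy list used here.

```python
# Sketch of the keyword-based corpus selection. The KD criterion
# "Kawasaki OR (fever AND rash AND red AND child)" becomes a boolean
# test over the lowercased note text.
def matches_chf(text):
    return "congestive heart failure" in text.lower()

def matches_kd(text):
    t = text.lower()
    return "kawasaki" in t or all(w in t for w in ("fever", "rash", "red", "child"))

# Hypothetical example notes standing in for the pooled datasets.
notes = [
    "Pt with congestive heart failure, EF 25%.",
    "5-day fever, polymorphous rash, red cracked lips in a young child.",
    "Routine follow-up, no acute complaints.",
]
chf_corpus = [n for n in notes if matches_chf(n)]
kd_corpus = [n for n in notes if matches_kd(n)]
```

In the study itself, notes matching these filters were further augmented for KD by a manual review of MT Samples notes containing "fever," since keyword filters alone left the KD corpus small.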
In our previous study,[1] CHF and KD subject matter experts had identified the target data elements, which were then mapped to standardized concept IDs defined in SNOMED-CT,[56] LOINC,[57] RxNorm,[58] or UMLS.[59] After mapping, the normalized output data elements of the cNLP tools were ready to serve as inputs to the CLEAN-EP ensemble pipeline.
For each clinical note, an experienced physician annotated the mentions of each data element as the gold standard. During the annotation process, the physician followed the cNLP annotation guidelines we developed. The physician used the CLEAN-EP pre-annotations and the CLEAN-AT annotation tool to annotate the gold standard data elements.
3.4 Study Design
The target users of CLEAN are clinical researchers, front-line clinicians, and their supporting staff members, who would like to identify a cohort of patients from an EHR system for their scientific and/or quality improvement projects. Therefore, our inclusion criteria for a test user were: (1) an employee or a student of UCSD who was ≥18 years old, and (2) who had worked or was soon going to work on cohort identification using annotated clinical notes. Based on these criteria, two approved test users with the required training certificates (HIPAA¹ and CITI²) were selected to participate in our study: a practicing physician with clinical training (a domain expert), and a graduate student in the Department of Biomedical Informatics with basic biomedical knowledge (a domain novice). Our annotation tasks involved keyboard, mouse, audio, and video recording to precisely log all interactions, with timestamps, between the test user and the software. The Institutional Review Board (IRB) at UCSD approved this study as Project Number 160410 on April 21, 2016.
A two-sample paired t-test was used to compare the mean difference in correctness and efficiency measured with the two systems on the same clinical notes. The assumed effect size was 0.5, computed from the means of the correctness metric (F1-score in our study; 0.8 and 0.7 for CLEAN and BRAT, respectively) and a standard deviation of 0.2 for both systems, based on the experimental results of our previous work.[1] The estimated statistical power was 78%, with effect size 0.5, significance level 0.05, and per-group size 32.[60] Our study therefore had the required sample size to evaluate CLEAN and BRAT with sufficient statistical power: at least 32 clinical notes for each cohort.
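The power figure above can be sanity-checked by hand. The sketch below uses a normal approximation (via `math.erf`) for the paired t-test power, which slightly overestimates the exact noncentral-t power of about 78% reported in the text; it is a back-of-the-envelope check, not the study's actual power calculation.

```python
# Approximate power of a paired t-test: effect size d = 0.5, two-sided
# alpha = 0.05, n = 32 notes per group, using a normal approximation.
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def paired_t_power_approx(d, n, alpha=0.05):
    z_crit = 1.959964          # two-sided 5% normal critical value
    ncp = d * math.sqrt(n)     # noncentrality parameter
    return normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)

power = paired_t_power_approx(0.5, 32)  # ~0.81 by this approximation
```

Using the t distribution with 31 degrees of freedom instead of the normal approximation brings the estimate down toward the 78% the authors report.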
Based on the estimation of sample size, Figure 3 illustrates our process to select the clinical notes in the test datasets from the corpus of notes, which contained 635 notes for CHF and 33 notes for KD. First, 32 KD notes were randomly selected, such that the notes could be divided equally between CLEAN and BRAT. Then, 52 clinical notes for the CHF cohort were randomly selected, for a total of 84 clinical notes as the test dataset for our comparative studies (Table 1).
¹ Health Insurance Portability and Accountability Act
² Collaborative Institutional Training Initiative
The subject matter experts had enumerated 87 data elements for the CHF (50) and KD (37) cohorts.[1] Within the 84 test notes, the experienced physician annotated 1,542 mentions of these 87 data elements as our gold standard. We split the notes in the test dataset into 6,702 sentences using the Stanford CoreNLP library.[61] The statistics of the test dataset are shown in the last row of Table 1.
Figure 3. The clinical note test dataset selection process for congestive heart failure (CHF) and Kawasaki disease (KD) conditions.
Table 1. Statistics of the total test dataset (the last row) and the two test sets (ID = 1 and 2) of clinical notes. One test set (ID = 1) was used to evaluate BRAT, while the other (ID = 2) was used to evaluate CLEAN. Concept frequency indicates the total number of gold standard annotations in the clinical notes in a test set.

Test Set ID | Evaluated Software | Notes | Words | Average Words per Note | Standard Deviation Words per Note | Concept Frequency
1 | BRAT | 42 | 44,219 | 1,053 | 676 | 724
2 | CLEAN | 42 | 48,998 | 1,167 | 706 | 818
Total (Test Dataset) | | 84 | 93,217 | 1,110 | 689 | 1,542
[Figure 3 flowchart, from clinical note pool to final test dataset: Public domain clinical notes collected from MT Samples, i2b2 Challenges 2009–2012 and 2014, and ShARe/CLEF eHealth 2013 Tasks 1 and 2 and 2014 Task 1 (N = 6,135). Clinical note corpus per condition: notes for the CHF cohort filtered using "congestive heart failure" (N = 635); notes for the KD cohort filtered using "Kawasaki OR (fever AND rash AND red AND child)," as well as manual review using "fever" for the MT Samples dataset (N = 33). Clinical note test dataset per condition: sampled randomly for a total of ≥32 clinical notes per cohort to achieve 78% statistical power with a significance level of 0.05 and effect size of 0.5 (CHF N = 52; KD N = 32). Clinical note test dataset: combined as the test dataset for the comparative studies (N = 84).]
3.5 Evaluation Protocol
The 84 clinical notes in the test dataset were randomly divided into two test sets, each with 42 notes (26 CHF and 16 KD). One test set was used to evaluate BRAT, while the other was used to evaluate CLEAN. Table 1 describes the statistics of the two test sets.
Both users evaluated each software package with the same test set and the same order of clinical notes, but the order in which the software packages were evaluated was randomly assigned. The randomization assigned user-1 to annotate the notes using CLEAN first and then BRAT for CHF (but BRAT first for KD), while user-2 annotated the notes using BRAT first and then CLEAN for CHF (but CLEAN first for KD).
Each evaluation session lasted about 1.5 to 2 hours. Before the annotation, both users attended a 20-minute training session on the cNLP annotation guidelines, the data element tables, and the usage of each annotation software package. The users could rest at any time during the annotation process or completely cease the session as needed.
3.6 Evaluation Measurements
The comparative study of CLEAN and BRAT used three types of measurements: correctness, efficiency, and usefulness/satisfaction, which are discussed in Sections 3.6.1, 3.6.2, and 3.6.3, respectively.
3.6.1 Correctness Measurements
The gold standards allowed us to assess the quality of the user-reviewed annotations. For this purpose, our evaluation metric was the F1-score, computed at the note level and the sentence level. The detailed definitions of the note-level and sentence-level F1-scores are described in Appendix A.2.
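As an illustration of one plausible reading of a note-level F1-score (the authoritative definitions are in Appendix A.2 of the paper), the sketch below compares, per note, the set of data elements annotated by the user against the gold standard, then averages the per-note F1 across notes. All note IDs and element names are hypothetical.

```python
# Hedged sketch of a note-level F1 computation: per-note set comparison
# of annotated data elements, averaged over notes.
def f1(gold, predicted):
    if not gold and not predicted:
        return 1.0
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def note_level_f1(gold_by_note, pred_by_note):
    scores = [f1(gold_by_note[n], pred_by_note.get(n, set()))
              for n in gold_by_note]
    return sum(scores) / len(scores)

gold = {"note1": {"BNP", "Edema"}, "note2": {"Fever"}}
pred = {"note1": {"BNP"}, "note2": {"Fever", "Rash"}}
score = note_level_f1(gold, pred)
```

A sentence-level variant would apply the same set comparison within each sentence rather than each note, which is why sentence-level scores are typically lower: a mention found in the right note but the wrong sentence no longer counts.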
Two mixed-effects models were considered, one for each response variable (the note- and sentence-level F1-scores), with computation of the intraclass correlation (ICC). A stepwise backward elimination method was applied to select variables from a list of candidates consisting of software (CLEAN versus BRAT), condition (CHF versus KD), length (word count of the clinical note), and concept frequency (number of gold standard annotations in the clinical note), with reviewer (graduate student versus physician) as a random effect and a significance level of 0.05.
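The ICC quantifies how much of the score variance is attributable to the annotator. As an illustration only (the paper's ICCs come from the fitted mixed-effects models, not this estimator), the sketch below computes a one-way ICC(1) via the classic ANOVA estimator on toy scores, with annotator as the grouping factor; all numbers are made up.

```python
# Illustrative one-way ICC(1): (MSB - MSW) / (MSB + (k - 1) * MSW),
# where MSB/MSW are between- and within-group mean squares and k is the
# (equal) group size. Groups = annotators here.
from statistics import mean

def icc1(groups):
    """groups: list of lists of scores, one list per annotator."""
    k = len(groups[0])  # assumes equal group sizes
    grand = mean(s for g in groups for s in g)
    msb = k * sum((mean(g) - grand) ** 2 for g in groups) / (len(groups) - 1)
    msw = sum((s - mean(g)) ** 2 for g in groups for s in g) / (
        len(groups) * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Toy per-note F1 scores for two hypothetical annotators.
scores_md = [0.90, 0.92, 0.88, 0.91]
scores_grad = [0.80, 0.83, 0.79, 0.82]
icc = icc1([scores_md, scores_grad])  # high: annotators differ systematically
```

A high ICC means scores cluster strongly within annotators, which is precisely why the reviewer belongs in the model as a random effect rather than being pooled away.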
3.6.2 Efficiency Measurements
The main metric of efficiency was the average annotation time, in minutes, to finish an annotation task for each clinical note. A mixed-effects model, similar to the models described in Section 3.6.1, was considered with annotation time as the response variable, along with computation of the ICC. In addition, we considered the total number of keyboard presses and mouse clicks to better understand the user activities required to operate the software. The keyboard/mouse interactions and the audio/video recordings were logged using the RUI tool[62,63] and the OBS Studio tool,[64] respectively. Based on our assumption that the keyboard presses and mouse clicks were proportional to the length of a clinical note, counts were normalized per word to serve as our metric for user activities.
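The per-word normalization amounts to a single ratio, sketched below with hypothetical counts; its reciprocal gives the "words reviewed per press/click" form used when these numbers are reported in the Results.

```python
# Per-word normalization of user activity. Counts here are hypothetical,
# chosen so the ratio matches the 0.094 actions/word reported for BRAT.
def activity_per_word(presses, clicks, words):
    return (presses + clicks) / words

rate = activity_per_word(presses=300, clicks=170, words=5000)  # 0.094
words_per_action = 1 / rate                                     # ~10.6
```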
3.6.3 Usefulness/Satisfaction Measurements
This measurement consisted of scale surveys and qualitative interviews. The usability scale survey was a 7-point (1–7 Likert scale, 7 being best) questionnaire,[65] with 6 questions related to Perceived Usefulness and 5 questions related to Perceived Ease of Use. The satisfaction scale survey was a 10-point (0–9 Likert scale, 9 being best) questionnaire,[66] with 6 questions for Terminology and System Information, 6 questions for Learning, 6 questions for Overall Reaction to the Software, 4 questions for Screen, and 5 questions for Terminology and System Information. After each annotation session, the users filled out the surveys on the system and expressed any comments they had about critical events, such as not being able to find a button, in an interview meeting.
4. RESULTS
4.1 Correctness Results
The results of the correctness comparison of the two systems are shown in Table 2. In general, CLEAN improved the precision, recall, and F1-score for both note- and sentence-level evaluations. The comparison results using boxplots (Figure 4 (a) and (b)) further indicated that CLEAN improved the correctness significantly, with resulting P-values < 0.001 and 0.004 at the note and sentence levels, respectively.
The F1-scores of the cNLP pre-annotation are given below. At the note level, the cNLP pre-annotation F1-scores were 0.695 for the clinical notes in BRAT and 0.791 for the notes in CLEAN. At the sentence level, the cNLP pre-annotation F1-scores were 0.529 for the clinical notes in BRAT and 0.500 for the notes in CLEAN. That is, after the user review via CLEAN-AT, the F1-score of CLEAN improved from 0.791 to 0.896 at the note level, and from 0.500 to 0.719 at the sentence level. Note that in our experiments CLEAN exploited the cNLP pre-annotated mentions while BRAT did not. It should also be noted that the KD clinical notes included an outlier data point, which generated zero precision, recall, and F1-score and was replaced using interpolation.
Figure 4. (a) Boxplot for the note-level F1-score. (b) Boxplot for the sentence-level F1-score. (c) Boxplot for annotation time.
Table 2. Comparison of the gold standard with annotation results from BRAT and CLEAN. The metrics were note-level and sentence-level averaged precision, recall, and F1-score, with 95% confidence intervals (95% CI).

Level | Software | Precision (95% CI) | Recall (95% CI) | F1-score (95% CI)
Note | BRAT | 0.855 (0.826 to 0.885) | 0.816 (0.780 to 0.852) | 0.820 (0.794 to 0.846)
Note | CLEAN | 0.895 (0.870 to 0.920) | 0.913 (0.890 to 0.936) | 0.896 (0.876 to 0.916)
Sentence | BRAT | 0.630 (0.582 to 0.677) | 0.614 (0.565 to 0.664) | 0.616 (0.568 to 0.665)
Sentence | CLEAN | 0.727 (0.689 to 0.764) | 0.723 (0.684 to 0.763) | 0.719 (0.681 to 0.757)
[Figure 4 panels: (a) note-level F1, P-value = 3.7e-06; (b) sentence-level F1, P-value = 0.003566; (c) annotation time in minutes, P-value = 0.1875; n = 84 notes per software (BRAT, CLEAN) in each panel.]
Finally, the intraclass correlation (ICC) within an annotator was 0.263 and 0.272 for the note- and sentence-level F1-scores, respectively. The final mixed-effects models showed that both levels of F1-score were most strongly related to software differences, with a P-value < 0.001. The note-level F1-score was also related to concept frequency and condition. The details of the mixed-effects modeling results are shown in Table 3. The model plots of the note- and sentence-level F1-scores are shown in Figure 5 and Figure A.3 in Appendix A, respectively. At the note level, the F1-score improvement of CLEAN over BRAT was more salient and the P-value was also much lower.
Figure 5. The results of the linear mixed-effects model using the note-level F1-score as the response and annotators as a random effect. The software, concept frequency, and condition effects are included according to the highest relevancy shown in Table 3.
[Figure 5 plot: note-level F1-score versus concept frequency, faceted by annotator (graduate student, MD) and condition (CHF, KD), with points colored by software (BRAT, CLEAN).]
Table 3. Final linear mixed-effects model using the note- and sentence-level F1-scores as responses, and annotators as a random effect. The most relevant factors are shown in bold text.

Level | Effect | Estimate | Standard Error | P-value
Note | (Intercept) | 0.725989 | 0.043538 | 0.000001
Note | Software | 0.070426 | 0.014759 | 0.000004
Note | Concept Frequency | 0.020996 | 0.007027 | 0.003235
Note | Condition | 0.037190 | 0.015439 | 0.017095
Sentence | (Intercept) | 0.615911 | 0.061387 | 0.006664
Sentence | Software | 0.102887 | 0.028570 | 0.000418
4.2 Efficiency Results
The average annotation time in minutes per note for BRAT was 8.286 (95% confidence interval 6.788 to 9.783), while the time for CLEAN was 7.262 (95% confidence interval 5.859 to 8.665). The comparison results using boxplots (Figure 4 (c)) showed that CLEAN improved the annotation time, but not significantly (P = 0.188).
Also, the ICC within an annotator was 0.418. The final mixed-effects model indicated that annotation time was most strongly related to clinical note length, with P-value < 0.001. The second most significant effect was the software difference, with P-value = 0.013. Furthermore, the annotation time was related to condition and concept frequency. The details of the model are shown in Table 4 and Figure 6, while the detailed time analysis results are shown in Table A.1 in Appendix A.
Figure 6. The results of the linear mixed-effects model using annotation time as the response and annotators as a random effect. The length, software, and condition effects are included according to the highest relevancy shown in Table 4.
[Figure 6 plot: annotation time (minutes) versus note length, faceted by annotator (graduate student, MD) and condition (CHF, KD), with points colored by software (BRAT, CLEAN).]
Table 4. Final linear mixed-effects model using annotation time as the response and annotators as a random effect. The top relevant factors are shown in bold text.

Effect | Estimate | Standard Error | P-value
(Intercept) | -26.888250 | 4.491174 | 0.000007
Software | -1.716550 | 0.685926 | 0.013298
Length | 3.346265 | 0.482188 | 0.000000
Concept Frequency | 0.851998 | 0.429080 | 0.048720
Condition | -1.812721 | 0.832729 | 0.030905
The statistics of user activities, including keyboard presses and mouse clicks, are shown in Table A.2 in Appendix A. The averaged and normalized user activity for BRAT was 0.094 (or 10.6 words reviewed per press/click), and for CLEAN it was 0.076 (or 13.2 words reviewed per press/click). Therefore, CLEAN reduced both normalized mouse clicks and normalized keyboard presses. Also, Figure A.4 in Appendix A illustrates the results per annotator, showing that CLEAN decreased the normalized user activities compared to BRAT for both annotators.
4.3 Perceived Usefulness and User Satisfaction Results
The summary of the survey results is shown in Table A.3 in Appendix A, while the full survey results are shown in Figure A.5 in Appendix A. The interview results, including critical events and overall feedback on BRAT and CLEAN, are described in Appendix A.3. Our findings based on the questionnaire results and interviews are as follows:
• Perceived Usefulness and Ease of Use. The physician annotator found CLEAN substantially more useful than BRAT, with an average difference of 3.600 in satisfaction score values over the two categories, while the graduate student annotator reported a negligible difference, with an average difference of 0.215.
• User Interface Satisfaction. While the physician annotator perceived a higher level of satisfaction with CLEAN compared to BRAT, with an average difference of 4.204 in satisfaction score values over the five categories, the graduate student annotator showed a modest difference (0.364).
The advantages of BRAT reported by the annotators included the gentle learning curve of the tool, the ability to easily navigate to the next notes to annotate, and the auto-save feature for work progress. However, the BRAT interface showed inefficiency as a system: the feedback speed to user interaction was slow, and finding a data element for annotation required many clicks. The participants had to click multiple times to select a text span, often producing unwanted text selections that made them cancel the operation and restart the task. The text position often shifted after annotations, making it difficult for the participants to reorient themselves to where they were working in the annotation process.
The reported benefits of CLEAN included a quick learning curve and ease of use, similar to BRAT. In addition, the participants found pre-annotation, a feature unique to CLEAN, helpful in facilitating their annotation process. The tool allowed easy addition or deletion of annotations and provided immediate feedback to each user interaction, leading to efficient annotation tasks. The participants also saw potential for CLEAN to be used as a predictive analysis tool. The participants additionally suggested improvements for CLEAN: allowing single-click word selection, providing more keyboard-assisted data element selection, enabling a font size adjustment feature, and including more medication synonyms.
5. DISCUSSION
Based on the results from our comparative study, we found that, for the two testers, CLEAN could improve the correctness (precision, recall, and F1-score) of annotation and increase annotation quality. In particular, CLEAN improved the note-level F1-score according to the corresponding mixed-effects model. CLEAN also maintained the same level of annotation efficiency. There was no significant difference in note-level precision, as the confidence intervals of BRAT and CLEAN overlapped. However, at the sentence level, the difference in precision was significant. Also, the perceived efficiency of the tool was higher for CLEAN than for BRAT according to the interview results. Therefore, for high-granularity annotation tasks that look into each sentence, CLEAN can help annotators provide more precise annotation results.
CLEAN could also lower the number of required user activities and decrease the annotation time for a physician annotator, but the time decrement was not significant in general. From the surveys and interviews, the physician annotator reported more usefulness/satisfaction for CLEAN compared to the graduate student. We hypothesize that this may be because the student annotator needed to spend a substantial amount of time looking up potential keyword categories and confirming many of the cNLP pre-annotated mentions.
One limitation of our study is that we did not test how the pre-annotation correctness rate affects a user's experience. However, we predict that with more accurate cNLP pre-annotation results, the user experience and annotation quality will improve. In addition to the small number of notes, our experiment included only one user of each type (domain expert or novice), and thus these results may not generalize well to all users within a type, such as all graduate students. Finally, our current gold standard annotations were created by one annotator and may require further validation.
6. CONCLUSION
This study evaluated the CLEAN cNLP system, which included CLEAN-EP to automatically generate pre-annotations and CLEAN-AT for annotators to review the machine-generated pre-annotations. The study compared CLEAN with BRAT and found that CLEAN demonstrated improved correctness and better usefulness/satisfaction, especially for a physician annotator, while retaining the same level of efficiency. CLEAN could help address the bottleneck of annotation in the cNLP concept extraction pipeline to support phenotyping and cohort identification.
ACKNOWLEDGEMENT
Part of the de-identified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2, and SUNY. The computational infrastructure was provided by the iDASH National Center for Biomedical Computing funded by U54HL108460 and managed by the Clinical Translational Research Institute CTSA Informatics team led by Antonios Koures, PhD, funded in part by UL1TR001442.
FUNDING STATEMENT
Tsung-Ting Kuo, Jina Huh, Robert El-Kareh, Gordon Lin, Michele E. Day, Lucila Ohno-Machado,
and Chun-Nan Hsu are partially funded through a Patient-Centered Outcomes Research Institute
(PCORI) Award (CDRN-1306-04819). The statements in this article are solely the responsibility of
the authors and do not necessarily represent the views of PCORI, its Board of Governors or
Methodology Committee. Part of the de-identified clinical records used in this research were
provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and
were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by
Dr. Ozlem Uzuner, i2b2 and SUNY. The computational infrastructure was provided by the iDASH
National Center for Biomedical Computing funded by U54HL108460.
COMPETING INTERESTS STATEMENT
The authors have no competing interests to declare.
CONTRIBUTORSHIP STATEMENT
T-TK designed and implemented the system, conducted the literature review, collected the data,
developed the annotation guideline, provided training sessions, performed experiments,
analyzed the results, and drafted the manuscript. JH provided feedback on the study and system
design, suggested critical directions for the efficiency, usability and satisfaction evaluations,
performed experiments, and edited the manuscript. JK provided feedback on the study design,
conducted sample size estimation, performed the mixed-effects model analysis, provided
insights to present the results, and edited the manuscript. RE-K provided feedback on the study
and system design, developed the annotation guideline, annotated the gold standards, provided
critical discussion points, and edited the manuscript. SS and SFF provided feedback on the study
and system design, developed the annotation guideline, evaluated the software systems, made
suggestions for system efficiency, usability and satisfaction improvement, and edited the
manuscript. VK and GL implemented the system and edited the manuscript. MED provided
feedback on the idea and edited the manuscript. LO-M was principal investigator of the project;
provided overall supervision of the project and critical editing of the manuscript. C-NH provided
the original idea, annotation guideline suggestions, critical discussion points, overall supervision
of the study, and critical editing of the manuscript.
APPENDIX A
A.1 The CLEAN Ensemble Pipeline (CLEAN-EP)
The CLEAN clinical natural language processing (cNLP) system consists of an ensemble pipeline
(CLEAN-EP) as shown in Figure A.1. CLEAN-EP was described in our previous work.[1] The inputs
of the pipeline were data elements and clinical notes determined by the study plan, while the
outputs were the pre-annotated clinical notes stored on a virtual machine (VM) ready for review.
The pipeline consists of four steps:
• Preprocessor. CLEAN-EP converts the clinical notes to plain text, transforms them to UTF-8
encoding, and splits them into sentences by using the Stanford CoreNLP library.[61]
• Toolkit. CLEAN-EP included three general-purpose cNLP tools: (1) cTAKES (clinical Text
Analysis and Knowledge Extraction System),[2] a cNLP tool for information extraction from
free-text clinical notes in EHRs; (2) MetaMap,[3] a tool for recognizing UMLS concepts in text;
and (3) MedEx,[4] a tool specialized in extracting mentions of medications. These tools cover
a wide range of applications of cNLP for clinical note information extraction and serve the
basic needs of clinical note processing. CLEAN-EP also integrated two specialized cNLP tools
for pSCANNER[22] conditions: KD-NLP,[8] a tool specialized in identifying clinical signs of KD;
and ROCK (Rules for Obesity, Congestive heart failure, and Kawasaki disease), a newly
developed cNLP tool specialized in extracting common data elements for WM/O,
CHF, and KD. The pipeline of ROCK consisted of sentence splitting, rule-based tagging using
regular expressions, negation detection with NegEx,[67] and synonym identification using
SNOMED-CT[56] synonyms, LOINC[57] component names, RxNorm[58] trade names, as well
as known names for KD found on four public-domain medical websites: WebMD,[68]
MedScape,[69] RxList,[70] and Drugs.com.[71] Note that cTAKES exploits the Dictionary
Lookup Fast Pipeline[72] and the built-in concept dictionary, which is a subset of UMLS[59]
including SNOMED-CT,[56] RxNorm,[58] and all synonyms. It should also be noted that our
previous study only included cTAKES and MetaMap,[1] while CLEAN-EP currently integrates
five cNLP tools: cTAKES, MetaMap, MedEx, KD-NLP, and ROCK.
• Ensembler. CLEAN-EP adopted the Union ensemble method, reported to consistently
outperform a single cNLP tool.[1] The Union method focuses on improving the coverage of
the cNLP results by extracting a data element if at least one of the cNLP tools identifies that
data element.[1]
• Postprocessor. CLEAN-EP generates the pre-annotations in the format of BRAT[21] for the
evaluation processes.
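As a rough illustration of the Union ensemble step (a minimal sketch, not the actual CLEAN-EP code; the per-tool output format and the example data elements are assumptions), a note's pre-annotations can be formed by taking the union of the data elements each tool extracts:

```python
# Minimal sketch of the Union ensemble: a data element is kept if at
# least one cNLP tool extracted it from the note. Tool outputs are
# modeled as sets of data-element names (an assumed, simplified format).

def union_ensemble(tool_outputs):
    """Union of the data elements extracted by each tool for one note."""
    merged = set()
    for elements in tool_outputs.values():
        merged |= elements
    return merged

# Hypothetical per-tool extractions for a single clinical note:
outputs = {
    "cTAKES":  {"beta-blockers", "past medical history"},
    "MetaMap": {"beta-blockers", "ejection fraction"},
    "MedEx":   {"Coumadin"},
}

pre_annotations = union_ensemble(outputs)
# A data element found by any single tool (e.g. "Coumadin") survives,
# which is why the Union method improves coverage (recall).
```

This favors recall over precision by design: any false positive from any tool also survives, which is acceptable here because annotators review and delete wrong pre-annotations in CLEAN-AT.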
Figure A.1. The flowchart of the CLEAN ensemble pipeline (CLEAN-EP).[1] The inputs were data
elements and clinical notes, while the outputs were the pre-annotated clinical notes. The original
ensemble pipeline[1] contained two cNLP tools: cTAKES[2] (clinical Text Analysis and Knowledge
Extraction System) and MetaMap.[3] To increase the extraction coverage, CLEAN-EP integrated
three additional cNLP tools: MedEx,[4] KD-NLP,[8] and a newly developed tool, ROCK (Rules for
Obesity, Congestive heart failure, and Kawasaki disease). Also, CLEAN adopted the Union
ensemble method.
[Figure A.1 diagram: Data Elements and Clinical Notes feed a Preprocessor (Format Converter, Encoding Converter, Sentence Splitter), followed by a Toolkit (cTAKES, MetaMap, MedEx, KD-NLP, ROCK), an Ensembler (Union), and a Postprocessor (Annotation Tags), producing Pre-Annotated Clinical Notes.]
Figure A.2. An example showing the completion status of clinical notes in the CLEAN annotation
tool (CLEAN-AT). A user could recheck completed notes (in this example, the upper three notes)
by clicking the orange "Recheck" buttons, and could review uncompleted notes (in this example,
the lower five notes) by clicking the blue "Review" buttons.
[Figure A.2 screenshot: lists of Completed Notes and Uncompleted Notes.]
A.2 The Note-Level and Sentence-Level F1-Scores
At note level, an annotated mention of a data element was regarded as a true positive if the data
element appeared in the gold standard annotations for the same note. At sentence level, an
annotated data element mention was a true positive if it appeared in the same sentence as any
gold standard annotation for the data element. In our experiment, data element annotations
were considered binary at both levels, and therefore multiple annotations of a data element
would only be counted as a single true positive. For example, consider the following sentence: "The
patient is on beta-blockers and Coumadin, continue beta-blockers." Although both mentions of
the medication concept "beta-blockers" are correctly annotated, the true positive count is still
one while computing the sentence-level F1-score. The same rule also holds for the computation
of the note-level F1-score. The note-level F1-score of a clinical note was defined as the harmonic
mean of precision P and recall R, formulated as 2 * P * R / (P + R), where P = (# of true positive
annotated data elements) / (# of unique data elements annotated in the note), and R = (# of true
positive annotated data elements) / (# of unique gold standard data elements in the note). The
sentence-level F1-score of a clinical note was defined as Σ(Fi) / N, where N = # of sentences in
the clinical note, Fi = 2 * Pi * Ri / (Pi + Ri) for each sentence i, Pi = (# of true positive annotated
data elements in sentence i) / (# of unique data elements annotated in sentence i), and Ri = (# of
true positive annotated data elements in sentence i) / (# of unique gold standard data elements
in sentence i).
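The definitions above can be sketched in a few lines (a simplified illustration, not CLEAN's evaluation code; annotated and gold annotations are modeled as sets of unique data elements, which matches the binary counting rule):

```python
# Note-level and sentence-level F1 as defined above, with annotations
# treated as binary sets of unique data elements (duplicate mentions of
# the same data element collapse into one true positive).

def f1(annotated, gold):
    """Harmonic mean of precision and recall over sets of data elements."""
    if not annotated or not gold:
        return 0.0
    tp = len(annotated & gold)   # true positive data elements
    p = tp / len(annotated)      # precision
    r = tp / len(gold)           # recall
    return 2 * p * r / (p + r) if p + r else 0.0

def sentence_level_f1(note):
    """Mean of the per-sentence F1 scores over the N sentences of a note."""
    return sum(f1(a, g) for a, g in note) / len(note)

# Hypothetical note with two sentences, each as (annotated, gold) sets:
note = [
    ({"beta-blockers", "Coumadin"}, {"beta-blockers", "Coumadin"}),  # F1 = 1.0
    ({"past medical history"}, {"ejection fraction"}),               # F1 = 0.0
]
# sentence_level_f1(note) averages the per-sentence scores: 0.5 here.
```

The note-level score is computed the same way via `f1`, but over the sets of unique data elements pooled across the whole note rather than per sentence.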
Figure A.3. The results of the linear mixed-effects model using sentence-level F1-score as
response and annotators as random effect.
[Figure A.3 plot: scatter panels of sentence-level F1 (y-axis, 0.0 to 0.9) against concept frequency (ConceptFreq, x-axis, 2 to 6), with columns for the GradStud and MD annotators, rows for the CHF and KD conditions, and point colors distinguishing BRAT from CLEAN.]
A.3 The Interview Results
The critical events and overall feedback of BRAT and CLEAN are as follows:
• Interview for critical events of BRAT. The reported/observed issues are: (1) no pre-
annotations as "red flags" existed to assist the annotation process; (2) users tended to
highlight a term (or even a large paragraph of the text) accidentally, and thus required an
additional cancel operation; and (3) if users zoomed the browser window in for larger text,
then the browser cut off the popup window at the bottom.
• Interview for critical events of CLEAN. The reported/observed issues are: (1) it constantly pre-
annotated some terms as the wrong data element, for example all "history" mentions were
pre-annotated as "past medical history," thus the users were required to repeatedly
delete/fix these data elements; (2) the coverage of medication pre-annotation could be
improved; and (3) based on the current design, copy-and-paste did not work well while trying
to query the concepts via search engines like Google.
• Interview for overall feedback of BRAT. The annotators reported the following advantages of
BRAT: (1) it was easy to learn; (2) it was simple and straightforward to understand, without
too many bells and whistles; (3) it provided navigation to next notes with simple clicking; and
(4) it provided an auto-saving feature. The annotators reported the following burdens of BRAT:
(1) the processing was slow; (2) the users needed to click many times to select annotation
text; (3) after annotation, the screen did not remain at the site of annotation, but rather
shifted back; and (4) searching for data elements, especially medication ones, slowed down
the annotation process.
• Interview for overall feedback of CLEAN. The annotators reported the following benefits of
CLEAN: (1) it was very quick and easy to learn; (2) it did a good job of pre-annotation; (3)
adding and deleting an annotation were easy; (4) the user felt more interaction while
annotating; (5) it was easy to use, assuming the accurate learning aspect of the software; (6)
it was a great tool for annotating clinical notes, especially on a condition basis; and (7) it
provided great potential for predictive analysis. The users reported the following obstacles:
(1) it would be helpful if, with a single click on a word (instead of clicking-and-selecting the
word followed by a right click), that word got highlighted with the annotation data element
reminder opened automatically; (2) in selecting the annotation, it might be helpful to allow
keyboard use to select an option once the spelling has started; (3) users could not adjust font
size easily; and (4) it might be helpful to add more medication definitions and synonyms to
the dictionaries of the cNLP tools to increase the pre-annotation coverage.
Figure A.4. The results of normalized user activities, including keyboard presses and mouse clicks,
for physician (MD) and graduate student (GradStud) annotators. The activity count was
normalized by the length of the clinical note (per word).
[Figure A.4 bar chart: length-normalized user activities (activities per word, 0.00 to 0.14), with mouse and keyboard counts for BRAT (GradStud), CLEAN (GradStud), BRAT (MD), and CLEAN (MD).]
Figure A.5. Full results of the usefulness/satisfaction surveys, completed by the graduate student
(GradStud) and M.D. (MD) annotators. Scores are listed in the order: BRAT (GradStud), CLEAN (GradStud), BRAT (MD), CLEAN (MD).

Usefulness (1-7), Perceived Usefulness:
• Using the system in my job would enable me to accomplish tasks more quickly: 6, 7, 2, 7
• Using the system would improve my job performance: 7, 7, 2, 7
• Using the system in my job would increase my productivity: 7, 7, 2, 7
• Using the system would enhance my effectiveness on the job: 5, 7, 2, 7
• Using the system would make it easier to do my job: 6, 7, 2, 7
• I would find the system useful in my job: 6, 7, 2, 7

Usefulness (1-7), Perceived Ease of Use:
• Learning to operate the system would be easy for me: 7, 7, 5, 7
• I would find it easy to get the system to do what I want it to do: 7, 6, 4, 6
• My interaction with the system would be clear and understandable: 7, 6, 4, 6
• I would find the system to be flexible to interact with: 7, 7, 3, 6
• I would find the system easy to use: 7, 7, 4, 6

Satisfaction (0-9), Terminology and System Information:
• Use of terms throughout system: inconsistent (0) - consistent (9): 9, 9, 4, 8
• Terminology related to task: never (0) - always (9): 7, 5, 3, 8
• Position of messages on screen: inconsistent (0) - consistent (9): 9, 9, 4, 8
• Prompts for input: confusing (0) - clear (9): 9, 9, 3, 8
• Computer informs about its progress: never (0) - always (9): N/A, 0, 1, 7
• Error messages: unhelpful (0) - helpful (9): N/A, 9, 2, 8

Satisfaction (0-9), Learning:
• Learning to operate the system: difficult (0) - easy (9): 9, 9, 5, 8
• Exploring new features by trial and error: difficult (0) - easy (9): 9, 9, 3, 8
• Remembering names and use of commands: difficult (0) - easy (9): 5, N/A, 4, 8
• Performing tasks is straightforward: never (0) - always (9): 9, 9, 4, 8
• Help messages on the screen: unhelpful (0) - helpful (9): N/A, 9, 3, 8
• Supplemental reference materials: confusing (0) - clear (9): 8, 9, 3, 8

Satisfaction (0-9), Overall Reaction to the Software:
• terrible (0) - wonderful (9): 8, 8, 3, 8
• difficult (0) - easy (9): 8, 9, 7, 8
• frustrating (0) - satisfying (9): 9, 9, 4, 8
• inadequate power (0) - adequate power (9): 8, 7, 4, 8
• dull (0) - stimulating (9): 8, 9, 3, 8
• rigid (0) - flexible (9): 9, 9, 2, 8

Satisfaction (0-9), Screen:
• Reading characters on the screen: hard (0) - easy (9): 9, 9, 3, 8
• Highlighting simplifies task: not at all (0) - very much (9): 5, 9, 3, 8
• Organization of information: confusing (0) - very clear (9): 9, 7, 3, 8
• Sequence of screens: confusing (0) - very clear (9): 9, 9, 5, 8

Satisfaction (0-9), System Capabilities:
• System speed: too slow (0) - fast enough (9): 7, 9, 2, 7
• System reliability: unreliable (0) - reliable (9): 8, 9, 3, 7
• System tends to be: noisy (0) - quiet (9): 5, 9, 7, 9
• Correcting your mistakes: difficult (0) - easy (9): 9, 9, 3, 6
• Designed for all levels of users: never (0) - always (9): 9, 9, 3, 7
Table A.1. Time analysis results (GradStud = Graduate Student, MD = Medical Doctor).

Condition | Annotator | Software | Average Time per Note (minutes) | Standard Deviation of Time per Note (minutes)
Congestive Heart Failure (CHF) | GradStud | BRAT | 9.962 | 5.611
Congestive Heart Failure (CHF) | GradStud | CLEAN | 11.808 | 8.266
Congestive Heart Failure (CHF) | MD | BRAT | 6.154 | 4.315
Congestive Heart Failure (CHF) | MD | CLEAN | 3.500 | 2.045
Kawasaki Disease (KD) | GradStud | BRAT | 13.813 | 10.647
Kawasaki Disease (KD) | GradStud | CLEAN | 10.063 | 5.360
Kawasaki Disease (KD) | MD | BRAT | 3.500 | 2.191
Kawasaki Disease (KD) | MD | CLEAN | 3.188 | 1.471
Table A.2. User activities for BRAT and CLEAN, averaged for both graduate student and physician
users and normalized by the length of the clinical note (per word).

Software | Mouse Count | Keyboard Count | Total Count | Word Count | Mouse Normalized Count | Keyboard Normalized Count | Total Normalized Count
BRAT | 2,629.5 | 1,553.0 | 4,182.5 | 44,219 | 0.059 | 0.035 | 0.094
CLEAN | 2,057.0 | 1,667.5 | 3,724.5 | 48,998 | 0.042 | 0.034 | 0.076
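The normalized counts in Table A.2 are simply the raw activity counts divided by the total word count. A minimal sketch (values taken from the table; the rounding convention is an assumption, so the published totals may differ slightly in the last digit):

```python
# Length-normalize user activity counts (per word), as in Table A.2.
def normalize(mouse, keyboard, words):
    """Return (mouse, keyboard, total) activity counts per word."""
    return (round(mouse / words, 3),
            round(keyboard / words, 3),
            round((mouse + keyboard) / words, 3))

brat = normalize(2629.5, 1553.0, 44219)    # per-word rates for the BRAT row
clean = normalize(2057.0, 1667.5, 48998)   # per-word rates for the CLEAN row
```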
Table A.3. Summary of the usefulness/satisfaction survey results for physician (MD) and graduate
student (GradStud) annotators. The upper part summarizes the survey results for perceived
usefulness and ease of use, while the lower part summarizes the results for user interface satisfaction. Our
study excluded four satisfaction survey questions with N/A answers when computing the average
scores for each category of questions.

Survey (Scale) | Category | BRAT (GradStud) | CLEAN (GradStud) | BRAT (MD) | CLEAN (MD)
Usefulness (1-7) | Perceived Usefulness | 6.17 | 7.00 | 2.00 | 7.00
Usefulness (1-7) | Perceived Ease of Use | 7.00 | 6.60 | 4.00 | 6.20
Satisfaction (0-9) | Overall Reaction to the Software | 8.33 | 8.50 | 3.83 | 8.00
Satisfaction (0-9) | Screen | 8.00 | 8.50 | 3.50 | 8.00
Satisfaction (0-9) | Terminology and System Information | 8.50 | 8.00 | 3.50 | 8.00
Satisfaction (0-9) | Learning | 8.75 | 9.00 | 3.75 | 8.00
Satisfaction (0-9) | System Capabilities | 7.60 | 9.00 | 3.60 | 7.20
REFERENCES
1. Kuo T-T, Rao P, Maehara C, et al. Ensembles of NLP Tools for Data Element Extraction from Clinical Notes. AMIA Annual Symposium, 2016.
2. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association: JAMIA 2010;17(5):507-13 doi:10.1136/jamia.2009.001560
3. Aronson AR, Lang F-M. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association 2010;17(3):229-36
4. Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association 2010;17(1):19-24
5. Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP). http://clamp.uth.edu/ (accessed February 6, 2017).
6. Patterson OV, Jones M, Yao Y, et al. Extraction of Vital Signs from Clinical Notes. Studies in health technology and informatics 2014;216:1035-35
7. Garvin JH, DuVall SL, South BR, et al. Automated extraction of ejection fraction for quality measurement using regular expressions in Unstructured Information Management Architecture (UIMA) for heart failure. Journal of the American Medical Informatics Association: JAMIA 2012;19(5):859-66 doi:10.1136/amiajnl-2011-000535
8. Doan S, Maehara CK, Chaparro JD, et al. Building a Natural Language Processing Tool to Identify Patients With High Clinical Suspicion for Kawasaki Disease from Emergency Department Notes. Academic Emergency Medicine 2016;23(5):628-36
9. South BR, Shen S, Leng J, Forbush TB, DuVall SL, Chapman WW. A prototype tool set to support machine-assisted annotation. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing: Association for Computational Linguistics, 2012:130-39.
10. Byrd RJ, Steinhubl SR, Sun J, Ebadollahi S, Stewart WF. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. International Journal of Medical Informatics 2014;83(12):983-92 doi:10.1016/j.ijmedinf.2012.12.005
11. Chen Y, Lasko TA, Mei Q, Denny JC, Xu H. A study of active learning methods for named entity recognition in clinical text. Journal of biomedical informatics 2015;58:11-18
12. Lingren T, Deleger L, Molnar K, et al. Evaluating the impact of pre-annotation on annotation speed and potential bias: natural language processing gold standard development for clinical named entity recognition in clinical trial announcements. Journal of the American Medical Informatics Association 2013;21(3):406-13 doi:10.1136/amiajnl-2013-001837
13. Ogren PV, Savova GK, Chute CG. Constructing evaluation corpora for automated clinical named entity recognition. Medinfo 2007: Proceedings of the 12th World Congress on Health (Medical) Informatics; Building Sustainable Health Systems: IOS Press, 2007:2325.
14. South BR, Mowery D, Suo Y, et al. Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text. Journal of biomedical informatics 2014;50:162-72
15. Grouin C, Névéol A. De-identification of clinical notes in French: towards a protocol for reference corpus development. Journal of Biomedical Informatics 2014;50:151-61 doi:10.1016/j.jbi.2013.12.014
16. Escudié J-B, Jannot A-S, Zapletal E, et al. Reviewing 741 patients records in two hours with FASTVISU. AMIA, 2015.
17. Duvall SL, Forbush TB, Cornia RC, et al. Reducing the manual burden of medical record review through informatics. Pharmacoepidemiology and Drug Safety 2014;23:415
18. Savova GK, Chapman WW, Zheng J, Crowley RS. Anaphoric relations in the clinical narrative: corpus creation. Journal of the American Medical Informatics Association 2011;18(4):459-65
19. SL D, RC C, TB F, CH H, V. PO. Check it with Chex: A Validation Tool for Iterative NLP Development. AMIA Annual Symposium proceedings / AMIA Symposium, 2014.
20. Chex. http://department-of-veterans-affairs.github.io/Leo/components.html - /Chex (accessed February 6, 2017).
21. Stenetorp P, Pyysalo S, Topic G, Ohta T, Ananiadou S, Tsujii Ji. BRAT: a web-based tool for NLP-assisted text annotation. Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: Association for Computational Linguistics, 2012:102-07.
22. Ohno-Machado L, Agha Z, Bell DS, et al. pSCANNER: patient-centered Scalable National Network for Effectiveness Research. Journal of the American Medical Informatics Association 2014;21(4):621-26
23. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proceedings / AMIA Annual Symposium 2001:17-21
24. McCray AT, Trevvett P, Frost HR. Modeling the autism spectrum disorder phenotype. Neuroinformatics 2014;12(2):291-305
25. Two approaches to integrating phenotype and clinical information. AMIA; 2009.
26. Feupe SF, Lin K, Kuo T-T, et al. Review and Evaluation of the State of Standardization of Computable Phenotype. AMIA Annual Symposium, 2016.
27. Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. Journal of the American Medical Informatics Association 2011;18(5):580-87
28. Esuli A, Marcheggiani D, Sebastiani F. An enhanced CRFs-based system for information extraction from radiology reports. Journal of biomedical informatics 2013;46(3):425-35
29. Chen Q, Li H, Tang B, et al. An automatic system to identify heart disease risk factors in clinical texts over time. Journal of biomedical informatics 2015;58:S158-S63
30. Doan S, Collier N, Xu H, Duy PH, Phuong TM. Recognition of medication information from discharge summaries using ensembles of classifiers. BMC medical informatics and decision making 2012;12(1):1
31. Kang N, Afzal Z, Singh B, Van Mulligen EM, Kors JA. Using an ensemble system to improve concept extraction from clinical records. Journal of biomedical informatics 2012;45(3):423-28
32. Kim Y, Riloff E. A Stacked Ensemble for Medical Concept Extraction from Clinical Notes. AMIA Joint Summits on Translational Science Proceedings 2015;2015
33. MTSamples. http://www.mtsamples.com/ (accessed August 19, 2015).
34. Galitz WO. The essential guide to user interface design: an introduction to GUI design principles and techniques: John Wiley & Sons, 2007.
35. Smith SL, Mosier JN. Guidelines for designing user interface software: Mitre Corporation Bedford, MA, 1986.
36. Holzinger A. Usability engineering methods for software developers. Communications of the ACM 2005;48(1):71-74
37. A javascript library for building user interfaces - React. https://facebook.github.io/react/ (accessed October 17, 2016).
38. Heart Failure Fact Sheet. https://www.cdc.gov/dhdsp/data_statistics/fact_sheets/fs_heart_failure.htm (accessed June 12, 2017).
39. About Kawasaki Disease. https://www.cdc.gov/kawasaki/about.html (accessed June 12, 2017).
40. Uzuner Ö, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 2007;14(5):550-63
41. Uzuner Ö, Goldstein I, Luo Y, Kohane I. Identifying patient smoking status from medical discharge records. Journal of the American Medical Informatics Association 2008;15(1):14-24
42. Uzuner Ö. Recognizing obesity and comorbidities in sparse data. Journal of the American Medical Informatics Association 2009;16(4):561-70
43. Uzuner Ö, Solti I, Xia F, Cadag E. Community annotation experiment for ground truth generation for the i2b2 medication challenge. Journal of the American Medical Informatics Association 2010;17(5):519-23
44. Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. Journal of the American Medical Informatics Association 2010;17(5):514-18
45. Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 2011;18(5):552-56
46. Uzuner O, Bodnari A, Shen S, Forbush T, Pestian J, South BR. Evaluating the state of the art in coreference resolution for electronic medical records. Journal of the American Medical Informatics Association 2012;19(5):786-91
47. Sun W, Rumshisky A, Uzuner O. Evaluating temporal relations in clinical text: 2012 i2b2 Challenge. Journal of the American Medical Informatics Association 2013;20(5):806-13
48. Stubbs A, Uzuner Ö. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. Journal of biomedical informatics 2015;58:S20-S29
49. Stubbs A, Kotfila C, Uzuner Ö. Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1. Journal of biomedical informatics 2015;58:S11-S19
50. Stubbs A, Uzuner Ö. Annotating risk factors for heart disease in clinical narratives for diabetic patients. Journal of biomedical informatics 2015;58:S78-S91
51. Stubbs A, Kotfila C, Xu H, Uzuner Ö. Identifying risk factors for heart disease over time: Overview of 2014 i2b2/UTHealth shared task Track 2. Journal of biomedical informatics 2015;58:S67-S77
52. Johnson AE, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Scientific data 2016;3
53. Suominen H, Salanterä S, Velupillai S, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2013. International Conference of the Cross-Language Evaluation Forum for European Languages: Springer, 2013:212-31.
54. Kelly L, Goeuriot L, Suominen H, et al. Overview of the ShARe/CLEF eHealth evaluation lab 2014. International Conference of the Cross-Language Evaluation Forum for European Languages: Springer, 2014:172-91.
55. Goldberger AL, Amaral LA, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 2000;101(23):e215-e20
56. Stearns MQ, Price C, Spackman KA, Wang AY. SNOMED clinical terms: overview of the development process and project status. Proceedings of the AMIA Symposium: American Medical Informatics Association, 2001:662.
57. McDonald CJ, Huff SM, Suico JG, et al. LOINC, a universal standard for identifying laboratory observations: a 5-year update. Clinical chemistry 2003;49(4):624-33
58. Liu S, Ma W, Moore R, Ganesan V, Nelson S. RxNorm: prescription for electronic drug information exchange. IT professional 2005;7(5):17-23
59. Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic acids research 2004;32(suppl 1):D267-D70
60. Cohen J. Statistical power analysis for the behavioral sciences. Hillsdale, NJ 1988:20-26
61. Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D. The Stanford CoreNLP Natural Language Processing Toolkit. ACL (System Demonstrations), 2014:55-60.
62. Kukreja U, Stevenson WE, Ritter FE. RUI: Recording user input from interfaces under Windows and Mac OS X. Behavior Research Methods 2006;38(4):656-59
63. Morgan JH, Cheng C-Y, Pike C, Ritter FE. A design, tests and considerations for improving keystroke and mouse loggers. Interacting with Computers 2013:iws014
64. Open Broadcaster Software (OBS) Studio. https://obsproject.com/ (accessed June 8, 2016).
65. Davis FD. Perceived usefulness, perceived ease of use, and user acceptance of information technology. MIS quarterly 1989:319-40
66. Chin JP, Diehl VA, Norman KL. Development of an instrument measuring user satisfaction of the human-computer interface. Proceedings of the SIGCHI conference on Human factors in computing systems: ACM, 1988:213-18.
67. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics 2001;34(5):301-10
68. WebMD. http://www.webmd.com/ (accessed August 3, 2015).
69. MedScape. http://www.medscape.com/ (accessed August 3, 2015).
70. RxList. http://www.rxlist.com/ (accessed August 3, 2015).
71. Drugs.com. http://www.drugs.com/ (accessed August 3, 2015).
72. HealthNLP. http://healthnlp.github.io/examples/ (accessed August 19, 2015).