Gregory F. Buchanan2, Ph.D., MehmetKayaalp1, M.S., K. 'Center Buchanan.pdfradiology reports, and discharge summaries). This paper describes a semi-automated technique that is intended

Using Computer Modeling to Help Identify Patient Subgroupsin Clinical Data Repositories

Gregory F. Cooper', M.D., Ph.D., Bruce G. Buchanan2, Ph.D.,Mehmet Kayaalp1, M.D., M.S., Melissa Saul3, John K. Vries4, M.D.

'Center for Biomedical Informatics, University of Pittsburgh2Department of Computer Science, University of Pittsburgh3University of Pittsburgh Medical Center Health System

4University of Pittsburgh School of Medicine

OBJECTIVE: The ability to accurately and efficientlyidentify patient cases of interest in a hospitalinformation system has many important clinical,research, educational and administrative uses. Theidentification of cases of interest sometimes can bedifficult. This paper describes a two-stage method forsearching for cases of interest.DESIGN: First, a Boolean search is performed usingcoded database variables. The user classifies theretrieved cases as being of interest or not. Second,based on the user-classified cases, a computer modelof the patient cases of interest is constructed. Themodel is then used to help locate additional cases.These cases provide an augmented training set forconstructing a new computer model of the cases ofinterest. This cycle of modeling and userclassification continues until halted by the user.MEASUREMENTS: This paper describes a pilotstudy in which this method is used to identify therecords of patients who have venous thrombosis.RESULTS: The results indicate that computermodeling enhances the identification of patient casesof interest.

INTRODUCTION

A hospital information system (HIS) can greatlyfacilitate retrospective studies of patient cases. Withsuch a system, researchers also can identify potentialparticipants in prospective clinical studies. Both usesof a HIS require the identification of patientsubgroups of interest. The more accurately andefficiently such subgroups of patients can beidentified, the better.

The accurate and efficient identification of patientsubgroups of interest extend beyond research studiesto include clinical, educational and administrativeuses. For example, one educational application wouldbe the identification of interesting patient cases foruse in teaching medical students.

If a patient subgroup can be defined by arelatively small and simple combination of coded datafields in a HIS (e.g., ICD-9 codes, laboratory values,and patient demographic variables), then finding therelevant hospital records is relatively straightforward.The task becomes considerably more challenging,however, when a simple Boolean combination of datafields does not readily locate all the patient cases of

interest. In such situations, it may be necessary toperform more complex searches that involve codedinformation and possibly unstructured information(e.g., free-text history and physical examinations,radiology reports, and discharge summaries). Thispaper describes a semi-automated technique that isintended to assist users in performing these morecomplex searches.

BACKGROUND

We are using the MARS (Medical ARchival System)hospital information system at the University ofPittsburgh Medical Center [1], which contains a richstore of both coded and free-text clinical information.This section describes a current method that iscommonly used in identifying patient subgroups inMARS.

Patient subgroups are often identified in MARSas a team effort by MARS staff and users. Wecharacterize here a typical collaboration. The user(e.g., a researcher) first describes the target populationto the staff member, and through a discussion, theyrefine the description to construct an initial MARSquery. Typically the staff and user first identify a setof patient reference cases, which are patient records inMARS that match the study criteria. These casesoften are obtained using a Boolean search of MARS.The reference cases are reviewed by the staff memberto get an idea of the search terms to use in findingadditional relevant records, which we call recordmatches.

After developing a refined search query inconsultation with the user, the staff member performsthe search and reviews the results with the user. Ifthere are too many records to review exhaustively,then a sample of the records is reviewed. Through thisprocess, a set of record matches is assembled. Areview of these records often provides additional hintsabout how to further refine the search query, which isrun, and the cycle repeats.

The search and review cycle terninates when theuser is satisfied with the set of record matches thathave been found. The union of the reference cases andall record matches constitute the set of records thatdefine the patient cases of interest to be used infurther analyses.

1091-8280/98/$5.00 © 1998 AMIA, Inc. 180

There are two labor-intensive aspects to theprocess of constructing a patient subgroup of interest:(a) constructing complex search criteria beyond theinitial Boolean search, and (b) reviewing the retrievedresults of searches to identify record matches andrefine the queries further. We propose a method tofacilitate these two steps for the staff and user team,which for brevity we generically call the user. Thecurrent paper focuses primarily on the first step.

COMPUTER-ASSISTEDIDENTIFICATION OF PATIENT

SUBGROUPS

In this section, we provide an overview of our basicapproach to using computer modeling to help identifypatient subgroups in MARS. Figure 1 provides aschematic sumnary of the approach. We currently arein the process of implementing and integrating all thesteps in this approach.

The reference cases hold important clues to thepatient features that distinguish the patient subgroupof interest from the other patients with records inMARS. We are investigating the use of computermodeling methods to characterize explicitly thosedistinguishing clues. The reference cases are used aspositive instances in a training set. Negativeinstances are either provided directly by the user or arerandomly selected from the entire set of MARSrecords if it is highly unlikely that a random hospitalrecord satisfies the study criteria. The training setcontaining positive and negative instances is used toconstruct a computer model. The variables in themodel can represent both coded and free-text patientdata, although initially we are using only coded data.

The computer model contains a set of patientfeatures (clues) that characterize approximately thepatient subgroup of interest. Conceptually, eachpatient record in MARS is compared to the model todetermine a numeric score that indicates how closelythe record and the model match (in practice, theimplementation can be more efficientcomputationally). The most closely matching recordsare presented to the user, sorted in descending order oftheir scores. As outlined in Figure 1, the user reviewsrecords in the order they appear in the sorted list, andin the process indicates record matches, until the yieldof matches becomes so low that he or she stops thereview. To help maintain confidentiality, theserecords contain no explicit patient or clinicianidentifiers (e.g., name, address, social securitynumber); currently we perform this de-identificationprocess manually, but we plan to investigate methodsfor automating it. We hypothesize that (a) thisselective review of the records will be significantlyfaster than the current more manual process ofreviewing records, and (b) more record matches will

be found using the system than are found usingcurrent methods. The current paper describes apreliminary study that provides a step towardaddressing these hypotheses.

The records that are marked by the user asmatches or non-matches become part of a growingtraining set of patient records. The augmented trainingset is used to learn a new, refined computer model,and the cycle of search and review continues.Ultimately, the user will decide to terminate thesearch and review process. At that point the union ofthe reference cases and all record matches become thestudy set that is exported for use.

RELATED RESEARCH

Conceptually, the basic methodology presented in thecurrent paper is closely related to relevance feedbackresearch in information retrieval [2]. In relevancefeedback, a user performs an initial search andclassifies a sample of the retrieved documentsregarding whether or not they are of interest. Thisfeedback is used to re-weight and possibly expand thesearch terms that are then used in subsequentretrievals. Other related research is being pursued bymachine-learning researchers [3].

The line of research we are pursuing isdistinguished along two basic ways from most of theprior work on relevance feedback. First, we areinterested in using a highly heterogeneous collectionof information in the electronic medical record,including free text of several types (e.g., dictatedhistory and physical examinations, surgical pathologyreports, and discharge summaries) as well as a varietyof types of coded information (e.g., demographicinformation, patient charge codes (see below), andlaboratory results). The current paper focuses onlearning models using patient charge codes; it servesas a baseline study for future extensions that involvemodeling with additional information in the electronicmedical record. Second, we are interested in applyinga wide variety of statistical and machine-learningmethods for modeling. Some of these methods (e.g.,neural networks and Bayesian networks) are able tomodel multivariate interactions among features. Byusing a simple Bayes classification model, the currentpaper serves as a baseline for planned investigationsthat use more sophisticated modeling techniques.

EXPERIMENTAL METHODS

This section describes a pilot study that involves theidentification of those patients in the intensive careunit with a venous thrombosis (VY). The primarygoal of the original study was to eventually locatejust those patients with a deep venous thrombosis.

181

New cases of interest(new record matches) user indic

are in the

cases are added to thetraining set

cases are model istransferred Cases of constructed

interestfor training [

4 user decides to stop _searching for more cases

Finalcases of interest(de-identified)

cates which casessubgroup of interest

model

Possible new cases ofinterest (de-identified)

patientidentifiers are

s removed from

RS case recordss Possible

--00 new casesIof interestI

MARSdatabase

Figure 1. An outline of a method for identifying patient subgroups that uses computer modeling.

We investigated an approach in which initial cases ofVT are obtained by a simple search of coded featuresof the total study population. The question is whetherthere are other cases of VT that computer modelingcan help a user efficiently locate.

The total study population consists of allpatients admitted to two medical ICUs at theUniversity of Pittsburgh Medical Center HealthSystem (UPMC-HS) between January 1, 1993 andDecember 31, 1995. There were a total of 3020patients. Of these, 124 patients were identified ashaving VT, based on being assigned one or more ofthe following ICD-9 codes at discharge: other venousthrombosis (ICD-9 code 453), Budd-Chiari syndrome(453.0), thrombophlebitis migrans (453.1), vena cavasyndrome (453.2), renal vein thrombosis (453.3),venous thrombosis nec (453.8), and venousthrombosis nos (453.9). An ICU specialist reviewedthe MARS patient records and verified that the recordssupport a VT diagnosis.

In the current experiment, we randomly dividedthe 124 VT cases into two parts. A total of 74 caseswere placed in a training dataset. For the purpose ofour study, we imagine that this dataset contains allthe cases found by a Boolean search of the ICD-9codes; suppose, for example, that the ICD-9 coding ofrecords had been less complete, and only 74 caseswere coded as VT. We use the remaining 50 cases as atest dataset. We investigate to what extent computermodeling can help efficiently locate some or all ofthose 50 cases without using ICD-9 codes; so, forexample, if the ICD-9 coding had been incomplete,we are investigating the extent to which the uncoded50 cases could be located by means other than theICD-9 codes.

Of the 3020 ICU patient cases, we randomlyselected 1802 cases for use in preliminaryexperiments that investigated model construction

methods. To avoid testing bias, in performing thefinal experiment reported here, we used only theremaining 1218 cases (3020 - 1802). These 1218cases contained 50 VT cases, as mentionedpreviously; none of these 50 cases appeared in the1802 cases used in preliminary experiments. Thefraction of VT cases in the entire set of 3020 isapproximately 4 percent. Likewise, the fraction of VTcases in the set of 1218 cases is approximately 4percent; thus, our evaluation dataset of 1218 casesprovides an unbiased sample of the original patientpopulation of 3020 cases.

We used patient charge codes as the features(variables) with which to construct computer modelsof VT cases. A charge code is a single item charged toa patient during his or her hospitalization. Forexample, the code 33001371 is used to represent acharge for "warfarin sodium S mg tablet". There areapproximately 45,000 distinct items that can becharged at UPMC-HS. Among the 3020 cases in ourstudy database, only 7035 distinct charge codesappeared. Since charge codes can be very specific, andthus lead to small sample sizes, we augmented these7035 charge code features with 1081 abstract features.In the current pilot study, an abstract feature wascreated for each unique first word within the chargecode descriptions; as an example, the abstract chargecode heparin corresponds to the disjunction of allcharge codes whose descriptions begin with the word"heparin". For the current study, we choose to usecharge codes because (a) they represent many of theactions taken in caring for patients, and thus, are arich source of clinical information, and (b) they aresufficiently voluminous and complex in theirorganization that using them in a structured Booleansearch would be difficult for most users. Werepresented a given charge code as a binary variable.The value of a charge code is set to true if the given

182

Initialcases ofinterest(thereferencecases)

charge was made at any time during a patient'shospitalization; otherwise, the value was set to false.

We used a simple Bayes system as a statisticalmodel, because it is simple to code, efficient to learnand apply, and it often performs well in practice. Asimple Bayes system assumes that features (e.g.,charge codes) are independent conditioned on adiagnosis (e.g., VT). Although this assumptionusually is not valid, simple Bayes models have beenshown to be robust to violations of the assumptionwhen the models are used for classification [4]. Thatis not to say that improvements cannot be made byusing more sophisticated models, which we intend todo in future research.

Suppose that d represents a diagnosis orprediction (e.g., VT). Let d' denote that d is bound tosome particular value (e.g., VT= present). Let fidenote a feature (e.g., a charge for a 5 mg warfarintablet) that is bound to some particular value (e.g.,either bound to the value true orfalse). Let F denote aset of n features, each of which is bound to somevalue. A simple Bayes model computes theprobability of d' given F as follows:

nP(d')HP(fi d')

P(d' F) = i=l (1)n

(P(d) P(fi d))d i=l

where n is the number of features in the model andthe sum is taken over all the values of d (e.g., d =present and d = absent). For any specific values offand d, we estimated the term P (Jf d) as(freq(fi, d) + 1)/(freq(d) + 2), wherefreq(f, d) is thenumber of times that f. and d co-occur in the trainingdatabase. Similarly, we estimated P(d) as(freq(d) + 1)/(N + 2), where N is the total numberof cases in the training database. When the samplesize is small, these estimates are less extreme than aremaximum likelihood estimates based on the simpleratios freq(fi,d)/freq(d) and freq(d)/N.

When there are a large number of features, simpleBayes models often have better predictive performanceif feature selection is applied. Thus, rather than usemany thousands of charge codes in Equation 1, asearch method is applied to select a relatively smallnumber of features that are highly predictive of thediagnosis. We used a simple feature selection method:in Equation 1 we only used those features with thevalue true in at least one case in the training set.

Initially (cycle 0) we started with a training setcontaining 74 cases of patients known to have hadVT and 74 cases known not to have had VT. Theseconstitute the reference cases denoted in Figure 1.Following the figure, a simple Bayes model was

constructed from those 148 cases. This model wasapplied to the 1218 cases in the test set and aprobability of VT was computed for each case usingEquation 1. The probabilities were sorted indescending order. We assumed that a user would bewilling to review the 40 most probable VT cases andclassify each case as VT or not. In actual practice,such a review would involve the user examining theMARS record of each of the 40 patients to assesswhether the patient had experienced VT. Such anexamination could involve looking at structuredinformation (e.g., laboratory values) or free text (e.g.,discharge summaries). In our experiments wesimulated the assessments that a user would give. Wewere able to do so because we assume that among the3020 cases the only ones with VT were those 124cases described above.

We repeated this cycle of 40 patients for 10cycles. We stopped after 10 cycles because it seemsunlikely that a user would be willing to review morethan 400 cases.

RESULTS

Figure 2 shows the key results. The x axis representsthe number of cycles taken in Figure 1. In a singlecycle, the 40 most probable VT cases (outside thetraining set) are presented to the user forclassification. The y axis shows the total number ofVT cases found, which is equal to the 74 referencecases plus the new record matches as verified by theuser.

Figure 2. A plot of the total number of VTcases located as a function of the modeling cycle.

The upper line in Figure 2 is the maximumperformance that could be achieved; since only 40cases are selected in each cycle, the maximum numberof cases that can be identified just after the first cycleis 40 + 74 = 114. The line containing triangles

183

1249 . . . ....

104 -- -le -

'94 I*/ @*-*e**

.89

84 A Bw us* @ *.se

740 1 2 3 4 5 6 7 8 9 1[0

cycl

shows the performance of applying the methoddescribed in the previous section. The lower line isthe performance that would be expected whenrandomly selecting 40 patient cases (from outside thetraining set) during each cycle; this performancecorresponds to randomly searching through the ICUpatients in search of VT cases beyond the original 74.

DISCUSSION

The results in Figure 2 indicate that computermodeling is helping locate VT cases that are notincluded in the original set of 74 reference cases. Inthe first cycle, 7 new VT cases are included amongthe 40 cases presented to the user. Based on VThaving a prior probability of 4 percent in this dataset,we would expect to find only 1 or 2 VT cases in arandom sample of 40 cases. The performance ofcomputer modeling continues at about the same rate,until the fourth cycle. A plausible interpretation isthat as more and more VT cases are discovered, theremaining cases become relatively fewer and aretherefore more difficult to locate; thus, the slopebegins to flatten at about the fourth cycle. Also, themost apparent VT cases are likely to be located in theearly cycles, leaving the more difficult cases for latercycles. Another factor is that we presently are usingonly charge codes as model features. We intend toexpand the features modeled to include demographicinformation, ICD-9 codes, and DRG codes; we thenplan to investigate the extent to which theseadditional features change the system's performance.

In the present study, we have assumed that all2896 (i.e., 3020 - 124) cases that are not known to beVT are in fact not VT. As a check on thatassumption, we plan to have an ICU specialistexamine cases of VT that are predicted by the systemand that are not among the 124 cases alreadyestablished to be VT. We are interested in the fractionof those cases that the specialist designates as beingVT cases. The larger the fraction the better. As asupplementary check, we also plan to select afeasible-sized random sample of the 2896 cases andpresent these to the specialist for evaluation. Fromthe fraction designated by the specialist as VT caseswe can estimate the total number of VT cases withinthe entire dataset of 3020 cases.

In the near term, we also plan to expand ourinvestigation to the identification of other patientsubgroups. In the longer term, our plans include therepresentation of features that are extracted from freetext in the medical record (e.g., the dischargesummary). Our emphasis in representing free textwill be to explore relatively simple methods that willbe computationally tractable on a database the size ofMARS. For example, one basic approach is to mapfree-text phrases into UMLS Metathesaurus conceptsusing available mapping tools. Each identifiedconcept represents one variable that characterizes a

part of the free-text report. We also will explore awide variety of statistical and machine-learningmodels, including neural networks, Bayesiannetworks, and rule-based systems. We plan to providea web interface that will allow a user to perform aninitial set of Boolean queries on coded features, thenextend the search for patient cases using computermodeling. The interface will permit the user toexamine and directly alter the model, if desired. Anaudit trail will be maintained of all user decisionsregarding classification of cases in order to documentfor each case the user's justification for it being ofinterest or not.

When the overall search system is sufficientlymature, we plan to perform a randomized, controlexperiment with a wide variety of users and searchtasks in order to compare the computer modelingapproach to the current approach. The primary metricswe anticipate using are the total patient cases ofinterest that are found using each approach (casecapture), the total amount of user and staff timeexpended in obtaining those cases, and the users'assessments of the utility of the case capture giventhe total expended time.

Acknowledgments

We thank Dr. Michael Donahoe for his help indefining the dataset that we used in the pilot studyreported here. We thank Dr. Charles Friedman forhelpful comments on the design and development ofthis project. This research was supported in part byIAIMS grant I-G08-LM06625 from the NationalLibrary of Medicine to the University of Pittsburgh.

References

1. Yount RJ, Vries JK, Councill CD. The MedicalArchival System: An information retrieval systembased on distributed parallel processing.Information Processing Management 1991;27:379.

2. Hannan D. Relevance feedback and other querymodification techniques. In: Frakes WB, Baeza-Yates R, eds. Information Retrieval: DataStructures and Algorithms. Englewood Cliffs, NJ:Prentice Hall, 1992: 241-263.

3. Nigam K, McCallum A, Thrum S, Mitchell T.Learing to classify text from labeled and unlabeleddocuments. National Conference on ArtificialIntelligence (AAAI), 1998.

4. Domingos P. Pazzani M. On the optimality of thesimple Bayesian classifier under zero-one loss.Machine Learning 1997;29:103-130.

184

Documents

Gregory F. Buchanan2, Ph.D., MehmetKayaalp1, M.S., K. 'Center Buchanan.pdfradiology reports, and discharge summaries). This paper describes a semi-automated technique that is intended