Artificial Intelligence in Medicine 57 (2013) 111–117
Contents lists available at SciVerse ScienceDirect. Journal homepage: www.elsevier.com/locate/aiim

Assessing and combining repeated prognosis of physicians and temporal models in the intensive care

Lilian Minne a,*, Tudor Toma a, Evert de Jonge b, Ameen Abu-Hanna a

a Department of Medical Informatics, Academic Medical Center, PO Box 22660, 1100 DD Amsterdam, The Netherlands
b Department of Intensive Care, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands

* Corresponding author. Tel.: +31 205666893; fax: +31 206919840. E-mail addresses: [email protected] (L. Minne), [email protected] (T. Toma), E.de [email protected] (E. de Jonge), [email protected] (A. Abu-Hanna).

Article history: Received 9 October 2011; received in revised form 12 July 2012; accepted 26 August 2012.

Keywords: Logistic regression models; Temporal patterns; Predictive performance; Human intuition; Intensive care

Abstract

Objective: Recently, we devised a method to develop prognostic models that incorporate patterns of sequential organ failure to predict the eventual hospital mortality at each day of intensive care unit (ICU) stay. In this study, we investigate in a real-world setting how these models perform compared to physicians, who are exposed to more information than the models.

Methods: We developed prognostic models for days 2–7 of ICU stay by data-driven discovery of patterns of sequential qualitative organ failure (SOFA) scores and embedding the patterns as binary variables in three types of logistic regression models. Type A models include the severity of illness score at admission (SAPS-II) and the SOFA patterns. Type B models add to these covariates the mean, max and delta (increments) of SOFA scores. Type C models include, in addition, the mean, max and delta of expert opinion (i.e. the physicians' predictions of mortality).

Results: Physicians had a statistically significantly better discriminative ability than the models without subjective information (AUC range over days: 0.78–0.79 vs. 0.71–0.74) and comparable accuracy (Brier score range: 0.15–0.18 vs. 0.16–0.18). However, when we combined both sources of predictions in Type C models, we arrived at significantly superior discrimination as well as accuracy compared with the objective and subjective models alone (AUC range: 0.80–0.83; Brier score range: 0.13–0.16).

Conclusion: The models and the physicians draw on complementary information that is best harnessed by combining both prediction sources. Extensive external validation and impact studies are imperative to further investigate the ability of the combined model.

© 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.artmed.2012.08.005

1. Introduction

In the intensive care unit (ICU), physicians are daily challenged with the complex task of prognosis. Prognosis can be defined as the prediction of (the probability of) an event, such as death, before its possible occurrence [1,2]. Patient preferences to undergo intensive life-sustaining treatments highly depend on their prognosis. In fact, patients often prefer palliative care aiming at comfort and relief of pain if their chances of survival are very low [3,4], and clinicians' perceptions of survival chances strongly influence provision of life support [5]. Survival is usually defined as hospital survival, which is the probability that a particular patient will be discharged alive from the ICU or hospital ward after an ICU admission. This means that a patient who was discharged alive from the ICU to a hospital ward but died in the ward will be considered a non-survivor.

Predictions of survival can be obtained either subjectively (e.g. expert opinion) or objectively (e.g. mathematical models). The main advantage of an objective approach is that mathematical models are extremely consistent and able to optimally combine information into a global judgment [6]. Commonly used mathematical models for predicting mortality in the ICU are the acute physiology and chronic health evaluation (APACHE) I–IV models [7], the simplified acute physiology score (SAPS) I–III models [8], and the mortality prediction models (MPM) I–II [9]. These models are intended to provide the expected mortality risk for a given patient by adjusting for the patient's severity of illness at admission to the ICU. They are mainly used in comparative audits among ICUs: the observed mortality in an ICU is compared to the expected mortality in that ICU. In comparison to objective predictions, subjective predictions may be inexact or even inconsistent [10,11]. There can be substantial disagreement between different clinicians, even if both are very experienced [12], and prognosis is the part of medical practice they feel most insecure about [13]. On the other hand, clinicians are able to incorporate important implicit knowledge outside the scope of mathematical models and react quickly to changing situations [6].

Assessing and combining repeated prognosis of physicians and temporal models in the intensive care

  • Upload
    ameen

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Assessing and combining repeated prognosis of physicians and temporal models in the intensive care

Ai

La

b

a

ARRA

KLTPHI

1

wpisIrcvsdsIn

E

0h

Artificial Intelligence in Medicine 57 (2013) 111–117

Contents lists available at SciVerse ScienceDirect

Artificial Intelligence in Medicine

journa l homepage: www.e lsev ier .com/ locate /a i im

ssessing and combining repeated prognosis of physicians and temporal modelsn the intensive care

ilian Minnea,∗, Tudor Tomaa, Evert de Jongeb, Ameen Abu-Hannaa

Department of Medical Informatics, Academic Medical Center, PO Box 22660, 1100 DD Amsterdam, The NetherlandsDepartment of Intensive Care, Leiden University Medical Center, PO Box 9600, 2300 RC Leiden, The Netherlands

r t i c l e i n f o

rticle history:eceived 9 October 2011eceived in revised form 12 July 2012ccepted 26 August 2012

eywords:ogistic regression modelsemporal patternsredictive performanceuman intuition

ntensive care

a b s t r a c t

Objective: Recently, we devised a method to develop prognostic models incorporating patterns of sequen-tial organ failure to predict the eventual hospital mortality at each day of intensive care unit (ICU) stay. Inthis study, we investigate using a real world setting how these models perform compared to physicians,who are exposed to additional information than the models.Methods: We developed prognostic models for days 2–7 of ICU stay by data-driven discovery of patterns ofsequential qualitative organ failure (SOFA) scores and embedding the patterns as binary variables in threetypes of logistic regression models. Type A models include the severity of illness score at admission (SAPS-II) and the SOFA patterns. Type B models add to these covariates the mean, max and delta (increments)of SOFA scores. Type C models include, in addition, the mean, max and delta in expert opinion (i.e. thephysicians’ prediction of mortality).Results: Physicians had a statistically significantly better discriminative ability compared to the modelswithout subjective information (AUC range over days: 0.78–0.79 vs. 0.71–0.74) and comparable accuracy

(Brier score range: 0.15–0.18 vs. 0.16–0.18). However when we combined both sources of predictions, inType C models, we arrived at a significantly superior discrimination as well as accuracy than the objectiveand subjective models alone (AUC range: 0.80–0.83; Brier score range: 0.13–0.16).Conclusion: The models and the physicians draw on complementary information that can be best har-nessed by combining both prediction sources. Extensive external validation and impact studies areimperative to further investigate the ability of the combined model.

. Introduction

In the intensive care unit (ICU), physicians are daily challengedith the complex task of prognosis. Prognosis can be defined as therediction of (the probability of) an event, such as death, before

ts possible occurrence [1,2]. Patient preferences to undergo inten-ive life sustaining treatments highly depend on their prognosis.n fact, they often prefer palliative care aiming at comfort andelief of pain if their chances of survival are very low [3,4] andlinicians’ perceptions of survival chances strongly influence pro-ision of life support [5]. Survival is usually defined as hospitalurvival, which is the probability that a particular patient will beischarged alive from the ICU or hospital ward after an ICU admis-

ion. This means that a patient that was discharged alive from theCU to a hospital ward but died in the ward will be considered as aon-survivor.

∗ Corresponding author. Tel.: +31 205666893; fax: +31 206919840.E-mail addresses: [email protected] (L. Minne), [email protected] (T. Toma),

.de [email protected] (E. de Jonge), [email protected] (A. Abu-Hanna).

933-3657/$ – see front matter © 2012 Elsevier B.V. All rights reserved.ttp://dx.doi.org/10.1016/j.artmed.2012.08.005

© 2012 Elsevier B.V. All rights reserved.

Predictions of survival can be obtained either subjectively (e.g.expert opinion) or objectively (e.g. mathematical models). Themain advantage of an objective approach is that mathematicalmodels are extremely consistent and able to optimally combineinformation into a global judgment [6]. Commonly used mathe-matical models for predicting mortality in the ICU are the acutephysiology and chronic health evaluation (APACHE) I–IV [7] model,the simplified acute physiology assessment (SAPS) I–III [8] models,and the mortality prediction model (MPM) I–II [9]. These models areintended to provide the expected mortality risk for a given patientby adjusting for the patient’s severity of illness at admission to theICU. These models are mainly used in a comparative audit amongICUs: the observed mortality in an ICU is compared to the expectedmortality in that ICU. In comparison to objective predictions, subjec-tive predictions may be inexact or even inconsistent [10,11]. Therecan be substantial disagreement between different clinicians, evenif they are both very experienced [12], and prognosis is the part of

medical practice they feel most insecure about [13]. On the otherhand, clinicians are able to incorporate important implicit knowl-edge outside the scope of mathematical models and react quicklyto changing situations [6].
Page 2: Assessing and combining repeated prognosis of physicians and temporal models in the intensive care

Several mathematical models have been compared to clinicians (i.e. expert opinion). In general, clinicians seem to have good discriminative ability [12,14–21,11,22,23] that is superior to objective models [24]. Objective models, however, tend to have better calibration [16,17]. Capitalizing on the individual strengths of both subjective and objective predictions by using a combined approach seems to yield superior discrimination and calibration over either one alone [16,18,11,22,25,26]. However, the problem of comparing repeated predictions for the same patient over time between the two approaches has not been studied before, and in fact there is a paucity of work on repeated predictions in either approach (subjective or objective).

Repeated predictions are important to better individualize the prognosis for a patient as more information becomes available over time. For example, the probability of survival of patients suffering major complications during their ICU stay will decrease. The current admission-based severity-of-illness ICU models are less suitable for individual prognosis because they are based only on physiologic data from the first 24 h after ICU admission. We recently suggested a method for developing "temporal" prognostic models for providing hospital survival predictions at each day of ICU stay [27]. Specifically, we showed that the use of specific patterns of sequential data (e.g. the transition from normal renal function to renal failure) leads to models that can better discriminate between survivors and non-survivors and are more accurate than the current "static" models [28,29].

The aim of this study is to investigate and compare daily prediction in adult ICU patients based on subjective and objective prognostic information. In particular we investigate the following sources of predictions: (1) expert opinion (daily survival estimates by physicians) and (2) three types of temporal models including different levels of objective and subjective information.

2. Methods

2.1. Data collection

We designed a prospective comparative study in which we included all consecutive patients admitted between October 2007 and October 2008 to a 28-bed multidisciplinary mixed adult ICU of a 1002-bed university hospital in Amsterdam. We extracted demographics, patient outcomes and all data necessary to calculate the severity of illness score in the first 24 h (SAPS-II, which ranges between 0 and 163) and the daily organ failure scores (SOFA scores, ranging between 0 and 24) from the Dutch national intensive care evaluation (NICE) registry [30]. The value of a SOFA score is the summation of 6 individual organ system failure scores, each ranging from 0 to 4 and indicating the degree of organ failure of that particular system (where 0 indicates normal organ functioning and 4 indicates complete failure). Every 24 h, physicians' estimates of the probability of survival up to hospital discharge were elicited and recorded in a software module that we developed and integrated into the patient data management system (PDMS). A PDMS is a computerized system that automatically collects and stores vital parameters from the patient monitor and provides digital patient charts [31]. After completion of routine data collection, a question regarding the patient's probability of survival popped up automatically, in which physicians could choose between 10 probability categories (0–10%, 10–20%, 20–30%, 30–40%, 40–50%, 50–60%, 60–70%, 70–80%, 80–90% and 90–100%) or state that they did not have any clue. Categories were indicated by red (low survival probabilities) and green (high survival probabilities) color scales. The scoring moment was typically between 9 and 12 am, but estimates could be changed until 12 pm.

2.2. Physicians

At the time we started this study, our ICU employed 10 critical care attending physicians, 8 fellows and approximately 14 residents, all of whom participated in the study. Critical care attending physicians are specialists (e.g. neurologists or cardiologists) who have completed an additional two-year intensive care specialization. Fellows and residents have completed at least two years of post-MD training. Fellows are in training for the intensive care specialization and residents are in training to become anesthesiologists. Physicians were unaware of both their colleagues' assessments and the predictions given by the objective models. They were not trained to estimate survival probabilities and did not receive feedback. Physicians were notified of this study by email and by an announcement during their staff meeting.

2.3. Development of mathematical models

A model was developed for each day of ICU stay from day 2 to day 7 by the procedure described in [28]. This procedure involves two steps: (1) data-driven discovery of temporal patterns of sequential organ failure scores and (2) embedding them in the familiar logistic regression model. As in the original work in [28], SOFA scores were categorized as low (L) if SOFA ∈ {0, ..., 6}, medium (M) if SOFA ∈ {7, 8} or high (H) if SOFA ∈ {9, ..., 24}, and patterns were aligned to the day of prediction. For example, the pattern M,H on day 4 (the day of prediction) means a high SOFA score on day 4 preceded by a medium score on day 3. In our systematic review on SOFA-based models [32], we found two promising strategies leading to well-performing models (not necessarily for repeated predictions): the first was our approach based on temporal patterns (described above) and the second was the combination of SOFA abstractions with admission scores (e.g. SAPS-II). Commonly used abstractions of SOFA are its mean, maximum (both from the day of admission d0 to the day of prediction di), and delta (difference in scores between day di and d0). In this study, we developed logistic regression models containing three levels of information: Type A models, which use SAPS-II and SOFA patterns; Type B models, which in addition use mean, max and delta SOFA scores; and Type C models, which in addition to the variables in Type B models also use the mean, max and delta of expert predictions (by translating the survival predictions, e.g. "60–70%", into a mortality probability, e.g. 0.35). The patterns were then considered candidates for inclusion as binary covariates (0: the patient does not exhibit the pattern, and 1: the patient exhibits the pattern) in the temporal logistic regression models. The Akaike information criterion (AIC) [33] was used to select the optimal subset of covariates yielding the best model (i.e. the one with the lowest possible AIC). The AIC is defined as −2 ln(L) + 2k, where L is the maximized likelihood of the model and k is the number of free parameters in the model. The AIC trades off predictive performance for parsimony by penalizing for the number of variables included in the model in order to avoid overfitting [28]. The variable selection procedure iteratively eliminates the least predictive covariate (i.e. the one associated with the highest AIC) from a set of considered covariates until the AIC cannot be decreased any further. To develop a Type A model, we first included only patterns of length 1. Only those that survived the AIC selection procedure were included together with patterns of length 2, etc. To prevent collinearity, we did not allow logically entailed patterns to be selected in the model: we sought the model with the lowest AIC that had a subset of variables without logical entailment. For example, consider the three patterns L,H; M,H; and H. The presence of either L,H or M,H entails H. In order to avoid entailment among variables we compare (using the AIC) a model including L,H and M,H without H, with a model including H alone. For the Type B models, we selected the model with the lowest AIC from the following four models: Type A,
Type A + mean SOFA, Type A + max SOFA, Type A + delta SOFA. Then we did the same for the Type C models with the Type B models as a starting point. The data extraction and model development process is shown in Fig. 1. Finally, for comparison reasons, we developed classification trees based on Type C information for each day of prediction. We first completely overgrew the tree to overfit the data and then pruned back the overgrown tree to its optimal size based on its minimum 10-fold cross-validation error.

Fig. 1. Data extraction and model development, including an example with hypothetical patient data for a model at day 3.
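The qualitative abstraction and pattern alignment described in Section 2.3 can be sketched as follows. The L/M/H cut-offs come from the text; the function names and the example patient are our own illustration, not the authors' code.

```python
# Sketch of the qualitative SOFA abstraction and pattern alignment.
# Cut-offs follow the paper: L for 0-6, M for 7-8, H for 9-24.

def categorize(sofa):
    """Map a raw SOFA score (0-24) to a qualitative category."""
    if sofa <= 6:
        return "L"
    elif sofa <= 8:
        return "M"
    return "H"

def aligned_pattern(daily_sofa, day, length):
    """Qualitative pattern of `length` days ending at `day` (1-based).

    For prediction day 4 and length 2 this yields the categories of
    days 3 and 4, matching patterns such as 'M,H' in the text.
    """
    window = daily_sofa[day - length:day]
    return ",".join(categorize(s) for s in window)

def has_pattern(daily_sofa, day, pattern):
    """Binary covariate: 1 if the aligned pattern matches, else 0."""
    length = pattern.count(",") + 1
    return int(aligned_pattern(daily_sofa, day, length) == pattern)

# Hypothetical patient: SOFA scores for days 1-4.
scores = [5, 7, 8, 10]
print(aligned_pattern(scores, 4, 2))      # M,H
print(has_pattern(scores, 4, "M,H"))      # 1
```

Binary covariates produced this way are what the AIC-driven selection below chooses among.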

2.4. Statistical analysis

Discrimination (i.e. the ability of the model to distinguish between survivors and non-survivors) was measured by the area under the receiver operating characteristic curve (AUC). An AUC ranges between 0 and 1, with higher values indicating better discriminative ability. Although the AUC depends on the prevalence of the event in the sample, it is common to consider an AUC between 0.6 and 0.7 as poor, between 0.7 and 0.8 as fair, between 0.8 and 0.9 as good, and above 0.9 as excellent. The Brier score is a measure of accuracy, which has aspects of both discrimination and calibration (i.e. the degree of similarity between predicted risks and patients' actual risks of death). Note that, according to the terminology of [34], the term accuracy in this paper refers to the Brier score and not to the percentage of correctly classified cases, as it is sometimes used in machine learning. The formula of the Brier score is:

BS = (1/N) ∑_{i=1}^{N} (p_i − o_i)²

where N is the number of observations, p_i the predicted probability and o_i the observed outcome (0: survival, 1: non-survival). The larger the Brier score, the worse the accuracy. Parameter estimates and confidence intervals were obtained by calculating the bootstrap sampling distributions of the respective statistics based on 1000 bootstrap samples. Records with missing values of SAPS-II were excluded from the analysis, while missing physicians' predictions were imputed by taking the value of the previous day or, when not available, the mean of all predictions available for the patient involved. Analyses were conducted in the R statistical environment.
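The two performance measures can be implemented in a few lines; this is a generic sketch (using the rank-sum formulation of the AUC), not the authors' R code.

```python
# Minimal implementations of the Brier score and the AUC.

def brier_score(p, o):
    """BS = (1/N) * sum_i (p_i - o_i)^2, with outcomes o_i in {0, 1}."""
    return sum((pi - oi) ** 2 for pi, oi in zip(p, o)) / len(p)

def auc(p, o):
    """Probability that a randomly chosen non-survivor (o=1) receives a
    higher predicted risk than a randomly chosen survivor (o=0);
    ties count as 1/2 (the Mann-Whitney formulation)."""
    pos = [pi for pi, oi in zip(p, o) if oi == 1]
    neg = [pi for pi, oi in zip(p, o) if oi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy example: two non-survivors ranked above two survivors.
preds = [0.9, 0.2, 0.7, 0.1]
outcomes = [1, 0, 1, 0]
print(round(brier_score(preds, outcomes), 4))  # 0.0375
print(auc(preds, outcomes))                    # 1.0
```

Note how a perfectly discriminating model (AUC 1.0) can still have a non-zero Brier score, since the latter also penalizes miscalibrated probabilities.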

3. Results

3.1. Patient characteristics

The baseline characteristics of the 912 patients admitted during the study period are shown in Table 1. In total, 3955 physicians' predictions of survival were registered, and 623 values (13.6%) were missing. Physicians indicated in 77 cases (1.5%) that they had no idea regarding a patient's chance of survival. There were no missing values in SOFA scores and only two missing values in SAPS-II (0.2%). Hospital mortality was 19.2%.

3.2. Distributions of SOFA scores and expert opinion

Fig. 2 shows histograms of expert opinion (i.e. the physicians' predictions of mortality) and SOFA scores. Mean ± standard deviation of the SOFA score was 7.87 ± 1.85. Mean predicted mortality by physicians was 37.6% (1.96 times the actual mortality). The mean standard deviation of expert opinion per patient (over the days) was 13.3%. To assess whether expert opinion could reliably identify high-risk patients, all patients with predicted mortality above 90% were selected, yielding 135 patients. Of these, 74.8% actually died (this is the positive predictive value at 0.9).
Table 1
Patient characteristics. IQR: inter-quartile range, LOS: length of stay, SAPS: simplified acute physiology score, SD: standard deviation, and SOFA: sequential organ failure assessment.

                                  Survivors (n = 737)   Non-survivors (n = 175)   Total population
Admission type (%)
  Medical                         34.9                  65.2                      40.7
  Urgent                          16.7                  25.1                      18.3
  Planned                         48.4                  9.7                       41.0
Mean age (SD) (years)             60.3 (16.5)           64.0 (15.4)               61.0 (16.4)
Male/female (%)                   64.3/35.7             56.6/43.4                 62.8/37.2
Median LOS [IQR] (days)           2 [0–5]               3 [1–9]                   2 [1–6]
Mean SAPS-II (SD)                 44.0 (15.3)           57.2 (18.8)               46.5 (16.8)
Mean APACHE-II (SD)               18.2 (6.7)            24.7 (7.4)                19.5 (7.3)
Mean SOFA (SD)                    7.0 (3.7)             10.1 (4.1)                7.9 (4.0)
Median # organ failure [IQR]      5 [3–14]              12 [5–32]                 6 [3–17]
Mean expert opinion (SD) (%)      29.5 (24.2)           57.7 (27.7)               37.6 (28.2)

Fig. 2. Distribution of expert opinion and SOFA scores.

3.3. Mathematical models

The logits (log odds) of the best temporal models for days 2–7 are summarized in Table 2. For example, the best temporal model of Type A on day 4 is described by the following logistic regression model:

P(death) = e^(−2.06 + 0.022·SAPSII + 0.47·H − 0.95·L,L) / (1 + e^(−2.06 + 0.022·SAPSII + 0.47·H − 0.95·L,L))

where H means a high SOFA score on day 4 (the day of prediction), and L,L means low SOFA scores on days 3 and 4. Consider for example the hypothetical patient from Fig. 1, who has a SAPS-II score of 25. If this patient had a medium SOFA score on day 4, the probability of dying according to this model would be:

P(death) = e^(−2.06 + 0.022·25 + 0.47·0 − 0.95·0) / (1 + e^(−2.06 + 0.022·25 + 0.47·0 − 0.95·0)) = 18.1%

as both patterns are absent. If the SOFA pattern were L,L, however, the probability of dying would be reduced to:

P(death) = e^(−2.06 + 0.022·25 + 0.47·0 − 0.95·1) / (1 + e^(−2.06 + 0.022·25 + 0.47·0 − 0.95·1)) = 7.9%
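The worked example can be reproduced directly. The coefficients are those of the day-4 Type A model reported in the text; the function name and patient are illustrative.

```python
import math

# Day-4 Type A model: logit = -2.06 + 0.022*SAPSII + 0.47*H - 0.95*L,L
# where h and ll are the binary pattern covariates H and L,L.

def day4_type_a(sapsii, h, ll):
    """P(death) for the day-4 Type A model."""
    logit = -2.06 + 0.022 * sapsii + 0.47 * h - 0.95 * ll
    return 1 / (1 + math.exp(-logit))

# SAPS-II of 25, medium SOFA on day 4: both patterns absent.
print(round(day4_type_a(25, 0, 0), 3))  # 0.181 -> the 18.1% in the text
# Same patient with the protective pattern L,L present.
print(round(day4_type_a(25, 0, 1), 3))  # 0.079 -> the 7.9% in the text
```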

Note that Table 2 shows no Type B models after day 2, as Type A models had a lower AIC than all of the following models: Type A + mean SOFA, Type A + max SOFA and Type A + delta SOFA. This means that, after day 2, mean, max and delta SOFA had no significant added prognostic value over the SOFA patterns.
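This kind of AIC comparison can be illustrated with the definition from Section 2.3; the log-likelihood values below are made up, since the fitted likelihoods are not reported.

```python
# AIC = -2 ln(L) + 2k, as defined in Section 2.3.

def aic(log_likelihood, k):
    """Akaike information criterion for a model with k free parameters."""
    return -2 * log_likelihood + 2 * k

# Hypothetical fits: adding mean SOFA barely improves the likelihood,
# so the penalty for the extra parameter makes the AIC worse and the
# smaller model is kept.
type_a = aic(log_likelihood=-250.0, k=3)            # 506.0
type_a_plus_mean = aic(log_likelihood=-249.6, k=4)  # 507.2
print(type_a < type_a_plus_mean)  # True
```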

Fig. 3 shows the classification tree that was fit for day 4. Expert opinion abstractions were always selected as the root of the classification tree (mean on days 2 and 5–7, and maximum on days 3 and 4). Mean SOFA was included as one of the variables in three of the six tree-models, while SAPS-II (on day 2), the pattern L,L (on day 4) and delta expert opinion (on day 6) were each included in one of the six tree-models. In our classification tree presentation, moving to the right branch means an increased probability of dying compared to the left branch. For example, the tree of day 4 shows that a max expert opinion of at least 80% and a mean SOFA score of at least 6.375 increase the probability of dying, while the pattern L,L has a protective effect and decreases this probability.
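The grow-then-prune procedure described in Section 2.3 can be sketched with scikit-learn's cost-complexity pruning. This is only an illustration on synthetic data under our own assumptions (the original analysis was done in R, and the synthetic labels are invented).

```python
import random
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data standing in for the ICU dataset.
random.seed(0)
X = [[random.random(), random.random()] for _ in range(80)]
y = [int(x[0] + 0.3 * x[1] + random.gauss(0, 0.2) > 0.7) for x in X]

# 1) Overgrow the tree and enumerate candidate pruning levels.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# 2) Pick the pruning strength (alpha) with the best 10-fold CV accuracy,
#    mirroring the minimum cross-validation error criterion in the text.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=10
    ).mean(),
)

# 3) Refit the pruned tree at the selected alpha.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha >= 0.0)  # True
```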

3.4. Validation

Predictive performance of the temporal models and expert opinion is summarized in Fig. 4. The AUC of expert opinion ranged from 0.78 (on day 7) to 0.79 (on days 2–6), while the Brier score ranged from 0.15 (day 2) to 0.18 (days 5–7). Type A/B models were comparable to physicians in terms of the Brier score (range: 0.16–0.18), but significantly worse in terms of the AUC (range: 0.71–0.74). Type C models had higher (better) AUCs (range: 0.80–0.83) than physicians for all days of prediction as well as lower (better) Brier scores (range: 0.13–0.16). The differences were statistically significant on day 2 for the AUC and on days 2–4 for the Brier score. Type C models were also statistically significantly better than Type A/B models for all days of prediction. The classification tree had worse AUCs (range: 0.67–0.79) and Brier scores (range: 0.13–0.17) compared to the temporal logistic regression models. The differences were significant for all days except day 6 for the AUC, and for days 3 and 5 for the Brier score.

Table 2
Description of the developed logistic regression models. SAPSII: simplified acute physiology score-II, L: low (score of 0–6), M: medium (score of 7–8), H: high (score of 9–24), expert: expert opinion, i.e. the physician's predictions of mortality. Note that no Type B models are shown after day 2, as for these days Type A models had a lower AIC than Type A + mean SOFA, Type A + max SOFA and Type A + delta SOFA.

Day   Model type     Model description
2     Type A model   −2.1 + 0.034 * SAPSII − 1.24 * L − 0.959 * M
      Type B model   −4.22 + 0.032 * SAPSII + 0.16 * mean SOFA
      Type C model   −3.62 + 0.017 * SAPSII + 0.44 * L + 4.33 * mean expert
3     Type A model   −2.0 + 0.026 * SAPSII − 1.436 * L
      Type C model   −2.5 − 1.28 * L + 4.36 * mean expert
4     Type A model   −2.06 + 0.022 * SAPSII + 0.47 * H − 0.95 * L,L
      Type C model   −2.9 − 1.24 * L,L + 5.09 * mean expert
5     Type A model   −1.48 + 0.021 * SAPSII − 1.44 * L
      Type C model   −2.6 − 1.35 * L + 4.91 * mean expert
6     Type A model   −2.05 + 0.018 * SAPSII + 1.17 * H − 0.88 * L,L,L,L
      Type C model   −3.17 + 0.76 * H − 1.25 * L,L,L,L + 5.06 * mean expert
7     Type A model   −1.03 + 1.49 * H − 1.02 * L,L,L,L,L
      Type C model   −3.26 + 0.91 * H − 1.47 * L,L,L,L,L + 5.35 * mean expert

Fig. 3. Classification tree for day 4. The candidate variables were the same as for the Type C temporal logistic regression models. Mxexp: max expert opinion (i.e. predicted probability of mortality by the physician), mnexp: mean expert opinion, mnsofa: mean SOFA score (all three from the day of admission until day 4), X2: pattern L,L (i.e. low SOFA scores on days 3 and 4).

Fig. 4. Predictive performance of physicians and temporal models. AUC: area under the receiver operating characteristic curve. Note that no predictive performance of Type B models is shown after day 2, as for these days Type A models had a lower AIC than Type A + mean SOFA, Type A + max SOFA and Type A + delta SOFA.
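The bootstrap procedure behind these confidence intervals (Section 2.4) can be sketched as follows. The 1000 resamples follow the text; the prediction/outcome pairs are invented toy data.

```python
import random

# Approximate the bootstrap sampling distribution of a statistic (here
# the Brier score) and summarize it with a 95% percentile interval.

def brier(pairs):
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

random.seed(42)
data = [(0.8, 1), (0.3, 0), (0.6, 1), (0.2, 0), (0.7, 0), (0.9, 1)]

stats = []
for _ in range(1000):  # 1000 bootstrap samples, as in the paper
    resample = [random.choice(data) for _ in data]
    stats.append(brier(resample))
stats.sort()

# 2.5th and 97.5th percentiles of the bootstrap distribution.
lower, upper = stats[24], stats[974]
print(round(lower, 3), round(upper, 3))
```

Differences between two prediction sources can be assessed the same way, by bootstrapping the difference of their statistics on paired resamples.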

4. Discussion

In this study, we compared the performance of physicians and mathematical models in predicting mortality in the ICU. Physicians had fair to good discriminative ability (AUC range: 0.78–0.79), which was statistically significantly better than that of the mathematical models (AUC range: 0.71–0.74). The accuracy of physicians (Brier score range: 0.15–0.18) was comparable to the accuracy of the mathematical models without subjective information (Brier score range: 0.16–0.18). Nevertheless, physicians seem to markedly overestimate mortality. The mean predicted mortality by physicians (37.6%) was much higher than the observed mortality (19.2%), and patients estimated as high risk (>0.9 probability of death) still had a realistic chance to survive (25.2%). Mathematical models were significantly improved by adding subjective information (AUC range: 0.80–0.83 and Brier score range: 0.13–0.16), which made them superior to physicians.

Remarkably, for most prediction days no models were found that included abstractions of the SOFA score. This shows that the SOFA patterns have superior predictive value over the SOFA abstractions. The combined model included the SAPS-II score only on day 2, while it was included in the other model types until day 6. Similarly, the tree-model included the SAPS-II score only on day 2. Apparently, the SAPS-II score loses its predictive value with increasing length of ICU stay, as it only contains information pertaining to the first day of admission. This explanation is confirmed by the increasingly reduced value of the SAPS-II coefficient over days. As expected, the expert opinion coefficient in the combined models is positive, whereas SOFA patterns that only include the "low" category have a negative coefficient (indicating that these patterns have a protective effect with respect to mortality). No other patterns were selected. These other patterns may either have low frequency in the dataset or low additive predictive value. In recursive partitioning, expert opinion was always chosen as the root of the classification tree. The high coefficients of expert opinion in the logistic regression models also confirm that this may be the most important predictor of mortality. This can be partly explained as physicians are the ones who are responsible for making
end-of-life decisions. Of note, only one tree-model used a SOFA pattern. Compared to the tree-models, however, the logistic regression models had superior predictive performance in terms

f both accuracy and discrimination. It is likely that once theree had made use of the most predictive variables it was leftith relatively too little data to render the SOFA patterns as

ignificant.To our knowledge the present pilot study is unique, and

he first in comparing models with daily predictions of physi-ians. The idea of models with repeated predictions is relativelyew, and these models have not been combined with subjec-ive information before, at least not in the ICU. Other strengthsf this study are the use of bootstrapping techniques to cor-ect for optimistic estimates of predictive performance because ofimilarities in the training and testing datasets, the AIC based incre-ental variable selection and the comparison of the techniques

f logistic regression and recursive partitioning. Weaknesses ofhis study include the use of a relatively small dataset and theact that different physicians provided estimates for the sameatients.

Other studies scrutinizing expert opinion on a daily basisither validated the most recent prediction only [25] or did notrovide direct measures of predictive performance, such as theUC or error rates [5,35]. In general, other studies validatingredictive ability of physicians also found fair to good discrimi-ation [12,14–21,11,22,23] and superior predictive performancef a combined approach of subjective and objective information16,18,11,22,25,26].

We found that the mathematical models can be significantly improved by the inclusion of subjective information. These models can provide some patient groups with information about (non-)survival probabilities, and can be potentially useful to support individual decision-making. Although some of these patients will have very high probabilities to die and may prefer palliative care, the models might also help physicians not to withdraw therapy. As physicians tend to overestimate mortality, the models might be able to prevent unjustified decisions leading to withdrawal of treatment in patients who would otherwise have a good chance to survive. Although one may argue that a patient would benefit more from an accurate prediction on the day of admission than on the day before his or her non-survival, it is often hard to say at an early stage which patients will survive ICU admission. Only patients that may benefit from ICU admission are admitted, but it is likely that their chances for survival during admission will change as their condition changes for either better or worse. Indeed, we found in this study that SAPS-II may lose its predictive value with increasing ICU days of stay. Our models provide additional information on the patient's probability of survival during ICU admission that may be used in addition to the admission model. Moreover, an advantage of including expert opinion in the final prognosis is that it may lead to higher acceptance of the models by physicians and patients or their families. Before they can make their way into clinical practice, however, extensive external validation and studies of their impact on clinical decisions and patient outcomes are imperative [36].
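How expert opinion enters such a combined model can be sketched as a type-C-style logistic regression over objective covariates (SAPS-II plus the mean, max and delta of SOFA scores) and subjective ones (the mean, max and delta of the physicians' daily mortality estimates). The feature set mirrors the paper's description; the coefficient values and function name are invented for illustration and are not the fitted model.

```python
# Illustrative type-C-style prediction: a logistic model combining
# objective severity data with summaries of daily expert estimates.
# All coefficients passed in are hypothetical.
import math

def predict_mortality(saps2, sofa_scores, expert_probs, coef, intercept):
    """Linear predictor over objective and subjective covariates, then sigmoid."""
    x = [
        saps2,                                     # admission severity (SAPS-II)
        sum(sofa_scores) / len(sofa_scores),       # mean SOFA
        max(sofa_scores),                          # max SOFA
        sofa_scores[-1] - sofa_scores[0],          # delta SOFA
        sum(expert_probs) / len(expert_probs),     # mean expert estimate
        max(expert_probs),                         # max expert estimate
        expert_probs[-1] - expert_probs[0],        # delta expert estimate
    ]
    z = intercept + sum(c * v for c, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-z))
```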

Future research needs to focus on questions about how to optimize these temporal models, the potential value of including nurses' predictions of mortality, their external validity, their acceptability by clinicians, and their potential impact on clinical decisions and patient outcomes. Note also that in our study the AUC and Brier scores were generally better on the first days of admission, but that the patients in later days form subsets of those in earlier days because on each day some patients leave the ICU (they are either discharged alive or die). Investigating the behavior of the AUC and Brier scores over time for the same patient group merits future research.


5. Conclusion

Mathematical models have fair discrimination, worse than that of physicians, but better calibration. They can be significantly improved by the inclusion of subjective information. This may lead to higher acceptance of the models by physicians and patients or their families to support individual decision-making in patient groups with a very high risk of dying. Before they can make their way into clinical practice, however, extensive external validation and studies of their impact on clinical decisions and patient outcomes are imperative [36].

Acknowledgement

This work was partially supported by the Netherlands Organization for Scientific Research (NWO) under the I-Catcher project, number 634.000.020.

References

[1] Lucas PJ, Abu-Hanna A. Prognostic methods in medicine. Artificial Intelligence in Medicine 1999;15:105–19.

[2] Abu-Hanna A, Lucas PJ. Prognostic models in medicine. AI and statistical approaches. Methods of Information in Medicine 2001;40:1–5.

[3] Fried TR, Bradley EH, Towle VR, Allore H. Understanding the treatment preferences of seriously ill patients. New England Journal of Medicine 2002;346:1061–6.

[4] Murphy DJ, Burrows D, Santilli S, Kemp AW, Tenner S, Kreling B, et al. The influence of the probability of survival on patients' preferences regarding cardiopulmonary resuscitation. New England Journal of Medicine 1994;330:545–9.

[5] Rocker G, Cook D, Sjokvist P, Weaver B, Finfer S, McDonald E, et al. Clinician predictions of intensive care unit mortality. Critical Care Medicine 2004;32:1149–54.

[6] Whitecotton SM, Sanders DE, Norris KB. Improving predictive accuracy with a combination of human intuition and mechanical decision aids. Organizational Behavior and Human Decision Processes 1998;76:325–48.

[7] Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Critical Care Medicine 1985;13:818–29.

[8] Le Gall JR, Lemeshow S, Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. Journal of the American Medical Association 1993;270:2957–63.

[9] Lemeshow S, Teres D, Klar J, Avrunin JS, Gehlbach SH, Rapoport J. Mortality probability models (MPM II) based on an international cohort of intensive care unit patients. Journal of the American Medical Association 1993;270:2478–86.

[10] Christakis NA, Lamont EB. Extent and determinants of error in doctors' prognoses in terminally ill patients: prospective cohort study. British Medical Journal 2000;320:469–72.

[11] McClish DK, Powell SH. How well can physicians estimate mortality in a medical intensive care unit? Medical Decision Making 1989;9:125–32.

[12] Poses RM, Bekes C, Copare FJ, Scott WE. The answer to "What are my chances, doctor?" depends on whom is asked: prognostic disagreement and inaccuracy for critically ill patients. Critical Care Medicine 1989;17:827–33.

[13] Christakis NA, Iwashyna TJ. Attitude and self-reported practice regarding prognostication in a national sample of internists. Archives of Internal Medicine 1998;158:2389–95.

[14] Scholz N, Basler K, Saur P, Burchardi H, Felder S. Outcome prediction in critical care: physicians' prognoses vs. scoring systems. European Journal of Anaesthesiology 2004;21:606–11.

[15] Garrouste-Orgeas M, Montuclard L, Timsit JF, Misset B, Christias M, Carlet J. Triaging patients to the ICU: a pilot study of factors influencing admission decisions and patient outcomes. Intensive Care Medicine 2003;29:774–81.

[16] Knaus WA, Harrell Jr FE, Lynn J, Goldman L, Phillips RS, Connors Jr AF, et al. The SUPPORT prognostic model: objective estimates of survival for seriously ill hospitalized adults. Study to understand prognoses and preferences for outcomes and risks of treatments. Annals of Internal Medicine 1995;122:191–203.

[17] Christensen C, Cottrell JJ, Murakami J, Mackesy ME, Fetzer AS, Elstein AS. Forecasting survival in the medical intensive care unit: a comparison of clinical prognoses with formal estimates. Methods of Information in Medicine 1993;32:302–8.

[18] Marks RJ, Simons RS, Blizzard RA, Browne DR. Predicting outcome in intensive therapy units—a comparison of APACHE II with subjective assessments. Intensive Care Medicine 1991;17:159–63.

[19] Poses RM, Bekes C, Copare FJ, Scott WE. What difference do two days make? The inertia of physicians' sequential prognostic judgments for critically ill patients. Medical Decision Making 1990;10:6–14.

[20] Poses RM, Bekes C, Winkler RL, Scott WE, Copare FJ. Are two (inexperienced) heads better than one (experienced) head? Averaging house officers' prognostic judgments for critically ill patients. Archives of Internal Medicine 1990;150:1874–8.


[21] Poses RM, McClish DK, Bekes C, Scott WE, Morley JN. Ego bias, reverse ego bias, and physicians' prognostic judgments for critically ill patients. Critical Care Medicine 1991;19:1533–9.

[22] Brannen AL, Godfrey LJ, Goetter WE. Prediction of outcome from critical illness. A comparison of clinical judgment with a prediction rule. Archives of Internal Medicine 1989;149:1083–6.

[23] Kruse JA, Thill-Baharozian MC, Carlson RW. Comparison of clinical assessment with APACHE II for predicting mortality risk in patients admitted to a medical intensive care unit. Journal of the American Medical Association 1988;260:1739–42.

[24] Sinuff T, Adhikari NK, Cook DJ, Schunemann HJ, Griffith LE, Rocker G, et al. Mortality predictions in the intensive care unit: comparing physicians with scoring systems. Critical Care Medicine 2006;34:878–85.

[25] Chang RW, Lee B, Jacobs S, Lee B. Accuracy of decisions to withdraw therapy in critically ill patients: clinical judgment versus a computer model. Critical Care Medicine 1989;17:1091–7.

[26] Meyer AA, Messick WJ, Young P, Baker CC, Fakhry S, Muakkassa F, et al. Prospective comparison of clinical judgment and APACHE II score in predicting the outcome in critically ill surgical patients. Journal of Trauma 1992;32:747–53.

[27] Toma T, Bosman RJ, Siebes A, Peek N, Abu-Hanna A. Learning predictive models that use pattern discovery—a bootstrap evaluative approach applied in organ functioning sequences. Journal of Biomedical Informatics 2010;43:578–86.


[28] Toma T, Abu-Hanna A, Bosman RJ. Discovery and inclusion of SOFA score episodes in mortality prediction. Journal of Biomedical Informatics 2007;40:649–60.

[29] Toma T, Abu-Hanna A, Bosman RJ. Discovery and integration of univariate patterns from daily individual organ-failure scores for intensive care mortality prediction. Artificial Intelligence in Medicine 2008;43:47–60.

[30] de Jonge E, Bosman RJ, van der Voort PH, Korsten HH, Scheffer GJ, de Keizer NF. Intensive care medicine in The Netherlands, 1997–2001. I. Patient population and treatment outcome. Nederlands Tijdschrift Voor Geneeskunde 2003;147:1013–7.

[31] Fretschner R, Bleicher W, Heininger A, Unertl K. Patient data management systems in critical care. Journal of the American Society of Nephrology 2001;12(Suppl. 17):S83–90.

[32] Minne L, Abu-Hanna A, de Jonge E. Evaluation of SOFA-based models for predicting mortality in the ICU: a systematic review. Critical Care 2008;12:R161.

[33] Burnham KP, Anderson DR. Model selection and multimodel inference: a practical information-theoretic approach. Springer; 2002.

[34] Hand DJ. Construction and assessment of classification rules. John Wiley & Sons; 1997.

[35] Frick S, Uehlinger DE, Zuercher Zenklusen RM. Medical futility: predicting outcome of intensive care unit patients by nurses and doctors—a prospective comparative study. Critical Care Medicine 2003;31:456–61.

[36] Moons KG, Royston P, Vergouwe Y, Grobbee DE, Altman DG. Prognosis and prognostic research: what, why, and how? British Medical Journal 2009;338:b375.