
RC25489 (WAT1409-048) September 17, 2014
Computer Science

IBM Research Report

WatsonPaths: Scenario-based Question Answering and Inference over Unstructured Information

Adam Lally, Sugato Bagchi, Michael A. Barborak, David W. Buchanan, Jennifer Chu-Carroll, David A. Ferrucci*, Michael R. Glass, Aditya Kalyanpur, Erik T. Mueller, J. William Murdock, Siddharth Patwardhan, John M. Prager, Christopher A. Welty

IBM Research Division
Thomas J. Watson Research Center

P.O. Box 218
Yorktown Heights, NY 10598

* This work was done while at the IBM Thomas J. Watson Research Center

Research Division
Almaden – Austin – Beijing – Cambridge – Dublin – Haifa – India – Melbourne – T.J. Watson – Tokyo – Zurich


WatsonPaths: Scenario-based Question Answering and Inference over Unstructured Information

Adam Lally1, Sugato Bagchi1, Michael A. Barborak1, David W. Buchanan1, Jennifer Chu-Carroll1, David A. Ferrucci2, Michael R. Glass1, Aditya Kalyanpur1, Erik T. Mueller1, J. William Murdock1, Siddharth Patwardhan1, John M. Prager1, Christopher A. Welty1

1 IBM Research and IBM Watson Group, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598

2 This work was done while at the IBM Thomas J. Watson Research Center

Abstract

We present WatsonPaths™, a novel system that can answer scenario-based questions, for example medical questions that present a patient summary and ask for the most likely diagnosis or most appropriate treatment. WatsonPaths builds on the IBM Watson™ question answering system, which takes natural language questions as input and produces precise answers along with accurate confidences as output. WatsonPaths breaks down the input scenario into individual pieces of information, asks relevant subquestions of Watson to conclude new information, and represents these results in a graphical model. Probabilistic inference is performed over the graph to conclude the answer. On a set of medical test preparation questions, WatsonPaths shows a significant improvement in accuracy over the base Watson QA system. We also describe how WatsonPaths can be used in a collaborative application to help users reason about complex scenarios.

1 Introduction

IBM Watson™ is a question answering system that takes natural language questions as input and produces precise answers along with accurate confidences as output (Ferrucci et al., 2010). Watson defeated two of the best human players on the quiz show Jeopardy! in 2011.

Watson has been described (Kelly and Hamm, 2013) as an opening to the era of cognitive computing: computers that interact in a natural way with humans, assist human cognition, and learn and improve from interaction. To fulfill this vision, further advances are required. One such advance is the ability to answer more complex questions. Another is to allow the user to understand and participate in the question answering process.

Consider the following questions, one from medicine and one from taxation:

A 32-year-old woman with type 1 diabetes mellitus has had progressive renal failure. Her hemoglobin concentration is 9 g/dL. A blood smear shows normochromic, normocytic cells. What is the problem?

I inherited real estate from a relative who died 5 years ago via a trust that was created before his death. The property was sold this year after dissolution of the trust, and the money was put in a Roth IRA. Which tax form(s) do I need to file?

We asked domain experts to describe their approach to solving such questions. An example from medical experts is shown in Figure 1. Many drew a graph of initial signs and symptoms leading to their most likely possible causes and connecting them to a final conclusion. We noticed that their reasoning process often resembled probabilistic inference.

At the core of Watson's question answering is a suite of algorithms that match passages containing candidate answers to the original question. These algorithms have been described in a series of articles (Chu-Carroll et al., 2012; Ferrucci, 2012; Gondek et al., 2012; Lally et al., 2012; McCord et al., 2012; Murdock et al., 2012a; Murdock et al., 2012b).

Figure 1: Simple Diagnosis Graph for a Patient with Erythropoietin Deficiency

But when questions involve complex scenarios, as in the above examples, passage matching by itself is often insufficient to locate the answer. This is because scenario-based question answering requires integrating and reasoning over information from multiple sources. Furthermore, we must often apply general knowledge to a specific case, as in a medical scenario about a patient.

In this paper, we present a new approach that builds on Watson's strengths and is in line with the human reasoning process we observed. We break down the input scenario into individual pieces of information, ask relevant subquestions to conclude new information, and combine these results into an assertion graph. We then perform probabilistic inference over the graph to conclude the answer to the overall question. This process is repeated to extend the graph until a stopping condition is met. Because we use Watson to answer the subquestions, and because we attempt to construct paths of inference to a final answer, we call our system WatsonPaths™.

In the WatsonPaths graph, the evidence is drawn from a variety of sources, including general knowledge encyclopedias, domain-specific books, structured knowledge bases, and semi-structured knowledge bases. We were motivated by the desire to design a solution that could harness Watson, and we observed that each edge in this graph could correspond to a question asked of Watson.

An added dimension to WatsonPaths is the ability to interact with the user. The original Watson system that won Jeopardy! was largely non-interactive. For many applications, it is important to engage the user in the problem solving process. WatsonPaths has the ability to elicit user input at multiple points of question answering and decision making to clarify questions, to judge evidence, and to ask new questions. A key advantage of this approach is that user feedback can be used as training data to improve both Watson and WatsonPaths.

2 WatsonPaths Medical Use Case

Although WatsonPaths enables general-purpose scenario-based question answering, we decided to start by focusing our attention on the medical domain. We focused on the problem of patient scenario analysis, where the goal is typically a diagnosis or a treatment recommendation.

To explore this kind of problem solving, we obtained a set of medical test preparation questions. These are multiple choice medical questions based on an unstructured or semi-structured natural language description of a patient. Although WatsonPaths is not restricted to multiple choice questions, we saw multiple choice questions as a good starting point for development. Many of these questions involve diagnosis, either as the entire question, as in the previous medical example, or as an intermediate step, as in the following example:

A 63-year-old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago. At first it was only on the left hand, but now it compromises the whole arm. At physical exam, the patient has an unexpressive face and difficulty in walking, and a continuous movement of the tip of the first digit over the tip of the second digit of the left hand is seen at rest. What part of his nervous system is most likely affected?

For this question, it is useful to diagnose that the patient has Parkinson's disease before determining which part of his nervous system is most likely affected. These multi-step inferences are a natural fit for the graphs that WatsonPaths constructs. In this example, the diagnosis is the missing link on the way to the final answer.

3 Scenario-based Question Answering

In scenario-based question answering, the system receives a scenario description that ends with a punchline question. For instance, the punchline question in the Parkinson's example is "What part of his nervous system is most likely affected?" Instead of treating the entire scenario as one monolithic question as Watson would, WatsonPaths explores multiple facts in the scenario in parallel and reasons with the results of its exploration as a whole to arrive at the most likely conclusion regarding the punchline question. The architecture of WatsonPaths is shown in Figure 2.

Figure 2: Scenario-based Question Answering Architecture

3.1 Scenario Analysis

The first step in the pipeline is scenario analysis, where we identify factors in the input scenario that may be of importance. In the medical domain, the factors may include demographics ("32-year-old woman"), pre-existing conditions ("type 1 diabetes mellitus"), signs and symptoms ("progressive renal failure"), and test results ("hemoglobin concentration is 9 g/dL," "normochromic cells," "normocytic cells"). The extracted factors become nodes in a graph structure called the assertion graph. The assertion graph structure is defined in Section 4, while more details of the scenario analysis process are given in Section 5.

3.2 Node Prioritization

The next step is node prioritization, where we decide which nodes in the graph are most important for solving the problem. In a small scenario like this example, we may be able to explore everything, but in general this will not be the case. Factors that affect the priority of a node may include the system's confidence in the node assertion or the system's estimation of how fruitful it would be to expand a node. For example, normal test results and demographic information are generally less useful for starting a diagnosis than symptoms and abnormal test results.

3.3 Relation Generation

The relation generation step, which is described in more detail in Section 6, builds the assertion graph. We do this primarily by asking Watson questions about the factors. In medicine we want to know the causes of the findings and abnormal test results that are consistent with the patient's demographic information and normal test results. Given the scenario in the Introduction, we could ask, "What does type 1 diabetes mellitus cause?" We use a medical ontology to guide the process of formulating subquestions to ask Watson. Relevant factors may also be combined to form a single, more targeted question. Because in this step we want to emphasize recall, we take several of Watson's highly ranked answers. The exact number of answers taken, or the confidence threshold, is a parameter that must be tuned. Given a set of answers, we add them to the graph as nodes, with edges from nodes that were used in questions to nodes that were answers. The edge is labeled with the relation used to formulate the question (like causes or indicates), and the strength of the edge is initially set to Watson's confidence in the answer. Although Watson is the primary way we add edges to the graph, WatsonPaths allows for any number of relation generator components to post edges to the graph.

3.4 Belief Computation

Once the assertion graph has been expanded in this way, we recompute the confidences of nodes in the graph based on the new information. We do this using probabilistic inference systems that are described in Section 7. The inference systems take a holistic view of the assertion graph and try to reconcile the results of multiple paths of exploration.

3.5 Hypothesis Identification

As Figure 2 shows, this process can go through multiple iterations, during which the nodes that were the answers to the previous round of questions can be used to ask the next round of questions, producing more nodes and edges in the graph. After each iteration we may do hypothesis identification, where some nodes in the graph are identified as potential final answers to the punchline question (for example, the most likely diagnoses of a patient's problem). In some situations hypotheses may be provided up front; a physician may have a list of competing diagnoses and want to explore the evidence for each. But in general the system needs to identify these. Hypothesis nodes may be treated differently in later iterations. For instance, we may attempt to do backward chaining from the hypotheses, asking Watson what things, if they were true of the patient, would support or refute a hypothesis. The process may terminate after a fixed number of iterations or based on some other criterion like confidence in the hypotheses.


While hypothesis identification is part of WatsonPaths, it is not described in detail in this paper. In the system that generates the results we present in Section 10, no hypothesis identification is necessary because the multiple choice answers are provided. That system always does one iteration of expansion, both forward from the identified factors and backward from the hypotheses, before stopping.
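
To make the control flow in Figure 2 concrete, the following is a minimal, runnable sketch of the main loop, with stub functions standing in for the components described above (scenario analysis, Watson subquestion asking, belief computation). All names and stub behaviors are illustrative assumptions, not the actual WatsonPaths API.

```python
def analyze_scenario(text):
    # Stub for Section 3.1: the real system extracts typed clinical factors.
    return {factor.strip(): 1.0 for factor in text.split(";")}

def expand_node(node):
    # Stub for Section 3.3: ask Watson a subquestion about the node and
    # return (answer, confidence, relation) triples.
    return [("cause of " + node, 0.7, "indicates")]

def recompute_beliefs(graph):
    # Stub for Section 3.4: the real belief engine runs probabilistic
    # inference; here belief makes a single multiplicative hop.
    for (src, dst, _rel), strength in graph["edges"].items():
        flowed = graph["nodes"][src] * strength
        graph["nodes"][dst] = max(graph["nodes"].get(dst, 0.0), flowed)

def solve(scenario, iterations=1):
    graph = {"nodes": analyze_scenario(scenario), "edges": {}}
    for _ in range(iterations):
        # Section 3.2: node prioritization; here we simply expand every node.
        for node in list(graph["nodes"]):
            for answer, conf, rel in expand_node(node):
                graph["nodes"].setdefault(answer, 0.0)
                graph["edges"][(node, answer, rel)] = conf
        recompute_beliefs(graph)
    # Section 3.5: hypothesis identification; here, top-belief nodes.
    return sorted(graph["nodes"].items(), key=lambda kv: -kv[1])

print(solve("resting tremor; unexpressive face"))
```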

3.6 Hypothesis Confidence Refinement

As described so far, WatsonPaths' confidence in each hypothesis depends on the strengths of the edges leading to it, and since our primary relation (edge) generator is Watson, the hypothesis confidence depends heavily on the confidence of Watson's answers. Having good answer confidence depends on having a representative set of question/answer pairs with which to train Watson. The following question arises: What can we do if we do not have a representative set of question/answer pairs, but we do have training examples for entire scenarios (e.g., correct diagnoses associated with patient scenarios)? To leverage the available scenario-level ground truth, we have built machine learning techniques to learn a refinement of Watson's confidence estimation that produces better results when applied to the entire scenario. This learning process is discussed in Section 8.

3.7 Collaborating with the User

WatsonPaths can run in a completely automated way, as the Watson question answering system did when playing Jeopardy! (This is the case for the results presented in Section 10.) But there are also many interesting possibilities for user interaction at each step in the process. In this way, WatsonPaths exemplifies cognitive computing. Our vision for cognitive computing is that the user and the computer work together to explore a scenario and reach conclusions faster and more accurately than either could do alone. We discuss the collaborative learning aspects of WatsonPaths in Section 9.

4 Assertion Graphs

The core data structure used by WatsonPaths is the assertion graph. Figure 3 explains this data structure, along with the visualization that we commonly use for it. Assertion graphs are defined as follows.

A statement is something that can be true or false (though its state may not be known). Often we deal with unstructured statements, which are natural language expressions like "A 63-year-old patient is sent to the neurologist with a clinical picture of resting tremor that began 2 years ago." WatsonPaths also allows for statements that are structured expressions, namely, a predicate and arguments.

Figure 3: Visualization of an Assertion Graph. By convention, input factors are placed at the top and hypotheses at the bottom, with levels of inferred factors in between.

Not all natural language expressions can have a truth value. For instance, the string "patient" cannot be true or false; thus it does not fit into the semantics of an assertion graph. WatsonPaths is charitable in interpreting strings as if they had a truth value. For instance, the default semantics of the string "low hemoglobin" is the same as "patient has low hemoglobin."

A relation is a named association between statements. Technically, relations are themselves statements, and have a truth value. Each relation has a predicate; for instance, in medicine we may say that "Parkinson's causes resting tremor" or "Parkinson's matches Parkinsonism." Typically we are concerned with relations that may provide evidence for the truth of one statement given another. Although some relations may have special meanings in the probabilistic inference systems, a common semantics for a relation is indicative in the following way: "A indicates B" means that the truth of A provides an independent reason to believe that B is true. Section 7 provides more detail on the inference systems.

An assertion is a claim that some agent makes about the truth of a statement (including a relation). The assertion records the name of the agent and a confidence value. Assertions may also record provenance information that explains how the agent came to its conclusion.


For the Watson question answering agent, this includes natural language passages that provide evidence for the answer. When the system is collaborating with a user, it is crucial to be able to display evidence to the user.

In the assertion graph, each node represents exactly one statement, and each edge represents exactly one relation. Nodes and edges may have multiple assertions attached to them, one for each agent that has asserted that node or edge to be true.

We often visualize assertion graphs by using a node's border width to represent the confidence of the node, an edge's width to represent the confidence of the edge, and an edge's gray level as the amount of "belief flow" along that edge. Belief flow is described later, but essentially it is how much the value of the head influences the value of the tail. This depends mostly on the confidences of the assertions on the edge.
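
One plausible way to render the data model just described in code is sketched below; the class and field names are our assumptions for illustration, since the actual WatsonPaths implementation is not published.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Statement:
    text: str                       # e.g., "patient has anemia"

@dataclass
class Assertion:
    agent: str                      # who made the claim, e.g., "Watson"
    confidence: float               # in [0, 1]
    provenance: str = ""            # e.g., a supporting passage

@dataclass
class Node:
    statement: Statement
    assertions: list = field(default_factory=list)   # one per agent

@dataclass
class Edge:
    source: Statement
    target: Statement
    predicate: str                  # e.g., "indicates", "causes", "matches"
    assertions: list = field(default_factory=list)

class AssertionGraph:
    def __init__(self):
        self.nodes = {}             # Statement -> Node
        self.edges = []

    def assert_node(self, statement, agent, confidence, provenance=""):
        node = self.nodes.setdefault(statement, Node(statement))
        node.assertions.append(Assertion(agent, confidence, provenance))
        return node

    def assert_edge(self, source, target, predicate, agent, confidence):
        edge = Edge(source, target, predicate)
        edge.assertions.append(Assertion(agent, confidence))
        self.edges.append(edge)
        return edge

g = AssertionGraph()
tremor = Statement("patient exhibits resting tremor")
pd = Statement("patient has Parkinson's disease")
g.assert_node(tremor, agent="ScenarioAnalysis", confidence=1.0)
g.assert_edge(tremor, pd, "indicates", agent="Watson", confidence=0.8)
```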

5 Scenario Analysis

The goal of scenario analysis is to identify information in the natural language narrative of the problem scenario that is potentially relevant to solving the problem. When human experts read the problem narrative, they are trained to extract concepts that match a set of semantic types relevant for solving the problem. In the medical domain, doctors and nurses identify semantic types like chief complaints, past medical history, demographics, family and social history, physical examination findings, labs, and current medications (Bowen, 2006). Experts also generalize from specific observations in a particular problem instance to more general terms used in the domain corpus. An important aspect of this information extraction is to identify the semantic qualifiers associated with the clinical observations (Chang et al., 1998). These qualifiers could be temporal (e.g., "pain started two days ago"), spatial ("pain in the epigastric region"), or other associations ("pain after eating fatty foods"). Implicit in this task is the human's ability to extract concepts and their associated qualifiers from the natural language narrative. For example, the above qualifiers might have to be extracted from the sentence "The patient reports pain, which started two days ago, in the epigastric region, especially after eating fatty foods."

The computer system needs to perform a similar analysis of the narrative. We use the term factor to denote the potentially relevant observations along with their associated semantic qualifiers. Reliably identifying and typing these factors, however, is a difficult task, because medical terms are far more complex than the kind of named entities typically studied in natural language processing. Our scenario analytics pipeline attempts to address this problem with the following major processing steps:

1. The analysis starts with syntactic parsing of the natural language. This creates a dependency tree of syntactically linked terms in a sentence and helps to associate terms that are distant from each other in the sentence.

2. The terms are mapped to a dictionary to identify concepts and their semantic types. For the medical domain, our dictionary is derived from the UMLS Metathesaurus (National Library of Medicine, 2009), Wikipedia redirects, and medical abbreviation resources. The concepts identified by the dictionary are then typed using the UMLS Semantic Network, which consists of a taxonomy of biological and clinical semantic types like Anatomy, SignOrSymptom, DiseaseOrSyndrome, and TherapeuticOrPreventativeProcedure. In addition to mapping the sequence of tokens in a sentence to the dictionary, the dependency parse is also used to map syntactically linked terms. For example, ". . . stiffness and swelling in the arm and leg" can be mapped to the four separate concepts contained in that phrase.

3. The syntactic and semantic information identified above is used by a set of predefined rules to identify important relations. Negation is commonly used in clinical narratives and needs to be accurately identified. Rules based on parse features identify the negation trigger term and its scope in a sentence. Factors found within the negated scope can then be associated with a negated qualifier. Another example of rule-based annotation is lab value analysis. This associates a quantitative measurement with the substance measured and then looks up reference lab value ranges to make a clinical assessment. For example, "hemoglobin concentration is 9 g/dL" is processed by rules to extract the value, unit, and substance and then assessed to be "low hemoglobin" by looking up a reference, as sketched below. Next, the clinical assessment is mapped by the dictionary to the corresponding clinical concept.
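
As a rough illustration of the lab value rule in step 3, the sketch below extracts the value, unit, and substance and compares them to a reference range. The range used is an assumed, textbook-style example, not a clinical reference, and the pattern covers only this one phrasing.

```python
import re

# Assumed, textbook-style reference range; not a clinical reference.
REFERENCE_RANGES = {("hemoglobin", "g/dL"): (12.0, 16.0)}

def assess_lab_value(text):
    m = re.search(r"(\w+) concentration is ([\d.]+)\s*(g/dL)", text)
    if not m:
        return None
    substance, value, unit = m.group(1).lower(), float(m.group(2)), m.group(3)
    bounds = REFERENCE_RANGES.get((substance, unit))
    if bounds is None:
        return None
    low, high = bounds
    if value < low:
        return "low " + substance
    if value > high:
        return "high " + substance
    return "normal " + substance

print(assess_lab_value("hemoglobin concentration is 9 g/dL"))  # low hemoglobin
```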

At this point, we should have all the information to identify factors and their semantic qualifiers. We have to contend, however, with language ambiguities, errors in parsing, a noisy and non-comprehensive dictionary, and a limited set of rules. If we were to rely solely on a rule-based system, then the resulting factor identification would suffer from a compounding of errors in these components. To address this issue, we employ machine learning methods to learn clinical factors and their semantic qualifiers in the problem narrative. We obtained the ground truth by asking medical students to annotate clinical factor spans and their semantic types. They also annotated semantic qualifier spans and linked them to factors as attributive relations.


The machine learning system comprises two sequential steps:

1. A conditional random field (CRF) model (Lafferty et al., 2001) learns the spans of text that should be marked as one of the following factor types: finding, disease, test, treatment, demographics, negation, or a semantic qualifier. Features used for training the CRF model are lexical (lemmas, morphological information, part-of-speech tags), semantic (UMLS semantic types and groups, demographic and lab value annotations), and parse-based (features associated with dependency links from a given token). A token window size of 5 (2 tokens before and after) is used to associate features for a given token. A BIO tagging scheme is used by the CRF to identify entities in terms of their token spans and types.

2. A maximum entropy model then learns the relations between the entities identified by the CRF model. For each pair of entities in a sentence, this model uses lexical features (within and between entities), entity type and other semantic features associated with both entities, and parse features in the dependency path linking them. The relations learned by this model are negation and attributeOf relations linking negation triggers and semantic qualifiers (respectively) to factors.
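
The sketch below illustrates the token-window feature extraction used in the CRF step, restricted to lexical features for brevity; the feature names and the example BIO labels are assumptions for illustration (the real system also uses semantic and dependency-parse features).

```python
def token_features(tokens, i, window=2):
    # Lexical features for token i over a 5-token window (2 before, 2 after).
    feats = {"lemma": tokens[i].lower()}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["tok[%+d]" % offset] = tokens[j].lower()
    return feats

tokens = "Her hemoglobin concentration is 9 g/dL".split()
print(token_features(tokens, 2))
# One plausible BIO labeling of this span (illustrative, not gold data):
#   Her/O hemoglobin/B-test concentration/I-test is/O 9/B-lab g/dL/I-lab
```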

The combined entity and relation identification models have a precision of 71% and a recall of 65% on a blind evaluation set of patient scenarios found in medical test preparation questions. We are currently exploring joint inference models and identification of relations that span multiple sentences using coreference resolution.

6 Relation Generation

The scenario analysis component described in the previous section extracts pertinent factors related to the patient from the scenario description. At this stage, the assertion graph consists of the full scenario, individual scenario sentences, and the extracted factors. An indicates relation is posted from a source node (e.g., a scenario sentence node) to a target node whose assertion was derived from the assertion in the source node (e.g., a factor extracted from that sentence). In addition, a set of hypotheses, if given, are posted as goal nodes in the assertion graph.

The task of the relation generation component is to (1) expand the graph by inferring new facts from known facts in the graph and (2) identify relationships between nodes in the graph (like matches and contraindicates) to help with reasoning and confidence estimation. We begin by discussing how we infer new facts for graph expansion.

6.1 Expanding the Graph with Watson

In medical problem solving, experts reason with chief complaints, findings, medical history, demographic information, and so on, to identify the underlying causes of the patient's problems. Depending on the situation, they may then proceed to propose a test whose results will allow them to distinguish between multiple possible problem causes, or to identify the best treatment for the identified cause, and so on.

Motivated by the medical problem solving paradigm, WatsonPaths first attempts to make a diagnosis based on factors extracted from the scenario. The graph is expanded to include new assertions about the patient by asking questions of a version of the Watson question answering system adapted for the medical domain (Ferrucci et al., 2013). WatsonPaths takes a two-pronged approach to medical problem solving, expanding the graph forward from the scenario in an attempt to make a diagnosis, and then linking high confidence diagnoses with the hypotheses. The latter step is typically done by identifying an important relation expressed in the punchline question (e.g., "What is the most appropriate treatment for this patient?" or "What body part is most likely affected?"). This approach is a logical extension of the open-domain work of Prager et al. (2004), where, in order to build a profile of an entity, questions were asked about properties of the entity and constraints between the answers were enforced to establish internal consistency.

The graph expansion process of WatsonPaths begins with automatically formulating questions related to high confidence assertions, which in our graphs represent statements WatsonPaths believes to be true, to a certain degree of confidence, about the patient. These statements may be factors, as extracted and typed by the algorithm described in Section 5, or combinations of those factors.

To determine what kinds of questions to ask, WatsonPaths can use a domain model that tells us what relations form paths between the semantic type of a high confidence node and the semantic type of a hypothesis like a diagnosis or treatment. For the medical domain, we created a model that we called the Emerald, which is shown in Figure 4. (Notice the resemblance to an emerald.) The Emerald is a small model of entity types and relations that are crucial for diagnosis and for formulating next steps.

We select from the Emerald all relations that link the semantic type of a high-confidence source node to a semantic type of interest. The relations and the high-confidence nodes then form the basis of instantiating the target nodes, thereby expanding the assertion graph. To instantiate the target nodes, we issue subquestions to Watson. All answers returned by Watson that score above a predetermined threshold are posted as target nodes in the inference graph.


Figure 4: The Emerald


Figure 5: WatsonPaths Graph Expansion Process

A relation edge is posted from the source node to each new target node, where the confidence of the relation is Watson's confidence in the answer in the target node.

In addition to asking questions from scenario factors, WatsonPaths may also expand backwards from hypotheses. The premise of this approach is to explore how a hypothesis fits in with the rest of the inference graph. If one hypothesis is found to have a strong relationship with an existing node in the assertion graph, then the probabilistic inference mechanisms described in Section 7 allow belief to flow from known factors to that hypothesis, thus increasing the system's confidence in that hypothesis.

Figure 5 illustrates the WatsonPaths graph expansion process. The top two rows of nodes and the edges between them show a subset of the WatsonPaths assertion graph after scenario analysis, with the second row of nodes representing some clinical factors extracted from the scenario sentences.

The graph expansion process identifies the most confident assertions in the graph, which include the four clinical factor nodes extracted from the scenario. These four nodes are all typed as findings, so they are aggregated into a single finding node for the purpose of graph expansion. For a finding node, the Emerald proposes a single findingOf relation that links it to a disease. This results in the formulation of the subquestion "What disease causes resting tremor that began 2 years ago, compromises the whole arm, unexpressive face, and difficulty in walking?" whose answers include Parkinson disease, Huntington's disease, cerebellar disease, and so on. These answer nodes are added to the graph, and some of them are shown in the third row of nodes in Figure 5.

In the reverse direction, WatsonPaths explores relationships between hypotheses and nodes in the existing graph based on the punchline question in the scenario, which in this case is "What part of his nervous system is most likely affected?" Assuming each hypothesis to be true, the system formulates subquestions to link it to the assertion graph. Consider Substantia nigra. WatsonPaths can ask "In what disease is the substantia nigra most likely affected?" A subset of the answers to this question, including Parkinson's disease and diffuse Lewy body disease, are shown in the fourth row of nodes in Figure 5.
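
A minimal sketch of the forward expansion step follows, assuming a toy excerpt of the Emerald and a stub ask_watson function in place of the adapted Watson QA service (whose actual interface is not published); the answer scores are made up for illustration.

```python
# Tiny assumed excerpt of the Emerald: (source type, target type) -> template.
EMERALD = {("finding", "disease"): "What disease causes {}?"}

def ask_watson(question):
    # Stub standing in for the medical Watson QA service; returns scored
    # answers. These values are illustrative, not real system output.
    return [("Parkinson disease", 0.81), ("Huntington's disease", 0.26)]

def expand(graph, node_text, node_type, target_type, threshold=0.2):
    template = EMERALD.get((node_type, target_type))
    if template is None:
        return
    question = template.format(node_text)
    for answer, conf in ask_watson(question):
        if conf >= threshold:          # recall-oriented: keep several answers
            graph.setdefault(node_text, []).append((answer, "findingOf", conf))

graph = {}
expand(graph, "resting tremor, unexpressive face, difficulty in walking",
       "finding", "disease")
print(graph)
```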

6.2 Matching Graph Nodes

When a new node is added to the WatsonPaths assertion graph, we compare the assertion in the new node to those in existing nodes to ensure that equivalence relations between nodes are properly identified. This is done by comparing the statements in those assertions: for unstructured statements, whether the statements are lexically equivalent, and for structured statements, whether the predicates and their arguments are the same. A more complex operation is to identify when nodes contain assertions that may be equivalent to the new assertion.

We employ an aggregate of term matchers (Murdock et al., 2012a) to match pairs of assertions. Each term matcher posts a confidence value on the degree of match between two assertions based on its own resource for determining equivalence. For example, a WordNet-based term matcher considers terms in the same synset to be equivalent, and a Wikipedia-redirect-based term matcher considers terms with a redirect link between them in Wikipedia to be a match. The dotted line between Parkinson disease and Parkinson's disease in Figure 5 is posted by the UMLS-based term matcher, which considers variants of the same concept to be equivalent.
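
The following sketch shows one way such an aggregate of term matchers could be organized, with a toy lookup table standing in for WordNet, Wikipedia redirects, and UMLS; combining matcher scores by max is our simplifying assumption, as the paper does not specify the combination.

```python
# Toy synonym table standing in for WordNet, Wikipedia redirects, and UMLS.
SYNONYMS = {frozenset(["parkinson disease", "parkinson's disease"]): 0.95}

def exact_matcher(a, b):
    return 1.0 if a.lower() == b.lower() else 0.0

def synonym_matcher(a, b):
    return SYNONYMS.get(frozenset([a.lower(), b.lower()]), 0.0)

MATCHERS = [exact_matcher, synonym_matcher]

def match_confidence(a, b):
    # Each matcher posts its own confidence; taking the max is one simple
    # way to aggregate them.
    return max(matcher(a, b) for matcher in MATCHERS)

print(match_confidence("Parkinson disease", "Parkinson's disease"))  # 0.95
```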

7 Confidence and Belief

Once the assertion graph is constructed, and some questions and answers are posted, there remains the problem of confidence estimation. We develop multiple models of inference to address this step.


7.1 Belief Engine

One approach to the problem of inferring the correct hypothesis from the assertion graph is probabilistic inference over a graphical model (Pearl, 1988). We refer to the component that does this as the belief engine.

Although the primary goal of the belief engine is to infer confidences in hypotheses, it also has two secondary goals. One is to infer belief in unknown nodes that are not hypotheses. These nodes may be important intermediate steps toward an answer; by assigning high confidences to them in the main loop, we know to assign them high priority for subquestion asking. Another secondary goal is to support the user interface (see Section 9). Among inference algorithms that perform well in terms of accuracy and other metrics, we try to make choices that will make the flow of belief intuitive for users. This facilitates the gathering of better opportunistic annotations, which improves future performance.

To execute the belief engine, we first make a working copy of the assertion graph that we call the inference graph. A separate graph is used so that we can make changes without losing information that might be useful in later steps of inference. For instance, we might choose to merge nodes or reorient edges. Once the inference graph has been built, we run a probabilistic inference engine over the graph to generate new confidences. Each node represents an assertion, so it can be in one of two states: true or false ("on" or "off"). Thus a graph with k nodes can be in 2^k possible states. The inference graph specifies the likelihoods of each of these states. The belief engine uses these likelihoods to calculate the marginal probability, for each node, of it being in the true state. This marginal probability is treated as a confidence. Finally, we read confidences and other data from the inference graph back into the assertion graph.

There are some challenges in applying probabilistic inference to an assertion graph. Most tools in the inference literature were designed to solve a different problem, which we will call the classical inference problem. In this problem, we are given a training set and a test set that can be seen as samples from a common joint distribution. The task is to construct a model that captures the training set (for instance, by maximizing the likelihood of the training set), and then apply the model to predict unknown values in the test set. Arguably the greatest problem in the classical inference task is that the structure of the graphical model is underdetermined; a large space of possible structures needs to be explored. Once a structure is found, adjusting the strengths is relatively easier, because we know that samples from the training set are sampled from a consistent joint distribution.

In WatsonPaths, we face a different set of problems. The challenge is not to construct a model from training data, but to use a very noisy, already constructed model to do inference. Training data in the classical sense is absent or very sparse; all we have are correct answers to some scenario-level questions. An advantage is that a graph structure is given. A disadvantage is that the graph is noisy. Furthermore, it is not known that the confidences on the edges necessarily correspond to the optimal edge strengths. (In the next section, we address the problem of learning edge strengths.) Thus we have the problem of selecting a semantics: a way to convert the assertion graph into a graph over which we can do optimal probabilistic inference to meet our goals.

After much experimentation, the primary semantics used by the belief engine is the indicative semantics: if there is a directed relation from node A to node B with strength x, then A provides an independent reason to believe that B is true with probability x. Some edges are classified as contraindicative; for these edges, A provides an independent reason to believe that B is false with probability x. The independence means that multiple parents R can easily be combined using a noisy-OR:

\[ 1 - \prod_{r \in R} (1 - r) = \bigoplus_{r \in R} r \]

The graph, so interpreted, forms a noisy-logical Bayesian network (Yuille and Lu, 2007). The strength of each edge can be interpreted as an indicative power, a concept related to causal power (Cheng, 1997), with the difference that we are semantically agnostic as to the true direction of the causal relation. Formally, the probability of a node being "on" (true) is given by

\[ P(x \mid R_x, Q_x) = \left[ \bigoplus_{r \in R_x} (s_r p_r) \right] \left( 1 - \bigoplus_{q \in Q_x} (s_q p_q) \right) \]

where $P(x)$ is the probability of node $x$ being on, $R_x$ is the set of indicative parents of $x$, and $Q_x$ is the set of contraindicative parents. The parent's state is represented by $p_r$: 1 if the parent is on, and 0 otherwise. The value $s_r$ represents the strength of the edge from the parent to $x$. In other words, the probability that a node $x$ is on is the noisy-OR of its active indicative parent edge strengths, combined via a noisy-AND-NOT with the noisy-OR of its active contraindicative parent edge strengths.

For instance, if the node resting tremor indicates Parkinson disease with strength 0.8, and the node difficulty in walking indicates Parkinson disease with power 0.4, then the probability of Parkinson disease will be $1 - (1 - 0.8)(1 - 0.4) = 0.88$. If so, then the edge with strength 0.9 to Parkinson's disease will fire with probability $0.88 \cdot 0.9 = 0.792$.


In this way, probabilities can often multiply down simple chains. Inference must be more sophisticated to handle the graphs we see in practice, but the intuition is the same.
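
The node probability formula above is simple to compute directly; the sketch below reproduces the worked example just given. This is only the feed-forward intuition; as noted next, the actual belief engine must also handle cycles and constraints.

```python
def noisy_or(strengths):
    prob_off = 1.0
    for s in strengths:
        prob_off *= 1.0 - s
    return 1.0 - prob_off

def node_probability(indicative, contraindicative=()):
    # Arguments are the s_r * p_r products for the *active* parents:
    # noisy-OR of indicative evidence, noisy-AND-NOT with contraindicative.
    return noisy_or(indicative) * (1.0 - noisy_or(contraindicative))

# resting tremor (0.8) and difficulty in walking (0.4) both indicate
# Parkinson disease:
p_parkinson = node_probability([0.8, 0.4])
print(p_parkinson)          # approx 0.88
# Belief then flows across the strength-0.9 edge to Parkinson's disease:
print(p_parkinson * 0.9)    # approx 0.792
```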

An example that adds sophistication to the inference is an "exactly one" constraint that can be optionally added for multiple-choice questions. This constraint assigns a higher likelihood to assignments in which exactly one multiple choice answer is true. Because of these kinds of constraints, and because the graphs contain directed and undirected cycles, we cannot simply calculate the probabilities in a feed-forward manner. To perform inference we use Metropolis-Hastings sampling over a factor graph representation of the inference graph. This has the advantage of being a very general approach (the inference engine can easily be adapted to a new semantics) and also allows an arbitrary level of precision given enough processing time.
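
A generic sketch of Metropolis-Hastings over binary node states is shown below; the score function is a toy two-node factor standing in for the real factor graph, and the single-flip proposal is an assumption for illustration.

```python
import random

def metropolis_hastings(nodes, score, steps=20000, seed=0):
    """Estimate marginal P(node = true) under an unnormalized score()."""
    rng = random.Random(seed)
    state = {n: rng.random() < 0.5 for n in nodes}
    counts = {n: 0 for n in nodes}
    current = score(state)
    for _ in range(steps):
        n = rng.choice(nodes)                  # symmetric single-flip proposal
        state[n] = not state[n]
        proposed = score(state)
        # Metropolis acceptance: take improving (or escape-from-zero) moves;
        # take worsening moves with probability proposed/current.
        if current == 0 or rng.random() < min(1.0, proposed / current):
            current = proposed
        else:
            state[n] = not state[n]            # revert the flip
        for m in nodes:
            counts[m] += state[m]
    return {n: counts[n] / steps for n in nodes}

# Toy factor graph: prior P(A) = 0.6, and A indicates B with strength 0.8.
def score(s):
    p_a = 0.6 if s["A"] else 0.4
    p_b_true = 0.8 if s["A"] else 0.0          # noisy-OR with one parent
    p_b = p_b_true if s["B"] else 1.0 - p_b_true
    return p_a * p_b

print(metropolis_hastings(["A", "B"], score))  # approx {'A': 0.6, 'B': 0.48}
```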

Users and annotators report that they find the indicative semantics intuitive, and it performs at least as well as other semantics in experiments. One of the first semantics we tried was undirected pairwise Markov random fields. These performed poorly in practice. We hypothesize that this is because important information is contained in the direction of the edges that Watson returns: asking about A and getting B as an answer is different from asking about B and getting A as an answer. An undirected model loses this information.

The indicative semantics is a default, basic semantics. The ability to reason over arbitrary relations makes the indicative semantics robust, but it is easy to construct examples in which the indicative semantics is not strictly correct. For instance, "fever is a finding of Lyme disease" may be correctly true with high confidence, but this does not mean that fever provides an independent reason to believe, with high probability, that Lyme disease is present. Fever is caused by many things, each of which could explain it. We are currently working on adding a causal semantics in which we use a noisy-logical Bayesian network, but belief flows from causes to effects, rather than from factors to hypotheses. Edges are oriented according to the types of the nodes: diseases cause findings but not vice versa. Currently this does not lead to a detectable improvement in accuracy, and we expect that we need to improve the precision of the rest of the system before it will show impact.

7.2 Closed-Form Inference

The inference method in Section 7.1 obtains the strengths of edges directly from the confidence values on Watson's answers to subquestions, which depend on having a representative training set of subquestion/answer pairs. We have also developed inference methods where each edge has a feature vector (produced by Watson or any other relation generator) and we express the confidence in each hypothesis as a closed-form, parameterized expression over the feature values. We can then optimize the parameters on a training set of scenarios and correct diagnoses (see Section 8).

To illustrate the idea, we describe in detail one such model, the Noisy-OR Model, which is based on the same intuition as the indicative semantics just described.

We first convert the assertion graph to a directed acyclic graph (DAG). The assertion graph is not, in general, free of cycles. Additionally, the assertion graph contains matching relations, which are undirected. To form a DAG, the nodes in the assertion graph are first clustered by these matching relations, and then cycles are broken by applying heuristics to reorient edges to point from factors to hypotheses.

The confidence in factors extracted by scenario analysis is 1.0. For all other nodes, the confidence is defined recursively in terms of the confidences of the parents and the confidence of the edges produced by the QA system. Let the set of parents in the DAG for a node $n$ be given by $a(n)$. The feature vector the QA system gives for one node, $m$, indicating another, $n$, is given by $\lambda(m, n)$. The learned weight vector for the QA features is $\vec{q}$. Then the confidence for a non-factor node is given by

\[ P(n) = \bigoplus_{a_i \in a(n)} \sigma(\vec{q} \cdot \lambda(a_i, n)) \, P(a_i) \]

where $\sigma(x)$ denotes the sigmoid function:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

The noisy-OR combination is most sensible when the sources of evidence are independent. When there are two edges leading from the same inference node (which may be the result of merging two or more assertion graph nodes) to the node under consideration, these edges are combined by max rather than noisy-OR.
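
A sketch of the Noisy-OR model's recursive confidence computation under the definitions above follows; the max combination for same-cluster edges is omitted for brevity, and all data structures and values are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def node_confidence(node, parents, edge_features, q, cache):
    # parents: node -> list of DAG parents; factors have no parents.
    # edge_features: (parent, node) -> feature vector from the QA system.
    # q: learned weight vector for the QA features.
    if node in cache:
        return cache[node]
    if not parents.get(node):
        cache[node] = 1.0            # extracted factors have confidence 1.0
        return 1.0
    prob_off = 1.0
    for p in parents[node]:
        strength = sigmoid(sum(qi * fi for qi, fi in
                               zip(q, edge_features[(p, node)])))
        prob_off *= 1.0 - strength * node_confidence(
            p, parents, edge_features, q, cache)
    cache[node] = 1.0 - prob_off     # noisy-OR over the parents
    return cache[node]

parents = {"disease": ["finding"]}
features = {("finding", "disease"): [1.0, 0.5]}
print(node_confidence("disease", parents, features, q=[2.0, 1.0], cache={}))
```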

In addition to the Noisy-OR model, we have also developed the following:

• The Edge Type variant of the Noisy-OR model considers the type of the edge when propagating confidence from parents to children. The strength of the edge according to the QA model is multiplied by a per-edge-type learned weight, then a sigmoid function is applied. In this model, different types of subquestions may have different influence on confidences, even when the QA model produces similar features for them.

• The Matching Model estimates the confidence in a hypothesis according to how well each factor in the scenario, plus the answers to forward questions asked about it, matches against either the hypothesis or the answers to the backward questions asked from it. We estimate this degree of match using the term matchers described in Section 6.2.


• The Feature Addition Model uses the same DAG as the Noisy-OR model, but confidence in the intermediate nodes is computed by adding the feature values for the questions that lead to it and then applying the logistic model to the resulting vector. An effect is that the confidence for a node does not increase monotonically with the number of parents. Instead, if features that are negatively associated with correctness are present in one edge, it can lower the confidence of the node below the confidence given by another edge.

• The Causal Model attempts to capture causal semantics by expressing the confidence for each candidate as the product, over every clinical factor, of the probability that either the diagnosis could explain the factor (as estimated from Watson/QA features) or the factor "leaked": it is an unexplained observation or is not actually relevant.

In the closed-form inference systems described, there is no constraint that the answer confidences sum to one. We implement a final stage where features based on the raw confidence from the inference model are transformed into a proper probability distribution over the candidate answers.

8 Learning over Assertion Graphs

Inference methods like those described in the previous section depend on the strengths of edges created from the answers to subquestions. WatsonPaths uses supervised machine learning to learn these edge strengths from training data. There are two different kinds of training data that we can employ:

• Scenario question training data includes complete scenarios, questions about those scenarios, and answers to those questions (e.g., a detailed description of a patient, a question asking what is wrong with the patient, and the correct diagnosis).

• Subquestion training data includes simpler, atomic questions and answers to those questions (e.g., "What diseases cause joint pain?" and some answers to that question).

The WatsonPaths process that we have described up to this point assumes that we first train a subquestion answering model using subquestion training data and use the outputs of that model as confidences for inference.

But we have found that this approach has limitations, due in part to issues with our existing subquestion training data (described in Section 10). This approach also suffers from the limitation that it only uses information internal to the subquestion answering system. Some inference methods have parameters that are not based on subquestions. For example, some approaches develop a model for the degree to which two nodes match or the importance of a given node. A simple baseline for node importance is to give all nodes either equal weight or a weight based on a single simple statistic like IDF (inverse document frequency). A simple model for matching can take the confidence from a single term matcher thought to be generally effective. Graph-based features like these can be useful in combination with subquestion answering features for inference model learning.

Therefore, we have added a final step to the process that makes use of the scenario question training data. Using the assertion graphs that WatsonPaths has built for each scenario question, our goal is to learn a model that produces a probability distribution over answers with as much of the mass as possible concentrated on the correct answer. This learning is challenging because each assertion graph contains very different nodes and edges from the others, even different numbers of nodes and edges. Fortunately, the edges in these graphs do share a common set of features, such as question answering features, matching features, and node type features.

A complication is that Watson has a large quantity of question answering features, and many of them have similar purposes; e.g., many features independently assess whether the answer has the desired type (Murdock et al., 2012b). There is a subtle and indirect connection between the behavior of the subquestion answering system and the final answers in the scenario question training data; this makes it very hard for a learning system using only the scenario question training data to find an effective model over so many features. Consequently, we employ a hybrid approach. We partition the features into a small number of groups with similar purposes, and we use subquestion training data to build a separate model for each group. The outputs of these models represent a consolidated set of question answering features (with one score for each group). We then use this consolidated set of question answering features as features for learning inference models (along with the additional graph-based features).

8.1 Direct Learning

We have explored several methods for transforming an assertion graph $a$ into a function $\Phi_a : \mathbb{R}^n \to \mathbb{R}$ mapping the values of the weights to confidence in the correct hypothesis. The methods in Section 7.2 provide fast, exact inference.


These approaches permit expressing the confidence in the correct answer as a closed-form expression. Summing the log of the confidence in the correct hypothesis across the training set $T$, we construct a learning problem with the log-likelihood of the correct final answer as our objective function. The result is a function that is non-convex and, in some cases (due to max), not differentiable in the parameters.

To limit overfitting and encourage a sparse, interpretable parameter weighting, we use L1 regularization. The absolute value of all learned weights is subtracted from the objective function:

\[ \vec{w}^{\,*} = \operatorname*{argmax}_{\vec{w} \in \mathbb{R}^n} \; -\|\vec{w}\|_1 + \sum_{t \in T} \log(\Phi_t(\vec{w})) \]

To learn the parameters for the inference models, we apply a "black-box" optimization method: greedy-stochastic local search. This is a method of direct search (Kolda et al., 2003) that considers a current point $p \in \mathbb{R}^n$ and a neighborhood function mapping points to subsets of $\mathbb{R}^n$, $N : \mathbb{R}^n \to \mathcal{P}(\mathbb{R}^n)$. Additionally, the optimization procedure maintains $p^*$, the best known point. From the current point, a new point $p'$ is randomly selected from $N(p)$. If the change improves the objective function, then it is kept; if the change worsens the objective function, then it is accepted with some probability $\varepsilon$. In this way, the learning explores the parameter space, tending to search in regions of high value while never becoming stuck in a local maximum.

We use a neighborhood function $N$ related to compass search. A single parameter or a pair of parameters is selected to change by some $\delta$. Additionally, due to the L1 regularization, the neighborhood permits setting any single parameter to zero, encouraging sparse solutions.

There is no straightforward stopping criterion for this search, so we limit it by time. Empirically, we found that optimizing for longer than two hours rarely improved the objective function substantially.
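
The following is a minimal sketch of greedy-stochastic local search on a toy objective; the neighborhood (nudge a single weight by delta or zero it) loosely follows the compass-search-style neighborhood described above, and the parameter values are arbitrary assumptions.

```python
import random

def local_search(objective, n_weights, delta=0.1, eps=0.05,
                 steps=20000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * n_weights
    current = objective(w)
    best_w, best_val = list(w), current       # p*: best known point
    for _ in range(steps):
        i = rng.randrange(n_weights)
        old = w[i]
        # Neighborhood: nudge one weight by +/- delta, or zero it (sparsity).
        w[i] = rng.choice([old + delta, old - delta, 0.0])
        val = objective(w)
        if val > current or rng.random() < eps:   # greedy, with eps escapes
            current = val
            if val > best_val:
                best_w, best_val = list(w), val
        else:
            w[i] = old                            # reject the move
    return best_w, best_val

# Toy objective: a concave quadratic plus the L1 penalty from the text.
def objective(w):
    fit = -((w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2)
    return fit - sum(abs(x) for x in w)

print(local_search(objective, 2))   # best point near (0.5, -1.5)
```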

Not every $\Phi_t$ depends on every element of $\vec{w}$. Even in cases where a $\Phi_t$ depends on $\vec{w}_i$, many pieces of the function may not. To enable efficient recomputation, a preprocessor constructs for each weight $\vec{w}_i$ a DAG indicating which parts of functions will need to be recomputed, and in what order, if that weight is changed. Unchanged function parts return their cached value if used in the computation of a part that does change.

We also experimented with the Nelder-Mead simplex method (Nelder and Mead, 1965) and the multidirectional search method of Torczon (1989) but found weaker performance from these methods.

8.2 Ensemble Learning

We have multiple inference methods, each approaching the problem of combining the subquestion confidences from a different intuition and formalizing it in a different way. To combine all these different approaches, we train an ensemble. This is a final, convex confidence estimation over the multiple choice answers using the predictions of the inference models as features. The ensemble learning uses the same training set that the individual closed-form inference models use. To avoid giving excess weight to inference models that have overfit the training set, we use a common technique from stacking ensembles (Breiman, 1996). The training set is split into five folds, each leaving out 20% of the training data, as though for cross validation. Each inference model from Section 7.2 is trained on each fold. When the ensemble gathers an inference model's confidence as a feature for an instance, the inference model uses the learned parameters from the fold that excludes that instance. In this way, each inference model's performance is test-like, and the ensemble model does not overly trust overfit models.

The ensemble is a binary logistic regression per answer hypothesis using three features from each inference model. The features used are the probability of the hypothesis, the logit of the probability, and the rank of the answer among the multiple choice answers. Using the logit of the probability ensures that selecting a single inference model is in the ensemble's hypothesis space, achieved by simply setting the weight for that model's logit feature to one and all other weights to zero.

Each closed-form inference model is also trained on the full training set. These versions are applied at test time to generate the features for the ensemble.
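
The sketch below illustrates the per-model feature construction for the ensemble (probability, logit of the probability, and rank among the answers); the data layout is an assumption, and the logistic regression itself is not shown.

```python
import math

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)    # clamp away from exactly 0 or 1
    return math.log(p / (1.0 - p))

def ensemble_features(model_probs):
    # model_probs: model name -> {hypothesis -> probability}, where each
    # model's probabilities come from the fold excluding this instance.
    feats = {}
    for model, probs in model_probs.items():
        ranked = sorted(probs, key=probs.get, reverse=True)
        for hyp, p in probs.items():
            f = feats.setdefault(hyp, {})
            f[model + ":prob"] = p
            f[model + ":logit"] = logit(p)
            f[model + ":rank"] = ranked.index(hyp) + 1
    return feats

probs = {"noisy_or": {"A": 0.7, "B": 0.2},
         "matching": {"A": 0.5, "B": 0.6}}
print(ensemble_features(probs))
```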

9 Collaborative Learning Application

Applying a question-answering tool like Watson to complex, scenario-driven problems was a challenge that we did not solve until we observed how humans do this. As explained in the Introduction, an inspiration for our approach came from seeing medical students explain their reasoning about medical test preparation questions. Their process was one of identifying significant details, drawing inferences, and evaluating hypotheses. This way of attacking a problem is recognizable in the WatsonPaths execution flow. So when it came time to create an interactive application, the obvious conclusion was that it should facilitate a certain way of reasoning about complex scenarios. We call this the Collaborative Learning Application.

9.1 A Learning Application

The Collaborative Learning Application creates a workflow in which the user and Watson work together to analyze a problem. We believe that this approach will result in better solutions than if the user or Watson were working alone. We also expect it to create opportunities for the user and Watson to learn.


For the user, “to learn” is meant in the traditional sense. First, we propose that the application aids in teaching critical thinking through its advocacy of a certain reasoning process. Second, we propose that exploring Watson’s analyses provides educational value by giving a unique and relevant index into the huge body of unstructured knowledge that was examined to produce these results.

For Watson, learning is primarily in the sense of machine learning. That is, learning involves deriving labeled data from usage data and using that data to improve our statistical models. What we find interesting about the WatsonPaths application are the many opportunities to gather such data through both implicit and explicit means. We call these “opportunistic annotations,” because they are gathered in the course of using the system. There is a synergy between mimicking the way a human thinks about a problem and the way the machine analyzes a problem in developing ways to gather useful data.

If the Collaborative Learning Application can successfully produce educational value for the user or Watson, then we submit that the system (inclusive of the application and the user) learns and improves. That is, we can expect the system to provide better results over time. Of course, we know that humans are capable of this property, and so along that dimension we wish to show that the application is associated with a faster rate of learning than, for instance, using a search engine over similar corpora. And along the dimension of the machine’s improvement, we wish to show statistically significant results. As the ability to achieve this is a function of the amount and quality of usage data gathered, it will be interesting to explore what requirements this implies.

9.2 A Collaborative Application

The interactions that the application supports are best explained at the logical level. At this level we find the assertion graph and the WatsonPaths processes used to populate it.

Our initial state is an empty assertion graph, and our first operation is to describe the problem scenario through statements of fact and assertions of truth about those statements. WatsonPaths does this through the scenario analysis process, in which the input is natural language text and the output is a number of asserted statements. Through the Collaborative Learning Application, the user may choose to accept the result of this process, alter it through judgment, or bypass it altogether and create their own asserted statements. Each of these interactions produces data that we hope is valuable for improving the system.

By accepting the result of the process, there is an implicit annotation that the result has positive value in the user’s judgment.

Altering the process through judgment produces explicit annotations by the user. Note that it is a more general interaction than might be inferred from this context, since it is done at the assertion graph level and so is applicable to all of the WatsonPaths processes that operate on this data structure. A judgment is in fact the user expressing an opinion about a statement for which the system has also expressed an opinion. Expressing a similar opinion is an example of positive feedback. Expressing a dissimilar opinion is an example of negative feedback. If we can associate the system’s opinion with a particular WatsonPaths process, then we can use these judgments as feedback regarding that process. (Understanding the bias of this feedback mechanism is a concern for us.)

By creating their own asserted statements, the user is generating their opinion of the ground truth for the WatsonPaths process. Knowing the input the user was operating on to produce this ground truth allows us to derive labeled data for the process.

In this step one might infer that there is a gated process in which the user vets and augments the machine’s results prior to moving on with the analysis. Our current focus on medical test preparation questions is amenable to this approach, but for very complex inputs (such as hundreds of pages of a patient’s medical record) practicalities may necessitate a different, automatic mode of operation. Such a mode might be to allow WatsonPaths to work through the entire scenario alone and then invite the human to judge or augment the results as they see fit. In fact we are exploring both approaches, as each has desirable characteristics (primarily understandability in the case of the gated process and decision facilitation in the automatic case).

The next operation is to prioritize statements in the assertion graph for further inquiry. That is, which statements about the scenario have the most promise of producing relevant inferences? Again, there is a WatsonPaths process to do this, and the user may choose to accept the result of this process, alter it, or bypass it altogether. Similarly to the previous case, annotations and labeled data may be derived from these interactions.

Next in the WatsonPaths process is to apply a semantic template to the prioritized statements to generate inferences. This semantic template may include information like “diseases cause findings,” from which we may derive a query to infer a disease from a finding or a finding from a disease. When using Watson’s question answering function to do this, the form of this query is a natural language question. For example, imagine that the statement resting tremor (read as “the patient has resting tremor”) has been prioritized. Applying the described semantic template then would produce the question “What disease causes resting tremor?” The answers to this question are new inferred factors.

There are many opportunities for user feedback in this process, but perhaps the most interesting one is to induce a semantic template from user interactions. We envision doing this by allowing users to ask questions of statements themselves and then extracting information from them. For example, if the user asks of “resting tremor,” “What causes this?” then we might extract a semantic template instance of “things cause resting tremor.” The value of this template might be improved by knowing the type of instances, and so we may ask the user, “What type of things cause resting tremor?” to which we receive an answer of “disease” or “neurological disorder,” among other things. And through typing (a functionality of WatsonPaths), we may suppose that “resting tremor” is a finding. This may lead us to ask of the user, “Do diseases cause findings?” or “Do neurological disorders cause findings?” The user’s response may further refine the semantic template. How far to refine the semantic template and whether it should be done with respect to a prior ontology are questions open to experimentation. An added personalization aspect is that the system could learn user-specific semantic templates, allowing each user to employ their own methodology for problem solving.

Executing the queries produced in the previous operation results in new asserted statements. As mentioned before, Watson’s question answering function can be used to answer these intermediate queries. Here too, user input can help improve Watson’s results. The user can accept Watson’s results, alter them through judgment, or produce their own. For example, the user can suggest alternative names and phrasings for entities and relations in the question for use in query expansion. And as before, these interactions produce data that can be used to improve the system. In this case, these results fit nicely into AdaptWatson (Ferrucci and Brown, 2012), a methodology for improving question answering performance based on example question-answer pairs, data science, and machine learning.

An advantage of using Watson’s question answering ability for inference is that results are supported by evidential passages. These passages are typically a few sentences extracted from a document that Watson used during feature generation for a candidate answer to a question, and so they often entail the high-confidence answers. As such, exposing this evidence to the user can be very beneficial as an explanatory tool, and so it is something we emphasize in the Collaborative Learning Application. Allowing the user to judge this evidence can provide data that can be used to improve the search components of the question answering process as well as to train a justifying passage model: a model that classifies passages as justifying an answer to a question or not.

Evaluating the expanded assertion graph with respect to determining belief in the statements is the role of the belief engine. Here, judgments by the user about the significance or irrelevance of statements to the overall case can aid in how that evaluation is done.

A final step in the overall WatsonPaths process is hypothesis identification. For medical test preparation questions, hypotheses are provided, but for the Collaborative Learning Application they are not. For a particular scenario, the list of hypotheses will change as the analysis progresses. What might have begun with general disorders may end with specific diseases. And what might have begun with diagnosis might transition into treatment. By observing the patterns of usage, we hope to automate this higher level of reasoning imposed on the system.

The final assertion graph as confirmed by the user can be stored in Watson’s internal knowledge base. The knowledge base can store assertions along with confidence and provenance information (with appropriate merging and conflict resolution strategies), and it grows as more users interact with the system. This growing pool of background knowledge further improves Watson’s question answering capability.
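A minimal sketch of what such a stored assertion might look like, with an illustrative merge policy; the field names and the conflict-resolution rule are our assumptions, not the system’s actual design.

    from dataclasses import dataclass, field

    @dataclass
    class StoredAssertion:
        statement: str      # e.g., "Parkinson's disease causes resting tremor"
        confidence: float   # belief after user confirmation
        provenance: list = field(default_factory=list)  # users, sessions, passages

    def merge(existing, incoming):
        """One possible policy: keep the higher confidence and pool provenance."""
        existing.confidence = max(existing.confidence, incoming.confidence)
        existing.provenance.extend(incoming.provenance)
        return existing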

The annotations we expect to derive from the application are quite specific to the WatsonPaths processes. This is a benefit of choosing processes that mimic human reasoning and so have some degree of familiarity and intuitive performance for the user. This also leads to opportunistic annotations, or specific questions posed to the user that might generate useful data. For example, if the user were to dismiss an evidential passage that Watson had scored highly, an opportunity arises to ask why. Perhaps a particular question answering component had generated a high score for the passage. Determining whether it should not have can be helpful in improving that component. Other questions may provide axiomatic information (such as the existence of a paraphrase) that may be useful in the specific context.

9.3 A Cognitive Computing Application

The values we follow in the development of the Collaborative Learning Application are described in shorthand as cognitive computing. That notion encompasses three primary characteristics: facilitate human reasoning, communicate in a natural way, and learn and improve (Kelly and Hamm, 2013). How we are addressing the first and last of these characteristics should be clear at this point, but the second deserves further mention.

Being able to express the assertion graph to the user in an intuitive way is a challenge we are working on, but one that has generated positive feedback. Drawing on the value of concept maps (Daley and Torre, 2010), our graph-based visualization provides an approachable presentation that users understand quickly.

For example, Figure 6 shows the application during analysis of a scenario in which a patient’s symptoms lead to a diagnosis of Parkinson’s disease, which in turn leads to the answer Substantia nigra. (The ultimate question being answered is, “What part of the patient’s nervous system is most likely affected?”) The graph representation allows the user to visually navigate the result from asserted true statements (nodes arranged at the top of the screen), through inferences (white nodes in the middle of the screen), and to hypotheses (nodes arranged at the bottom of the screen). Decorations on the graph like line width and opacity give the user a sense of how belief is flowing, while significance indicators (dashed lines below factors) show which factors the belief engine favored in its choice of an answer. Something to note is that the complete result has 333 nodes and 444 edges, so editing of the graph is needed.

9.4 Current Status

The Collaborative Learning Application is a work in progress, and we are refining and exploring it in the context of our collaboration with the Cleveland Clinic Lerner College of Medicine. At that medical school, critical thinking is taught through a problem-based learning curriculum in which students work through medical scenarios as a group. The way in which the students do this has similarities to the WatsonPaths process, and so we hope that the application we are building on that functionality will be able to facilitate their thinking while providing educational value, something we hope to measure in an upcoming pilot.

10 Evaluation

As illustrated in the previous section, an interactive, collaborative clinical decision support tool can benefit from the same components and technologies needed for an automatic scenario-based medical question answering system. Thus developing and testing the automatic system in the standard way on sets of medical questions has the benefits of (1) driving the development of the core technology, (2) providing an evaluation of the automatic system, and (3) improving the components of the interactive system; such an evaluation is the subject of this section. Note that an evaluation of the interactive system itself is a separate exercise and will be reported in a future paper.

10.1 Data Sets

For the automatic evaluation of WatsonPaths, we used a set of medical test preparation questions from Exam Master and McGraw-Hill, which are analogous to the examples we have used throughout this paper. These questions consist of a paragraph-sized natural language scenario description of a patient case, optionally accompanied by a semi-structured tabular structure. The paragraph description typically ends with a punchline question and a set of multiple choice answers (on average 5.2 answer choices per question). We excluded from consideration questions that require image analysis or whose answers are not text segments.

The punchline questions may simply be seeking the most likely disease that caused the patient’s symptoms (e.g., “What is the most likely diagnosis in this patient?”), in which case the question is classified as a diagnosis question. The diagnosis question set reported in this evaluation was identified by independent annotators. Non-diagnosis punchline questions may ask about appropriate treatments, the organism causing the disease, and so on (e.g., “What is the most appropriate treatment?” and “Which organism is the most likely cause of his meningitis?” respectively).

We split our data set of 2190 questions into a training set of 1000 questions, a development set of 690 questions, and a blind test set of 500 questions. The development set was used to iteratively drive the development of the scenario analysis, relation generation, and belief engine components, and for parameter tuning. The training set was used to build models by the learning component described in Section 8.

As noted earlier, our learning process requires subquestion training data to consolidate groups of question answering features into smaller, more manageable sets of features. We do not have robust and comprehensive ground truth for a sufficiently large set of our automatically generated subquestions. Instead, we use a pre-existing set of simple factoid medical questions as subquestion training data: the Doctor’s Dilemma (DD) question set (American College of Physicians, 2014). DD is an established benchmark used to assess performance in factoid medical question answering. We use 1039 DD questions (with a known answer key) as our subquestion training data. Although the Doctor’s Dilemma questions do have some basic similarity to the subquestions we ask in assertion graphs, there are some important differences:

• In an assertion graph subquestion, there is usually one known entity and one relation that is being asked about. For DD, the question may constrain the answer by multiple entities and relations.

• An assertion graph subquestion like “What causes hypertension?” has many correct answers, whereas DD questions have a single best answer.

• There may be a mismatch between how confidence for DD is trained and how subquestion confidence is used in an inference method. The DD confidence model is trained to maximize log-likelihood on a correct/incorrect binary classification task. In contrast, many probabilistic inference methods use confidence as something like strength of indication or relevance.

[Figure 6: WatsonPaths User Interface]

For all these reasons, DD data is poorly suited to training a complete model for judging edge strength for subquestion edges in WatsonPaths. But we have found that DD data is useful as subquestion training data [1] in the hybrid learning approach described in Section 8; we use 1039 DD questions for consolidating question answering features and then use the smaller, consolidated set of features as inputs to the inference models that are trained on the 1000 medical test preparation questions.

10.2 Experimental Setup

For comparison purposes, we used our Watson question answering system adapted for the medical domain (Ferrucci et al., 2013) as a baseline system. This system takes the entire scenario as input and evaluates each multiple choice answer based on its likelihood of being the correct answer to the punchline question. This one-shot approach to answering medical scenario questions contrasts with the WatsonPaths approach of decomposing the scenario, asking questions of atomic factors, and performing probabilistic inference over the resulting graphical model.

[1] We are also investigating the use of actual subquestions generated by WatsonPaths as training data. Building a comprehensive answer key for such questions is very time consuming, and an incomplete answer key can be less effective. Although this approach has not yet succeeded, it may still succeed if we invest much more in building a bigger, better answer key for actual WatsonPaths subquestions.

We tuned various parameters in the WatsonPaths system on the development set to balance speed and performance. The system performs one iteration each of forward and backward relation generation. The minimum confidence threshold for expanding a node is 0.25, and the maximum number of nodes expanded per iteration is 40. In the relation generation component, the Watson medical question answering system returns all answers with a confidence above 0.01.
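For concreteness, these settings can be summarized as a configuration object; the field names below are illustrative, not the system’s actual parameter names.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class WatsonPathsConfig:
        # Illustrative names for the tuned settings reported above.
        forward_iterations: int = 1            # forward relation-generation passes
        backward_iterations: int = 1           # backward relation-generation passes
        min_expand_confidence: float = 0.25    # node-expansion confidence threshold
        max_nodes_per_iteration: int = 40      # cap on nodes expanded per iteration
        min_answer_confidence: float = 0.01    # keep Watson answers above this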

We evaluate system performance both on the full test set and on the diagnosis subset only. The reason for evaluating the diagnosis subset separately is that, in the vast majority of these questions, either the punchline question seeks the diagnosis or it depends on a correct diagnosis along the way. We use the full 1000 questions in the training set to learn the models for both the baseline system and the WatsonPaths system. As noted earlier, Doctor’s Dilemma training data is used to consolidate question answering features in the WatsonPaths system. We did not use Doctor’s Dilemma training data for any purpose in the baseline system.


Metric                      System        Full     Diagnosis
Accuracy                    Baseline      42.0%    53.8%
Accuracy                    WatsonPaths   48.0%*   64.1%*
Confidence Weighted Score   Baseline      59.8%    75.3%
Confidence Weighted Score   WatsonPaths   67.5%*   81.8%

Table 1: WatsonPaths performance results (* marks a statistically significant improvement over the baseline at p < 0.05)

10.3 Results and Discussion

Table 1 shows the results of our evaluation on a set of 500 blind questions, of which a subset of 156 questions were identified as diagnosis questions by annotators.

We report results using two metrics. Accuracy simply measures the percentage of questions for which a system ranks the correct answer in top position. Confidence Weighted Score is a metric that takes into account both the accuracy of the system and its confidence in producing the top answer (Voorhees, 2003). We sort all <question, top answer> pairs in an evaluation set in decreasing order of the system’s confidence in the top answer and compute the confidence weighted score as

    CWS = (1/n) * Σ_{i=1}^{n} (number correct in first i ranks) / i

where n is the number of questions in the evaluation set. This metric rewards systems for more accurately assigning high confidences to correct answers, an important consideration for real-world question answering and medical diagnosis systems.
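As a worked example of the metric (a straightforward transcription of the formula above, not code from the system):

    def confidence_weighted_score(results):
        """CWS over (confidence, is_correct) pairs, one per question. Pairs are
        sorted by decreasing confidence in the system's top answer, and
        CWS = (1/n) * sum_i (number correct in first i ranks) / i."""
        ranked = sorted(results, key=lambda r: r[0], reverse=True)
        correct_so_far, total = 0, 0.0
        for i, (_, is_correct) in enumerate(ranked, start=1):
            correct_so_far += is_correct
            total += correct_so_far / i
        return total / len(ranked)

    # Toy run: three questions, correct at ranks 1 and 3.
    print(confidence_weighted_score([(0.9, True), (0.6, False), (0.3, True)]))
    # -> (1/1 + 1/2 + 2/3) / 3 ≈ 0.722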

Our results show statistically significant improvements at p < 0.05 (results marked with an asterisk in Table 1) on the full blind set of 500 questions for both metrics. For the diagnosis subset, the accuracy improvement is statistically significant, but the confidence weighted score improvement is not, even with a score increase of more than six points. This is likely due to the small size of the diagnosis subset, which contains only 156 questions.

11 Related Work

Clinical decision support systems (CDSS) have had a long history of development, starting from the early days of artificial intelligence. These systems use a variety of knowledge representations, reasoning processes, system architectures, scopes of medical domain, and types of decision (Musen et al., 2014). Although several studies have reported on the success of CDSS implementations in improving clinical outcomes (Kawamoto et al., 2005; Roshanov et al., 2013), widespread adoption and routine use is still lacking (Osheroff et al., 2007).

The pioneering Leeds abdominal pain system (De Dombal et al., 1972) used structured knowledge in the form of conditional probabilities for diseases and their symptoms. Its success at using Bayesian reasoning was comparable to experienced clinicians at the Leeds hospital where it was developed. But it did not adapt successfully to other hospitals or regions, indicating the brittleness of some systems when they are separated from their original developers. A recent systematic review of 162 CDSS implementations shows that success at clinical trials is significantly associated with systems that were evaluated by their own developers (Roshanov et al., 2013). MYCIN (Shortliffe, 1976) was another early system which used structured representation in the form of production rules. Its scope was limited to the treatment of infectious diseases and, as with other systems with structured knowledge bases, it required expert humans to develop and maintain these production rules. This manual process can prove to be infeasible in many medical specialties where active research produces new diagnosis and treatment guidelines and phases out older ones. Many CDSS implementations mitigate this limitation by focusing their manual decision logic development effort on clinical guidelines for specific diseases or treatments, e.g., hypertension management (Goldstein et al., 2001). But such systems lack the ability to handle patient comorbidities and concurrent treatment plans (Sittig et al., 2008). Another notable system that used structured knowledge was Internist-1. Its knowledge base contained disease-to-finding mappings represented as conditional probabilities (of disease given finding and of finding given disease) mapped to a 1–5 scale. Despite initial success as a diagnostic tool, its design as an expert consultant was not considered to meet the information needs of most physicians. Eventually, its underlying knowledge base helped its evolution into an electronic reference that can provide physicians with customized information (Miller et al., 1986). A similar system, DXplain (Barnett et al., 1987), continues to be commercially successful and extensively used. Rather than focus on a definitive diagnosis, it provides the physician with a list of differential diagnoses along with descriptive information and bibliographic references.

Other systems in commercial use have adopted the unstructured medical text reference approach directly, using search technology to provide decision support. Isabel provides diagnostic support using natural language processing of medical textbooks and journals. Other commercial systems like UpToDate and ClinicalKey forgo the diagnostic support and provide a search capability over their medical textbooks and other unstructured references. Although search over unstructured content makes it easier to incorporate new knowledge, it shifts the reasoning load from the system back to the physician.

In comparison to the above systems, WatsonPaths uses a hybrid approach. It uses question-answering technology over unstructured medical content to obtain answers to specific subquestions generated by WatsonPaths. For this task, it builds on the search functionality by extracting answer entities from the search results and seeking supporting evidence for them in order to estimate answer confidences. These answers are then treated as inferences by WatsonPaths over which it can perform probabilistic reasoning without requiring a probabilistic knowledge base.

Another major area of difference between CDSS implementations is the extent of their integration with the health information system and workflow used by physicians. Studies have shown that CDSS are most effective when they are integrated within the workflow (Kawamoto et al., 2005; Roshanov et al., 2013). Many of the guideline-based CDSS implementations are integrated with the health information system and workflow, having access to the data being entered and providing timely decision support in the form of alerts. But this integration is limited to the structured data contained in a patient’s electronic medical record. When a CDSS requires information like findings, assessments, or plans in clinical notes written by a healthcare provider, existing systems are unable to extract them. As a result, search-based CDSS remain a separate consultative tool. The scenario analysis capability of WatsonPaths provides the means to analyze these unstructured clinical notes and serves as a means for integration into the health information system.

A major point of differentiation between the CDSS implementations described above and the design of WatsonPaths is its ability to serve as a collaborative problem solving tool as described in Section 9. When teamed with a student, the role of WatsonPaths approaches that of intelligent tutoring systems (Woolf, 2009). Key differences exist, however, in the representation of domain knowledge and student knowledge. Most tutoring systems have a structured representation of the domain knowledge, carrying with it the same knowledge update and maintenance issues faced by CDSS implementations. WatsonPaths lacks a student model (or in general a model of the collaborator), which is a key capability of intelligent tutoring systems. As a result, it cannot guide or customize the tutoring according to student needs, relying instead on an instructor’s choice of the problem scenario to be used.

12 Conclusions and Further Work

WatsonPaths is a system for scenario-based question answering that has a graphical model at its core. It includes a collaborative decision support tool that allows users to understand and contribute to the reasoning process. We have developed WatsonPaths on a set of multiple choice questions from the medical domain. On this test set, WatsonPaths shows a significant improvement over Watson. Although the test preparation question set has been important for the early development of the system, we have designed WatsonPaths to function well beyond it. In future work, we plan to extend WatsonPaths in several ways.

The present set of questions consists entirely of multiple choice questions. This means that hypotheses have already been identified, and it is also known that exactly one of the hypotheses is the correct answer. Although these constraints have made the early development of scenario-based question answering more straightforward, the overall WatsonPaths architecture does not rely on them. For instance, we can easily remove the confidence re-estimation phase for the closed-form inference systems and the “exactly one” constraint from the belief engine. Also, it will be straightforward to add a simple hypothesis identification step to the main loop. One way to do this is to find nodes whose type corresponds to the type being asked about in the punchline question. We already find such correspondences in the base Watson system (Chu-Carroll et al., 2012). In the collaborative application, we are exploring ways of having the user help identify hypotheses.

We also plan to extend WatsonPaths beyond the medical domain. For medical applications, it might have been easier to design Watson with certain medical aspects hardcoded into the flow of execution. Instead we designed the overall flow, as well as each component, to be general across domains. Note that the Emerald could be replaced by a structure from a different domain, and the basic semantics we have explored (matching, indicative, and causal) have no requirement that the graph structure come from medicine. Even the causal aspect of the belief engine could apply to any domain that involves diagnostic inference (e.g., automotive repair). Most importantly, the way that subquestions are answered is completely general. By asking the right subquestions and using the right corpus, we can apply WatsonPaths to any scenario-based question answering problem. We hope to develop a toolbox of expansion strategies, relation generators, and inference mechanisms that can be reused as we apply WatsonPaths to new domains.

The most important area for further work is the collaborative user application. In the early development of the system, it was necessary to focus on automatic performance (as presented in Section 10) to create a viable scenario-based question answering system. As this performance improves, we are focusing more on how WatsonPaths can interact better with users. We plan to develop and more rigorously evaluate how WatsonPaths learns from users and how users learn from WatsonPaths.

In a fully automatic system, the user receives an answer using little or no time or cognitive effort. In a collaborative system, the user spends some time and effort, and potentially gets a better answer. We suspect that, in many applications of scenario-based question answering, this will be an attractive tradeoff for the user, because of the complexity of the scenario and the importance of the answer. Our objective is to minimize the time and effort required of users and maximize the benefit they receive. The combination of the user and WatsonPaths should be able to handle more difficult problems more quickly than either could alone.

References

American College of Physicians. 2014. Doctor’s Dilemma competition. http://www.acponline.org/residents_fellows/competitions/doctors_dilemma/.

G. Octo Barnett, James J. Cimino, Jon A. Hupp, and Edward P. Hoffer. 1987. DXplain: An evolving diagnostic decision-support system. JAMA, 258(1):67–74.

Judith L. Bowen. 2006. Educational strategies to promote clinical diagnostic reasoning. New England Journal of Medicine, 355(21):2217–2225.

E. Boyd, Kenneth W. Kennedy, Richard A. Tapia, and Virginia Joanne Torczon. 1989. Multi-directional search: A direct search algorithm for parallel machines. Technical report, Rice University.

Leo Breiman. 1996. Stacked regressions. Machine Learning, 24(1):49–64.

Rowland W. Chang, Georges Bordage, and Karen J. Connell. 1998. Cognition, confidence, and clinical skills: The importance of early problem representation during case presentations. Academic Medicine, 73(10):S109–111.

Patricia W. Cheng. 1997. From covariation to causation: A causal power theory. Psychological Review, 104(2):367.

Jennifer Chu-Carroll, James Fan, Branimir K. Boguraev, David Carmel, Dafna Sheinwald, and Chris Welty. 2012. Finding needles in the haystack: Search and candidate generation. IBM Journal of Research and Development, 56(3/4):6:1–6:12.

Barbara J. Daley and Dario M. Torre. 2010. Concept maps in medical education: An analytical literature review. Medical Education, 44(5):440–448.

F. T. De Dombal, D. J. Leaper, John R. Staniland, A. P. McCann, and Jane C. Horrocks. 1972. Computer-aided diagnosis of acute abdominal pain. British Medical Journal, 2(5804):9.

David Ferrucci and Eric Brown. 2012. AdaptWatson: A methodology for developing and adapting Watson technology. Technical Report RC25244, IBM Research Division.

David Ferrucci, Eric Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya A. Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John Prager, Nico Schlaefer, and Chris Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31:59–79.

David Ferrucci, Anthony Levas, Sugato Bagchi, David Gondek, and Erik T. Mueller. 2013. Watson: Beyond Jeopardy! Artificial Intelligence, 199–200:93–105.

David Ferrucci. 2012. Introduction to “This is Watson”. IBM Journal of Research and Development, 56(3/4):1:1–1:15.

M. K. Goldstein, B. B. Hoffman, R. W. Coleman, S. W. Tu, R. D. Shankar, M. O’Connor, S. Martins, A. Advani, and M. A. Musen. 2001. Patient safety in guideline-based decision support for hypertension management: ATHENA DSS. In Proceedings of the AMIA Symposium, page 214. American Medical Informatics Association.

David C. Gondek, Adam Lally, Aditya Kalyanpur, J. William Murdock, Pablo A. Duboue, Lei Zhang, Yue Pan, Zhao Ming Qiu, and Chris Welty. 2012. A framework for merging and ranking of answers in DeepQA. IBM Journal of Research and Development, 56(3/4):14:1–14:12.

Kensaku Kawamoto, Caitlin A. Houlihan, E. Andrew Balas, and David F. Lobach. 2005. Improving clinical practice using clinical decision support systems: A systematic review of trials to identify features critical to success. British Medical Journal, 330:765–72.

John E. Kelly and Steve Hamm. 2013. Smart machines: IBM’s Watson and the era of cognitive computing. Columbia University Press, New York.

Tamara G. Kolda, Robert Michael Lewis, and Virginia Torczon. 2003. Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45:385–482.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pages 282–289.

Adam Lally, John M. Prager, Michael C. McCord, Branimir K. Boguraev, Siddharth Patwardhan, James Fan, Paul Fodor, and Jennifer Chu-Carroll. 2012. Question analysis: How Watson reads a clue. IBM Journal of Research and Development, 56(3/4):2:1–2:14.

Michael C. McCord, J. William Murdock, and Branimir K. Boguraev. 2012. Deep parsing in Watson. IBM Journal of Research and Development, 56(3/4):3:1–3:15.

Randolph A. Miller, Melissa A. McNeil, Sue M. Challinor, Fred E. Masarie Jr, and Jack D. Myers. 1986. The Internist-1/Quick Medical Reference project – status report. Western Journal of Medicine, 145(6):816.

J. William Murdock, James Fan, Adam Lally, Hideki Shima, and Branimir K. Boguraev. 2012a. Textual evidence gathering and analysis. IBM Journal of Research and Development, 56(3/4):8:1–8:14.

J. William Murdock, Aditya Kalyanpur, Chris Welty, James Fan, David Ferrucci, David C. Gondek, Lei Zhang, and Hiroshi Kanayama. 2012b. Typing candidate answers using type coercion. IBM Journal of Research and Development, 56(3/4):7:1–7:13.

Mark A. Musen, Blackford Middleton, and Robert A. Greenes. 2014. Clinical decision-support systems. In Biomedical Informatics, pages 643–674. Springer.

National Library of Medicine. 2009. UMLS reference manual. Bethesda, MD: National Library of Medicine (US). http://www.ncbi.nlm.nih.gov/books/NBK9676/.

J. A. Nelder and R. Mead. 1965. A simplex method for function minimization. Computer Journal, 7:308–313.

Jerome A. Osheroff, Jonathan M. Teich, Blackford Middleton, Elaine B. Steen, Adam Wright, and Don E. Detmer. 2007. A roadmap for national action on clinical decision support. Journal of the American Medical Informatics Association, 14(2):141–145.

Judea Pearl. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, San Francisco.

J. M. Prager, J. Chu-Carroll, and K. Czuba. 2004. Question answering using constraint satisfaction: QA-by-dossier-with-constraints. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 575–582, Barcelona.

Pavel S. Roshanov, Natasha Fernandes, Jeff M. Wilczynski, Brian J. Hemens, John J. You, Steven M. Handler, Robby Nieuwlaat, Nathan M. Souza, Joseph Beyene, Harriette G. C. Van Spall, Amit X. Garg, and R. Brian Haynes. 2013. Features of effective computerised clinical decision support systems: Meta-regression of 162 randomised trials. British Medical Journal, 346(f657).

Edward H. Shortliffe. 1976. MYCIN: Computer-based medical consultations. Elsevier, New York.

Dean F. Sittig, Adam Wright, Jerome A. Osheroff, Blackford Middleton, Jonathan M. Teich, Joan S. Ash, Emily Campbell, and David W. Bates. 2008. Grand challenges in clinical decision support. Journal of Biomedical Informatics, 41(2):387–392.

Ellen M. Voorhees. 2003. Overview of TREC 2002. In Proceedings of the Text REtrieval Conference.

Beverly Park Woolf. 2009. Building intelligent interactive tutors. Morgan Kaufmann, Burlington, MA.

Alan L. Yuille and Hongjing Lu. 2007. The noisy-logical distribution and its application to causal inference. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. Curran Associates, Inc.