SEQUENCE PACKAGE ANALYSIS: A New Natural Language Understanding Method for Performing Data Mining of Help-Line Calls and Doctor-Patient Interviews

AMY NEUSTEIN, Ph.D.
LINGUISTIC TECHNOLOGY SYSTEMS
[email protected]

PRESENTATION TO NLUCS Workshop at ICEIS
University of Portugal
April 13, 2004

WHY DO WE NEED A NEW NATURAL LANGUAGE METHOD?

1) In the real world speakers do not always use “key” words that appear in the application vocabulary, which can lead to a poor word match between the user’s input and the application vocabulary.

2) Building a Statistical Language Model that accommodates the various ways users speak requires a large data corpus that is costly to assemble, and even then there is no guarantee that an accurate word match will be found.

APPLICATIONS OF SEQUENCE PACKAGE ANALYSIS:

1) An “add on” layer of intelligence to audio data mining programs used for recorded help-line calls to extract business intelligence data and to detect early warning signs of caller frustration.

2) An “add on” layer of intelligence for mining doctor-patient interviews to uncover important medical history data, often buried in the ambiguity of patient dialog.

How Does Sequence Package Analysis (SPA) Work?

SPA provides a “filter” for the front end of a speech recognizer, using generic templates that can be deployed in many different applications and languages; SPA can be used with vector-based models that hold spaces and determine “global weighting” of lexical items.

SPA parses NL dialog to locate a series of related turns that are discretely packaged as a sequence of conversational interaction.

SPA locates entire sequence packages rather than isolated key words, operating on the principle that it is easier to find a generic sequence package in a dialog than specific keywords. That is, speakers are more likely to vary in their choice of keywords than in their conversational sequence patterns, making it more difficult for a speech application to represent a speaker’s wide range of word choices than to represent actual conversational sequence patterns.

METHODOLOGICAL BASIS OF SPA

SPA draws mainly from the field of conversation analysis: the study of the orderly properties of interactive dialog that revolve around the turn-taking process and other sequentially based features that are part of that process, such as the production of recycled turn beginnings when there is an overlap with a prior turn.

SPA focuses on social action and how human-machine and human-human dialog is accomplished as a situated, interactive event. The discourse structures are therefore analyzed for their social interactive value rather than solely for their grammatical discourse structure.

ALGORITHMIC DESIGN OF SPA

SPA algorithms, which are currently under development, consist of sequences that are either small segments of dialog or large sequences that can potentially span the entire dialog.

But regardless of the size of the sequence package, the purpose of SPA is to locate the indigenous patterns in the dialog that evolve as the dialog unfolds.

By using SPA to parse Natural Language dialog, those features which are evolving and dynamic (e.g., early warning signs of caller frustration; or a patient’s concerns about an illness) can be detected by grammars that are flexible enough to recognize dynamic patterns.

THE HEURISTIC VALUE OF SPA

1. Building Application Vocabularies:

The SPA method of parsing dialog allows the discovery of new words, to be added to the application vocabulary, by locating the generic sequence packages in which such words appear.

2. Gathering Business Intelligence and Medical History Data:

By tracking the nature and frequency of sequence packages, the system can identify important business intelligence data and medical history data that would ordinarily have eluded the system.

VALIDATION OF SEQUENCE PACKAGE ANALYSIS

Does the addition of SPA improve speech recognition capabilities?

Hypothesis “A”: By adding an SPA filter to a speech recognizer to improve analysis of speech input, one can significantly streamline the corpus of data required to build a Statistical Language Model.

Hypothesis “B”: By adding an SPA filter to a Statistical Language Model that contains the full spectrum of possible utterances (as opposed to a streamlined corpus of data), the SLM can better differentiate among multiple utterances accepted by the recognizer.

USING SPA IN THE CALL CENTER: MINING HELP-LINE CALLS FOR BUSINESS INTELLIGENCE DATA

A caller needs a service call, but rather than using words in the application vocabulary such as “service call” or “technician,” the frustrated caller says the following, either to the IVR-driven auto attendant at the help-line desk or to the human agent at the call center:

Caller: “I really can’t do this myself. I can’t get this to work without someone coming here. I really don’t know what to do with this.”

Finding the Sequence Package in the Dialog

Example: The sequence package consists of a repeated use of pronouns (and similar unnamed referents), standing in place of nouns, in very close proximity:

• a short, condensed complaint-- referenced by pronouns (“I really can’t do this myself”)

• the amplification of the source of the trouble (and the request for assistance) but with the frequent use of pronouns that have no stated subject/object referents (“I can’t get this to work without someone coming here”)

• a recycling of the first part of the complaint with the same patterned use of pronouns in place of nouns (“I really don’t know what to do with this”)

FILTERING THE INPUT

First, the SPA “filter” would direct the speech engine to the second part of the complaint utterance, the amplification of the source of the trouble (and request for assistance):

“I can’t get this to work without someone coming here”

Second, rather than run the whole utterance through the SLM, only the second part of the complaint would be run through the SLM to find its closest statistical approximation.

Third, once the closest word match is made to this second part of the complaint, the SLM would then add this “new” phrase to the application vocabulary.
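The three-part package and the first filtering step can be sketched roughly as follows. This is only an illustration, not the SPA algorithm itself (which is described as still under development): the word list `UNNAMED_REFERENTS`, the function names, and the rule "three consecutive sentences that each contain at least one unnamed referent" are all assumptions made for the example. The SLM lookup in the second and third steps is not modeled.

```python
import re

# Hypothetical list of pronouns and similar unnamed referents (an assumption).
UNNAMED_REFERENTS = {"this", "that", "it", "here", "there", "someone", "something"}


def split_sentences(utterance):
    """Split an utterance into sentences on terminal punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", utterance)
    return [p.strip() for p in parts if p.strip()]


def referent_count(sentence):
    """Count the unnamed referents that occur in one sentence."""
    normalized = sentence.lower().replace("’", "'")
    words = re.findall(r"[a-z']+", normalized)
    return sum(1 for w in words if w in UNNAMED_REFERENTS)


def find_amplification(utterance):
    """Return the middle sentence of the first run of three consecutive
    sentences that each contain an unnamed referent, or None."""
    sentences = split_sentences(utterance)
    for i in range(len(sentences) - 2):
        window = sentences[i:i + 3]
        if all(referent_count(s) >= 1 for s in window):
            return window[1]  # the "amplification" part of the package
    return None


caller = ("I really can’t do this myself. "
          "I can’t get this to work without someone coming here. "
          "I really don’t know what to do with this.")
print(find_amplification(caller))
```

In this sketch, only the sentence returned by `find_amplification` would be handed on to the SLM, rather than the whole utterance.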

MINING HELP-LINE CALLS FOR SIGNS OF CALLER FRUSTRATION

• An SPA-driven mining program would look for conversational sequence patterns [instead of key words or changes in prosody] to detect signs of caller frustration.

• While speakers vary widely in their choice of words and in their stress patterns [some speakers may increase their pitch when upset while others may not], their conversational sequence patterns -- which are derived from the highly systematic properties that guide the production of talk-- nevertheless remain consistent across a wide spectrum of callers.

Early Signs of Caller Frustration

Australian Help-Line Desk

Caller: “I’ve installed Office 97 and…I was a bit stupid. I went into uninstall and um pulled off a whole stack of items off the uninstall and it was a very silly thing to do so now when I start up my computer I get a screen um which say um a black- a black and white screen which says never delete this item. It’s a message screen and every time I start up it comes up……[deleted text]………………………………………

Caller: “I’m wondering if I reinstall will I wipe out [my documents]”

Agent: “Okay, well look I could certainly have a technician look at the problem for you; we do charge for are you aware of that?”

Caller: “I’m just asking a question - I’m just wondering whether or not I should uninstall Microsoft Word?”

USING SPA TO LOCATE THE RELEVANT CONVERSATIONAL SEQUENCE PATTERNS

Step One: Locate the pre-question phrases to reports of troubles and requests for assistance:

“I’m wondering if”

“I’m just asking a question”

“I’m just wondering whether or not”

Step Two: Quantify the number of times such pre-question phrases occur and their proximity to one another.

Step Three: Determine whether they escalate or, alternatively, diminish.
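The three steps might be sketched as below. This is a toy illustration under stated assumptions, not the SPA implementation: the phrase list contains only the three phrases quoted here, proximity is approximated by counting phrases within a single caller turn, and the names `count_pre_questions`, `pre_question_profile`, and `trend` are hypothetical.

```python
import re

# Step One: pre-question phrases quoted on the slide (a real template
# would presumably be broader; this short list is an assumption).
PRE_QUESTION_PHRASES = [
    "i'm just wondering whether or not",
    "i'm just asking a question",
    "i'm wondering if",
]


def count_pre_questions(turn):
    """Count pre-question phrases inside one caller turn."""
    text = turn.lower().replace("’", "'")
    return sum(len(re.findall(re.escape(p), text)) for p in PRE_QUESTION_PHRASES)


def pre_question_profile(caller_turns):
    """Step Two: per-turn counts, in dialog order. Phrases that share a
    turn are treated as being in close proximity."""
    return [count_pre_questions(t) for t in caller_turns]


def trend(profile):
    """Step Three: compare the last turn containing a phrase with the first."""
    hits = [c for c in profile if c > 0]
    if len(hits) < 2 or hits[-1] == hits[0]:
        return "steady"
    return "escalating" if hits[-1] > hits[0] else "diminishing"


caller_turns = [
    "I’m wondering if I reinstall will I wipe out [my documents]",
    "I’m just asking a question - I’m just wondering whether or not "
    "I should uninstall Microsoft Word?",
]
profile = pre_question_profile(caller_turns)
print(profile, trend(profile))
```

Counting per turn is one simple way to capture both frequency and proximity; a finer-grained version could measure the word distance between phrases instead.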

ANALYSIS

The caller to the Australian help-line began her report of the trouble as a long-winded narrative, but with the noticeable absence of a request for help.

The caller later produced pre-question phrases when she made her request for help; however, these phrases began to escalate (by being combined with one another) just at the point where she began to show signs of frustration: “I’m just asking a question - I’m just wondering whether or not I should uninstall Microsoft Word?”

As one can see, such conversational sequence patterns evolve within the dynamic flow of dialog. By applying an SPA approach one can pinpoint these indigenous features of talk that evade standard speech recognizers.

MINING MEDICAL INTERVIEWS

THE PROBLEM:

• Patients often give very important medical history data about themselves and other family members at the wrong place in the medical encounter (such as at the very end of the medical interview or during a routine physical exam) when the doctor is less likely to be paying attention in that he has already gone over those areas with the patient.

• When patients give medical information at the wrong place in the interview, the data can be lost because the doctor’s attention is now focused on other things.

MEDICAL INTERVIEWS

The Solution:

SPA locates specific conversational sequence patterns in which crucial medical history data is embedded.

By locating those sequence package templates, important medical history data can be extracted--similar to the way business intelligence data can be extracted from help-line calls.

ILLUSTRATION

Patient withholds vital family history data about osteosarcoma (bone cancer).

Patient discloses this information at the point in the medical encounter (viz., during a brief medical exam) when discussions of family history data were no longer the main topic.

Patient embeds this history data about bone cancer in the form of a narrative, as if she were casually telling a “story” to a neighbor or friend, presumably hoping that by downplaying its significance the doctor would give it much less attention than if she had come out with it directly when queried about family illnesses.

DIALOG SAMPLE

Patient: “I become terribly worried about my pain, which reminds me of the arthritic pain that my sister had, which turned out to be bone cancer, so I worry whenever I have pain because I don’t know if it is what she had.”

THE SEQUENCE PACKAGE TEMPLATE: A HIGH USAGE OF NARRATIVE PHRASES IN CLOSE PROXIMITY

SEQUENCE PACKAGE DIVIDED INTO 4 PARTS:

• a short, condensed and somewhat nonspecific concern preceded by a narrative phrase:

“I become terribly worried about my pain”

• an expansion of the concern, citing the troublesome datum (“bone cancer”), which is embedded within two narrative predicates:

“which reminds me of the arthritic pain that my sister had, which turned out to be bone cancer”

SEQUENCE PACKAGE, CONT.

• a recycling of the nonspecific concern preceded by a narrative phrase:

“so I worry whenever I have any pain”

• a reference back to the expanded concern, but only with the use of pronouns that serve as anaphors, referring back to the expanded concern:

“because I don’t know if it is what she had”

EXTRACTING MEDICAL HISTORY DATA BY USING SPA

The SPA “filter” would direct the speech engine to search for specific content material embedded within the two narrative predicates, appearing in the second part of the four-part sequence package (“which reminds me of…which turned out to be...”).

By searching the sequence package templates, the mining program uncovers important family history data (arthritic pain, ultimately diagnosed as bone cancer) that the patient buried in the interview by using an informal narrative style, replete with anaphors and nonspecific referents, and by offering this family history data AFTER the physician had already completed his review of family medical history.
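As a rough illustration of searching inside the two narrative predicates, the sketch below uses a single regular expression. It is an assumption-laden toy: the pattern `NARRATIVE_PREDICATES`, the field names `symptom` and `diagnosis`, and the function `extract_history` are hypothetical, and an actual SPA template would need to be generic rather than tied to these exact words.

```python
import re

# Hypothetical pattern for the second part of the four-part package:
# "which reminds me of <X> ... which turned out to be <Y>".
NARRATIVE_PREDICATES = re.compile(
    r"which reminds me of (?P<symptom>.+?),? "
    r"which turned out to be (?P<diagnosis>[^,.;]+)",
    re.IGNORECASE,
)


def extract_history(utterance):
    """Return the content embedded within the two narrative predicates,
    or None when the pattern does not occur."""
    match = NARRATIVE_PREDICATES.search(utterance)
    if match is None:
        return None
    return {
        "symptom": match.group("symptom").strip(),
        "diagnosis": match.group("diagnosis").strip(),
    }


patient = ("I become terribly worried about my pain, which reminds me of "
           "the arthritic pain that my sister had, which turned out to be "
           "bone cancer, so I worry whenever I have pain because I don’t "
           "know if it is what she had.")
print(extract_history(patient))
```

The extracted fields could then be flagged for the physician's review even though they were offered outside the family-history portion of the interview.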

Mining Wiretapped Communications

The following example shows how, by applying an SPA approach to wiretapped dialog, one can flag important security information that is cleverly disguised by the suspects:

ILLUSTRATION

Speaker “A” is trying to educate Speaker “B” about a new meeting place whose location is very important. Any confusion or misunderstanding about this meeting place could spoil the plans.

But Speaker “A” is very clever:

First, he stays away from buzz words (such as naming a bridge, a tunnel or a street).

Second, he refrains from making any comments about how vital it is to get these instructions right.

Dialog Example

Speaker “A”: Come to the intersection near Juniors? (the question mark shows an upward intonation) 0.2 - 0.5 second pause (speaker then pauses briefly)

Speaker “B”: 1.2 second pause

Speaker “A”: You know the thoroughfare with the big traffic light?

Speaker “B”: Juniors, yeah.

THE SEQUENCE PACKAGE

Speaker “A”: Come to the intersection near Juniors? 0.2-0.5

Speaker “B”: 1.2 seconds of silence

• A noun referent (“Juniors”) with an upward intonation

• A brief pause, giving the listener the chance to show recognition or ask for clarification.

• Silence by the listener which indicates lack of understanding or confusion.

SEQUENCE PACKAGE CONT.

Speaker “A”: You know the thoroughfare with the big traffic light?

Speaker “B”: Juniors, yeah.

• Speaker “A” produces a clarification of the noun referent (“Juniors”)

(“You know the thoroughfare with...”)

• Speaker “B” produces a repeat of the noun referent (“Juniors”), the source of the recognition trouble, followed by a recognitional marker (“yeah”), which demonstrates to Speaker “A” that he has “corrected” the misunderstanding. Had he simply produced a recognitional marker (“yeah”) without mentioning the source of the trouble (“Juniors”), there would be no indication to the other speaker that he now recognizes the importance of this meeting place.
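The sequence package in this dialog could be represented as a four-turn pattern, as in the sketch below. Everything in it is an assumption made for illustration: the `Turn` structure, the one-second `MIN_SILENCE` threshold, the `RECOGNITION_MARKERS` set, and the shortcut of treating the last word of the first turn as the noun referent. Intonation and pause lengths are assumed to be supplied by an upstream speech engine.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str = ""         # empty text models a silent turn
    rising: bool = False   # final upward intonation (the "?" in the transcript)
    duration: float = 0.0  # seconds of silence, used only for silent turns


MIN_SILENCE = 1.0  # assumed threshold for a meaningful listener silence
RECOGNITION_MARKERS = {"yeah", "yes", "right"}  # assumed marker list


def words(text):
    """Lowercase the words of a turn and strip surrounding punctuation."""
    return [w.strip("?.!,").lower() for w in text.split()]


def find_flagged_referent(turns):
    """Return the noun referent of the first four-turn package found:
    referent with rising intonation, listener silence, clarification,
    then a repeat of the referent plus a recognitional marker."""
    for i in range(len(turns) - 3):
        a1, b1, a2, b2 = turns[i:i + 4]
        first = words(a1.text)
        if not a1.rising or not first:
            continue
        referent = first[-1]
        silent = b1.text == "" and b1.duration >= MIN_SILENCE
        clarified = a2.speaker == a1.speaker and a2.text != ""
        reply = words(b2.text)
        confirmed = (b2.speaker == b1.speaker
                     and referent in reply
                     and any(w in RECOGNITION_MARKERS for w in reply))
        if silent and clarified and confirmed:
            return referent
    return None


dialog = [
    Turn("A", "Come to the intersection near Juniors?", rising=True),
    Turn("B", duration=1.2),
    Turn("A", "You know the thoroughfare with the big traffic light?", rising=True),
    Turn("B", "Juniors, yeah."),
]
print(find_flagged_referent(dialog))
```

Note that a final reply of only “Yeah.” would not complete the package in this sketch, mirroring the point that a bare recognitional marker gives no indication of which referent was recognized.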

CODA

SPA provides a new NLU method for designing intelligent software packages that can serve as a “filter” for the front end of a speech recognizer.

Since the SPA templates are generic, they can be deployed in many different applications and across many languages to do the following:

1) extract business intelligence data from call center recordings;

2) detect early warning signs of caller frustration in a help-line call;

3) uncover important medical history data buried in the medical interview; and

4) learn the plans and operations of suspected terrorists.