Dynamic Search Using Semantics & Statistics

Text Mining - Bayesian Topic Modeling for Interactive Retrieval

at SAP and Cisco

Ram Akella, University of California and Stanford

With Karla Caballero, Maria Daltayanni, Chunye Wang (UCSC) and Paul Hofmann (SAP Labs)

October 6, 2011 SAP

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo


Motivation


SEARCH

Depression treatment of patients…

Depression influence on family relationships…

DOCTOR

SOCIAL SCIENTIST

q1: elderly depression

q2: depression symptoms

q3: symptoms and treatment

User expects to find more relevant results each time she interacts with the system

Relevance of the presented documents depends on user context

Interactive Retrieval Model

[Diagram labels: Query; User Feedback; Feedback and propagation to similar documents; Information need update; Document Collection; Metadata Generation System; Interactive Retrieval System]


Metadata Generation System

Adds metadata to each document to facilitate the retrieval process. This metadata consists of:

1. Statistical topic mixture
2. Knowledge extraction based on the business process (problem, cause, solution)

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
  – Motivation
  – Related Work
  – Proposed Approach
  – Topic Modeling and Entity Association
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo

Topic Modeling: Motivation

• Given a set of documents, we want to identify the main areas or topics discussed, in an unsupervised manner. We take advantage of the semantic associations between words across the documents: if two words appear in the same document, they should be related.
• Each topic has its own distribution over words, and each document may contain material about a variety of topics.

[Figure: an example document as a mixture of topics. Topic 1 (80%): Sports; Topic 2 (5%); Topic 3 (20%): Common Words. Example words shown: play, music, sports, net, game, ball, racquet, notes, instrument.]
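To make the idea of "documents as mixtures of topics" concrete, here is a minimal Python sketch. The two topics, their word probabilities, and the document mixture are invented for illustration and are not taken from the talk; the only formula used is the standard mixture rule p(word | doc) = Σ_k p(topic k | doc) · p(word | topic k).

```python
# Toy topic model: each topic is a distribution over words, and each
# document is a mixture over topics. All numbers are invented for illustration.
topics = {
    "sports": {"ball": 0.4, "net": 0.2, "game": 0.3, "play": 0.1},
    "music": {"notes": 0.5, "instrument": 0.4, "play": 0.1},
}


def word_probability(word, mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0)
               for name, weight in mixture.items())


# A hypothetical document that is mostly about sports.
doc_mixture = {"sports": 0.8, "music": 0.2}
p_play = word_probability("play", doc_mixture, topics)  # shared by both topics
p_ball = word_probability("ball", doc_mixture, topics)  # sports only
```

An ambiguous word such as "play" receives probability from both topics, while "ball" is explained by the sports topic alone.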

Related Work

Models compared: LDA [2003], Correlated Topics [2005], Pachinko Allocation Model [2006], and our model (GD-LDA).

Comparison criteria:
• Complexity based on the number of topics K (entries shown: K, 2K)
• Speed
• Scalable
• Handles topic correlations
• Effective topic selection and truncation

Our Approach

– The higher probability mass is accommodated in the upper part of the tree (this facilitates truncation and a reduction in the number of topics)
– We can define a method to determine the number of topics suitable for a particular dataset without training the model several times (once for each specified number of topics)
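One way to picture why concentrating mass at the top makes truncation easy is a stick-breaking view of the topic weights. The sketch below is an illustration under that assumption only: the slides do not give the model equations, and the function names, the break fractions, and the 0.9 mass threshold are all hypothetical.

```python
def stick_breaking(fractions):
    """Turn break fractions v_1..v_{K-1} in (0, 1) into K topic weights.

    Topic k receives v_k times the mass left over by topics 1..k-1, and the
    last topic receives whatever remains, so the weights sum to 1.
    """
    weights, remaining = [], 1.0
    for v in fractions:
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def truncate(weights, mass=0.95):
    """Smallest number of leading topics whose cumulative weight reaches `mass`."""
    total = 0.0
    for k, w in enumerate(weights, start=1):
        total += w
        if total >= mass:
            return k
    return len(weights)


# Hypothetical break fractions; early topics absorb most of the mass.
weights = stick_breaking([0.5, 0.5, 0.5, 0.5])
kept = truncate(weights, mass=0.9)
```

Because the leading weights dominate, a single trained model can be cut back to the first `kept` topics instead of retraining once per candidate number of topics.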

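[Figure: topic tree with example topics and node weights. Top words per topic:
• bush, campaign, mccain, bradley, republican, candidate
• film, show, music, movie, story, play
• company, percent, stock, market, price, rate
• patient, disease, people, study, medic, health
• peace, talk, syrian, clinton, syria, golan
Weights shown: 0.0096, 0.0146, 0.0310, 0.0660, 0.0851.]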

Experimental Setup

The datasets are of two types:
• Scientific articles (NIPS)
  – Longer documents
• News data (NYT, APW, XIE)
  – Shorter documents
  – More diverse vocabulary

• We compare the performance of the algorithm against three approaches from the literature: LDA, CTM and Pachinko
• We test our model using empirical likelihood
  – This method estimates how likely it is that a test document would be generated from the estimated model.
  – We want this value to be high (better generalization and applicability to unseen documents).

Dataset | NIPS | NYT | APW | XIE
# documents | 1840 | 5553 | 4954 | 5275
# unique terms | 13649 | 11229 | 6955 | 3890
Doc length | 1322 | 274 | 170 | 81
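The empirical likelihood measure mentioned in the setup can be sketched as a Monte Carlo estimate: sample topic mixtures from the model, score the held-out document under each sample, and average. The Python below is a simplified illustration, not the evaluation code used in the talk. It assumes a plain Dirichlet prior over mixtures (as in LDA), and `topic_word`, `alpha`, and the toy values are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one topic mixture from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def log_empirical_likelihood(doc, topic_word, alpha, n_samples=1000, seed=0):
    """Monte Carlo estimate of log p(doc) under a trained topic model.

    doc:        list of word ids in the held-out document
    topic_word: topic_word[k][w] = p(word w | topic k)
    alpha:      parameters of the (assumed) Dirichlet prior over topic mixtures
    """
    rng = random.Random(seed)
    log_likes = []
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        # log p(doc | theta) = sum over words of log sum_k theta_k * p(w | k)
        ll = sum(math.log(sum(t * row[w] for t, row in zip(theta, topic_word)))
                 for w in doc)
        log_likes.append(ll)
    # Log of the average likelihood, computed stably with log-sum-exp.
    top = max(log_likes)
    return top + math.log(sum(math.exp(l - top) for l in log_likes) / n_samples)


# Toy model with 2 topics over a 3-word vocabulary.
topic_word = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
uniform_word = [[1 / 3] * 3, [1 / 3] * 3]
score_typical = log_empirical_likelihood([0, 0, 0], topic_word, [1.0, 1.0])
score_uniform = log_empirical_likelihood([0, 0, 0], uniform_word, [1.0, 1.0])
```

Higher values mean the model finds the unseen document more plausible; here the model with a topic that favors word 0 scores the document above the uniform model.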

Results: NYT Dataset

We obtain the topic mixture for the NYT dataset using K = 20 topics.

[Figure: topics found in the NYT dataset, connected by links marked + or -. Top words per topic:
• year, live, century, museum, people, music, time, star, book
• story, p, journal, constitution, time, editor, budget, york
• military, war, nuclear, president, politic, chechnya, power, soviet
• internet, information, technology, service, i, people, e, busy, work, mail
• computer, make, hand, system, tv, people, program, network, dont, drive, call
• study, patient, people, univers, disease, medic, increase, care, state
• bank, problem, economy, system, investor, percent, price, investment, economist, financial
• drug, state, united, talk, nato, clinton, american]

Results: Empirical Likelihood

[Figure: empirical likelihood in four panels (APW, NIPS, NYT and XIE datasets); one curve is marked "Our Model".]

Results: Running Time

[Figure: running time in minutes in four panels (APW, NIPS, NYT and XIE datasets); one curve is marked "Our Model".]

Illustrative Example: NYT Dataset


NORTHRIDGE TAUGHT A LESSON

LOS ANGELES _ School has been out at Cal State Northridge since the week before Christmas, but since you can learn something everyday, Mississippi State's women's basketball team gave a lesson. Northridge has talked about taking its game to the next level. The 21st-ranked Bulldogs _ the first nationally ranked team to play here in Northridge's Division I era _ gave a glimpse of that level in a 98-64 nonconference victory before a crowd of 165 Friday night.




Topic Modeling & Entity Association

This work was presented at SAPPHIRE NOW 2010

[Diagram labels:
• Base knowledge source: Valukas Report about why Lehman Brothers failed (6 volumes)
• Text data to be monitored
• UCSC Topic Mining System: topics
• SAP Business Objects Entity Extractor: entities
• Saffron Associative Memory Base: Saffron Associative Memory creates associations among entities and topics
• Query: we would like to know who the actors are that were involved in a particular action that led to the failure of Lehman Brothers]

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
  – Knowledge Extraction System
  – System Architecture
  – Domain Knowledge
  – Improving Productivity
  – Performance of Service Request Recommender
• Interactive Retrieval
• Interactive Retrieval Demo

[Diagram: the Text Mining System takes the unstructured text of a service request from the Service Request Database and turns it into knowledge stored in a Knowledge Database, for applications such as retrieval. Each service request is split into:
• Problem: What was the problem?
• Cause: Why did it occur?
• Solution: How was it solved?
• Irrelevant content]

[Diagram: Document 1 and Document 2, each split into Problem, Cause and Solution, are compared part by part; similarity labels shown: high, high, low.]

Finding different solutions to the same problem

Knowledge Extraction System at Cisco

[Diagram components: Service Request; Preprocessor; Bag-of-words; Feature Generator; Domain Knowledge; Expertise; Hierarchical Classifier; Labeled Paragraphs; Service Request Recommender; User. Legend: data flow of Analyzer; data flow of Recommender; data output for User.]

System Architecture

Type | Feature | Class and motivation
Statistical features | Length of paragraph | Short paragraphs are usually irrelevant.
Statistical features | Relative position of a paragraph in a service request | Service requests have the hidden process "problem → cause → solution".
Statistical features | Number of "%" | Error codes (relevant) begin with "%".
Contextual features | Contain "Hi", "Hello", "my name", or "I'm" | Introduction, irrelevant
Contextual features | Contain "feel free", "to contact", or "have a ... day"; begin with "Best" or "Thank" | Salutation, irrelevant
Contextual features | Telephone number, zip code, or affiliation | Contact information, irrelevant
Hint words | Contain "problem", "error message" or "symptom" | Problem
Hint words | Contain "suspect", "seem", "looks like", "indicate", "try", "test", or "check" | Troubleshooting
Hint words | Contain "recommend", "suggest", "replace", "reseat", "RMA", or "workaround" | Solution
Lexical features | Number of words from domain dictionary | Usually relevant
Lexical features | Product name | Usually relevant
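A few of the features in the table above could be computed per paragraph roughly as follows. This Python sketch is illustrative only: the function and dictionary names are hypothetical, the matching rules (whole-word for greetings, word-prefix for hint words) are assumptions, and it is not the feature generator used at Cisco.

```python
import re

GREETINGS = ("hi", "hello", "my name", "i'm")
HINTS = {
    "problem": ("problem", "error message", "symptom"),
    "troubleshooting": ("suspect", "seem", "looks like", "indicate",
                        "try", "test", "check"),
    "solution": ("recommend", "suggest", "replace", "reseat", "rma",
                 "workaround"),
}


def has_word(text, phrase):
    """Case-insensitive whole-word (or whole-phrase) match."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def has_prefix(text, phrase):
    """Case-insensitive match at a word start, so "seem" also hits "seems"."""
    pattern = r"\b" + re.escape(phrase)
    return re.search(pattern, text, re.IGNORECASE) is not None


def paragraph_features(paragraph, index, total):
    """Feature dict for paragraph number `index` (0-based) out of `total`."""
    features = {
        "length": len(paragraph.split()),
        "relative_position": index / max(total - 1, 1),
        "percent_signs": paragraph.count("%"),
        "greeting": any(has_word(paragraph, g) for g in GREETINGS),
    }
    for label, words in HINTS.items():
        features["hint_" + label] = any(has_prefix(paragraph, w) for w in words)
    return features


# Hypothetical middle paragraph of a three-paragraph service request.
example = paragraph_features(
    "We suspect a bad line card; the log shows %LINK-3-UPDOWN.", 1, 3)
```

A classifier would then consume such dictionaries to label each paragraph as problem, cause, solution, or irrelevant.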

Features from Expertise

- Internetworking Terms and Acronyms Dictionary (ITAD)
- Benefits: (1) the expansion of acronyms and terminology; (2) the enhancement of concept dependencies.
- Example:

Snippet from Doc1:
The phone boots up and it does a DHCP [Dynamic Host Configuration Protocol. Provides a mechanism for allocating IP addresses dynamically so that addresses can be reused when hosts no longer need them] request in the native VLAN [virtual LAN]. There it gets an IP address [32-bit address assigned to hosts using TCP/IP] and an option that it needs to boot up in the VLAN 40 and that it need to go in trunking [physical and logical connection between two switches across which network traffic travels] mode.

Snippet from Doc2:
Host Server with 2 interfaces [connection between two systems or devices] and one default gateway. When ping Vlan-B [virtual LAN] interface an ARP [Address Resolution Protocol. Internet protocol used to map an IP address to a MAC address] request with a source IP of Vlan-B is sent to Default Router [network layer device that uses one or more metrics to determine the optimal path along which network traffic should be forwarded. Routers forward packets from one network to another based on network layer information] on Vlan-A, but Router does not respond to ARP request.

[…]: explanation from ITAD. Blue: overlapping words between unexpanded excerpts. Red: overlapping words introduced by ITAD.

Measuring similarity
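The effect of dictionary expansion on similarity can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the two-entry glossary stands in for ITAD (glosses abridged from the snippets above), the documents are shortened versions of the two snippets, and Jaccard overlap is used as a simple stand-in because the slides do not name the similarity measure.

```python
import re

# Two-entry stand-in for the dictionary; glosses abridged for illustration.
GLOSSARY = {
    "DHCP": "Dynamic Host Configuration Protocol allocating IP addresses dynamically",
    "ARP": "Address Resolution Protocol used to map an IP address to a MAC address",
}


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def expand(text, glossary):
    """Append the gloss of every glossary term that occurs in the text."""
    found = [gloss for term, gloss in glossary.items()
             if re.search(r"\b" + re.escape(term) + r"\b", text)]
    return " ".join([text] + found)


def jaccard(a, b):
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


doc1 = "The phone boots up and it does a DHCP request"
doc2 = "Router does not respond to ARP request"
before = jaccard(doc1, doc2)
after = jaccard(expand(doc1, GLOSSARY), expand(doc2, GLOSSARY))
```

The expanded texts share terms such as "protocol" and "ip" that the raw excerpts do not, so `after` exceeds `before`.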

Domain Knowledge

[Flowchart steps: Browse a service request; Relevant? (Y/N); Read and understand thoroughly; Read enough? (Y/N); Create knowledge article. Two time spans are marked: time to assess relevance, and time to extract knowledge.]

Improving Productivity

Compare the time spent by engineers reading service requests before and after using our system.

 | Time to assess relevance | Time to extract knowledge
Before using system | 27 minutes | 97 minutes
After using system | 11 minutes | 67 minutes
Productivity improved by | 145% | 45%
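The improvement percentages are consistent with measuring productivity as tasks completed per unit of time, so the gain is (time before / time after) - 1. That reading is inferred from the numbers rather than stated on the slide; the short Python check below uses a hypothetical helper name.

```python
def productivity_gain(minutes_before, minutes_after):
    """Percentage increase in throughput when a task takes less time."""
    return (minutes_before / minutes_after - 1.0) * 100.0


assess_gain = round(productivity_gain(27, 11))   # time to assess relevance
extract_gain = round(productivity_gain(97, 67))  # time to extract knowledge
```

Rounded to whole percentages, these give 145 and 45, matching the table.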

Performance of Service Request Recommender

Result 1: Both the deterministic and the probabilistic models achieved much better results when labeled paragraphs were used; this validates our hypothesis of an inherent diagnostic business process.

Result 2: Using domain knowledge further improves retrieval results.

Result 3: The probabilistic recommender outperformed the deterministic recommender.

Retrieval Schemes

 | Baseline | Our Method
Retrieval models | Deterministic model | Probabilistic model
Information | The whole document | The semantically labeled paragraphs
Domain knowledge | None | Dictionary

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
  – Problem
  – Reinforcement Learning Formulation
  – How many interaction steps needed
  – How much feedback is needed
  – Interactive Retrieval Using Topic Modeling
• Interactive Retrieval Demo

Interactive Retrieval

• Model the user intent to retrieve relevant documents
• Identify the trade-off between
  – Retrieval accuracy (how accurate does the user require the results to be?)
  – Interaction time (how much time is the user willing to spend on interaction?)
• Applied to
  – Medical document retrieval
    • e.g., search for past patient cases with similar symptoms
  – Resume retrieval in a labor marketplace
    • e.g., search for Python developers who work in machine learning

MORE IMPORTANT

LESS IMPORTANT


Problem

What is the best path to choose?

[Diagram: paths from the User Intent to the Set of Relevant Documents over steps t1, t2, t3, …, tn. Strategies shown: Static, Myopic, Dynamic. Dynamic approaches: Dynamic Programming, Reinforcement Learning.]

Reinforcement Learning formulation of IIR

Agent | IIR system
Environment | User
Intent | Best guess for user intent or need (expressed in query terms)
Action | Ranking R_k
Reward | Improvement v(R_k) - v(R_{k-1}) (as observed from user feedback)
Objective | Maximize the sum of rewards
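The reward definition above can be traced on a toy session. In this Python sketch the ranking values are invented, and `rewards` is a hypothetical helper rather than part of the system described in the talk.

```python
def rewards(values):
    """Reward at step k is the improvement v(R_k) - v(R_{k-1})."""
    return [cur - prev for prev, cur in zip(values, values[1:])]


# Hypothetical values v(R_0), ..., v(R_3) of the rankings shown in one session.
session_values = [0.3, 0.5, 0.4, 0.7]
step_rewards = rewards(session_values)
total_reward = sum(step_rewards)
```

With an undiscounted sum the rewards telescope, so the total equals v(R_3) - v(R_0). A step with a negative reward (the second ranking here is worse than the first) can therefore still be worthwhile if it leads to a better ranking later, which is what separates a dynamic policy from a myopic one.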

Experiments Set-Up

• Dataset: TREC-9 OHSUMED, 348,566 medical documents
  – with a list of relevance judgments
• 65 user queries
  – query title: 2 to 5 words
  – query description: 5 to 10 words
• Interactive sessions of 3 to 5 steps
• Relevance function is binary
• Value of results (with appropriate weights w_i)
  – Precision@10: percentage of relevant documents in the top-10 results
  – We compare our results with pseudo-relevance feedback
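Precision@10 with binary relevance judgments can be computed as in the following Python sketch. The document ids and judgments are invented for illustration.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked document ids that are judged relevant."""
    top = ranking[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / k


# Hypothetical ranking and relevance judgments for one query.
ranking = ["d7", "d2", "d9", "d4", "d1", "d3", "d8", "d5", "d6", "d0"]
relevant = {"d2", "d4", "d5", "d11"}
p10 = precision_at_k(ranking, relevant)
```

Three of the ten returned documents are judged relevant, so `p10` is 0.3; relevant documents that were not retrieved (here "d11") do not affect the measure.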

How many interaction steps needed?


How much feedback is needed?

[Figure: precision@10 (y-axis, 0.60 to 0.85) versus the number of documents on which feedback is provided per step (x-axis, 1 to 7).]

Experiments run on the OHSUMED medical dataset (348,566 documents), TREC 2002

Interactive Retrieval with Topic Modeling

• Topics help us to narrow the search
  – They add context to the query
  – Some important terms that describe the user's intent may not be included in the query
  – Topics are calculated a priori and added to each document as metadata

[Diagram labels:
– Topic mixture of relevant docs
– Topic mixture of non-relevant docs
– Combination of term and topic relevance scores
– Meta-query (combination of user inputs): updated each time the user provides feedback (clicks) or additional information to the system (query redefinition)]
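One simple way to combine term and topic relevance scores is a linear interpolation, sketched below in Python. This is an assumption made for illustration: the slides do not give the scoring formula, and `combined_score`, the weight `lam`, and the use of cosine similarity to the centroids of relevant and non-relevant topic mixtures are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(mixtures):
    """Component-wise mean of a non-empty list of topic mixtures."""
    return [sum(col) / len(mixtures) for col in zip(*mixtures)]


def combined_score(term_score, doc_topics, relevant, non_relevant, lam=0.5):
    """Interpolate a term-based score with a topic-based relevance score.

    The topic score rewards similarity to documents the user marked relevant
    and penalizes similarity to those marked non-relevant.
    """
    topic_score = cosine(doc_topics, centroid(relevant))
    if non_relevant:
        topic_score -= cosine(doc_topics, centroid(non_relevant))
    return lam * term_score + (1.0 - lam) * topic_score


# Hypothetical feedback: topic mixtures of clicked and skipped documents.
relevant = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
non_relevant = [[0.1, 0.1, 0.8]]
on_topic = combined_score(0.4, [0.7, 0.2, 0.1], relevant, non_relevant)
off_topic = combined_score(0.4, [0.1, 0.2, 0.7], relevant, non_relevant)
```

Two documents with the same term score are separated by their topic mixtures: the one resembling the clicked documents ranks higher. Re-running the scoring after each round of feedback mirrors the meta-query update.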

Proposed Dataset

• We test our approach using the TREC HARD queries and collection, which consists of:
  – 851,018 news documents from the NYT, APW and XIE agencies
  – Each document has an average length of 305 terms
  – There are 496,779 unique terms
  – We infer the topic information of the corpus using 75 topics
  – For testing purposes we use m = 3 interactions
  – We test with 30 queries
  – We compare our algorithm with mixture relevance feedback

Preliminary Results

[Figure: precision (y-axis, 0.1 to 1) versus number of interactions (x-axis, 1 to 3), comparing the Mixture and State Based methods.]

Outline

• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo

Example User Intent

• young female with fevers and increased CPK (Creatine PhosphoKinase)
  – CPK: enzyme, may cause heart attack or severe muscle breakdown if increased
• neuroleptic malignant syndrome (life-threatening neurological disorder)
  – Associated with CPK
  – Symptoms: muscular cramps, fever, unstable blood pressure, changes in cognition, including agitation, delirium and coma
• differential diagnosis
  – List symptoms
  – List causes of the symptoms
  – Prioritize by the most dangerous
  – Treat
• treatment

Relevant Documents

• Non-relevant documents:

Doc 1: Significance of elevated levels of CPK in febrile diseases: a prospective study. The incidence and significance of elevated serum levels of (CPK) in febrile diseases were studied prospectively in all patients admitted with fever to a department of medicine during 1 year.

Doc 2: Metoclopramide-induced neuroleptic malignant syndrome…. Symptoms of NMS include rigidity, hyperpyrexia, altered consciousness, and autonomic instability. This syndrome is generally associated with neuroleptic medications used to treat psychotic and major depressive illnesses…

• Relevant document:

Doc 3: Neuroleptic malignant syndrome: guidelines for treatment and reinstitution of neuroleptics… Cardinal symptoms include fever, muscular rigidity, an elevated serum level of creatine phosphokinase, changes in mental status, and autonomic dysfunction…

Interactive Demo

• InteractiveDemo_MedicalData

• Sub-queries
  – young female with fevers and increased CPK
  – neuroleptic malignant syndrome
  – differential diagnosis
  – treatment

http://maraki.selfip.net:8080/interactive_medical/search.jsp

Technology
