DESCRIPTION
This presentation shows 3 applications of successfully combining semantics and statistics for text mining and interactive search. 1) We predict the Lehman bankruptcy using statistical topic modeling, SAP Business Objects entity extraction and associative memories (powered by Saffron Technologies). 2) We semi-automatically handle service requests at Cisco using knowledge extraction and knowledge reuse. 3) We discover user intent for interactive retrieval. User intent is defined as a latent state. The observations of this latent state are the reformulated query sequence, and the retrieved documents, together with the positive or negative feedback provided by the user. Demo shows recognizing user’s intent for health care search.
Text Mining - Bayesian Topic Modeling for Interactive Retrieval
at SAP and Cisco
Ram Akella, University of California and Stanford
With Karla Caballero, Maria Daltayanni, Chunye Wang (UCSC) and Paul Hofmann (SAP Labs)
October 6, 2011 SAP
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo
Motivation
SEARCH
• DOCTOR: Depression treatment of patients…
• SOCIAL SCIENTIST: Depression influence on family relationships…
• Query sequence: q1: elderly depression; q2: depression symptoms; q3: symptoms and treatment
• The user expects to find more relevant results each time she interacts with the system
• The relevance of the presented documents depends on the user context
Interactive Retrieval Model
[Figure: system diagram with the components Query, User Feedback, Feedback and propagation to similar documents, Information Need Update, Document Collection, Metadata Generation System, and Interactive Retrieval System]
Metadata Generation System
Adds metadata to each document to facilitate the retrieval process. This metadata consists of:
1. Statistical topic mixture
2. Knowledge extraction based on the business process (problem, cause, solution)
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
– Motivation
– Related Work
– Proposed Approach
– Topic Modeling and Entity Association
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo
Topic Modeling: Motivation
• Given a set of documents, we want to identify the main areas or topics discussed, in an unsupervised manner. We take advantage of the semantic associations between words across the documents: if two words appear in the same document, they should be related.
• Each topic has its own distribution over words, and each document may contain material about a variety of topics.
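[Figure: the word "Play" can belong to Music or to Sports. Example document mixture: Topic 1 (80%) Sports, Topic 2 (5%), Topic 3 (20%) Common Words. Sample words for Topic 1 (Sports): net, game, ball, racquet. Other sample words: notes, instrument]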
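The mixture idea above can be sketched in a few lines. This is a minimal illustration, not the authors' model: the two topics, their word probabilities, and the 80/20 document mixture are hypothetical values chosen only to show how a word's probability in a document is a weighted sum over topics.

```python
# Hypothetical per-topic word distributions (each sums to 1).
topics = {
    "sports": {"ball": 0.4, "net": 0.2, "game": 0.3, "play": 0.1},
    "music": {"notes": 0.4, "instrument": 0.4, "play": 0.2},
}


def word_probability(word, doc_mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0)
        for name, weight in doc_mixture.items()
    )


# A hypothetical document that is mostly about sports, with a little music.
doc_mixture = {"sports": 0.8, "music": 0.2}

p_play = word_probability("play", doc_mixture, topics)  # 0.8*0.1 + 0.2*0.2
p_ball = word_probability("ball", doc_mixture, topics)  # 0.8*0.4 + 0.2*0.0
print(round(p_play, 3), round(p_ball, 3))
```

The ambiguous word "play" receives probability from both topics, while "ball" is explained by the sports topic alone.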
Related Work
Models compared: LDA [2003], Correlated Topics [2005], Pachinko Allocation Model [2006], Our Model (GD-LDA)
Comparison criteria:
• Complexity based on # of topics K: K, 2K
• Speed
• Scalable
• Handles topic correlations
• Effective topic selection and truncation
Our Approach
– The higher probability mass is accommodated in the upper part of the tree (this facilitates the truncation and reduction in the number of topics)
– We can define a method to determine the number of topics suitable for a particular dataset without training the model several times (each time for a given number of specified topics)
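[Figure: topic tree with sample topics and their top words. Topics: (bush, campaign, mccain, bradley, republican, candidate); (film, show, music, movie, story, play); (company, percent, stock, market, price, rate); (patient, disease, people, study, medic, health); (peace, talk, syrian, clinton, syria, golan). Values shown in the figure: 0.0096, 0.0146, 0.0310, 0.0660, 0.0851]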
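One way to picture truncation by probability mass is the sketch below. It is an assumption-laden illustration rather than the GD-LDA procedure: the function name `truncate_topics`, the coverage threshold, and the weights are all hypothetical, and the weights are simply assumed to arrive with the heaviest topics first.

```python
def truncate_topics(weights, mass=0.95):
    """Return how many leading topics are needed to cover `mass` of the total weight.

    `weights` are assumed to be ordered with the heaviest topics first.
    """
    total = sum(weights)
    running = 0.0
    for count, weight in enumerate(weights, start=1):
        running += weight
        if running >= mass * total:
            return count
    return len(weights)


# Hypothetical topic weights for one dataset, heaviest first.
weights = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02]
kept = truncate_topics(weights, mass=0.85)
print(kept)  # the first four topics cover at least 85% of the mass
```

Because the number of topics is read off a single set of weights, no retraining for each candidate number of topics is needed in this sketch.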
Experimental Setup
The datasets are of two types:
• Scientific articles (NIPS)
– Longer documents
• News data (NYT, APW, XIE)
– Shorter documents
– More diverse vocabulary
• We compare the performance of the algorithm against three approaches from the literature: LDA, CTM and Pachinko
• We test our model using empirical likelihood
– This method estimates how likely it is that a test document will be generated from the estimated model.
– We want this value to be high (better generalization and applicability to unseen documents).
Dataset          NIPS    NYT     APW     XIE
# documents      1840    5553    4954    5275
# unique terms   13649   11229   6955    3890
Doc length       1322    274     170     81
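The empirical likelihood idea can be sketched as follows. This is a simplified, assumed version and not the evaluation code behind the reported results: topic mixtures are drawn from a Dirichlet prior, the test document's likelihood is averaged over those draws, and the topic-word tables, the prior `alpha`, and the `1e-6` floor for unseen words are all hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one topic mixture from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def empirical_log_likelihood(doc, topic_word, alpha, n_samples=1000, seed=0):
    """Log of the average p(doc | theta) over mixtures theta sampled from the model.

    doc: list of word tokens; topic_word: list of dicts word -> p(word | topic).
    """
    rng = random.Random(seed)
    log_terms = []
    for _ in range(n_samples):
        theta = sample_dirichlet(alpha, rng)
        log_p = 0.0
        for word in doc:
            # Assumed floor of 1e-6 for words a topic has never seen.
            p = sum(t * tw.get(word, 1e-6) for t, tw in zip(theta, topic_word))
            log_p += math.log(p)
        log_terms.append(log_p)
    # Stable log-mean-exp over the sampled mixtures.
    top = max(log_terms)
    return top + math.log(sum(math.exp(v - top) for v in log_terms) / n_samples)


# Hypothetical two-topic model (sports, music).
topic_word = [
    {"ball": 0.4, "net": 0.2, "game": 0.4},
    {"notes": 0.5, "instrument": 0.5},
]
alpha = [0.5, 0.5]

ll_seen = empirical_log_likelihood(["ball", "game", "net"], topic_word, alpha)
ll_unseen = empirical_log_likelihood(["ball", "notes", "zebra"], topic_word, alpha)
print(ll_seen > ll_unseen)
```

A document built from words the model explains well scores higher than one containing a word no topic covers, which is the sense in which a high value indicates better generalization.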
Results: NYT Dataset
We obtain the topic mixture for the NYT dataset using K=20 topics.
[Figure: sample NYT topics linked by positive (+) and negative (-) correlations. Topic word lists:
• year, live, century, museum, people, music, time, star, book
• story, p, journal, constitution, time, editor, budget, york
• military, war, nuclear, president, politic, chechnya, power, soviet
• internet, information, technology, service, i, people, e, busy, work, mail
• computer, make, hand, system, tv, people, program, network, dont, drive, call
• study, patient, people, univers, disease, medic, increase, care, state
• bank, problem, economy, system, investor, percent, price, investment, economist, financial
• drug, state, united, talk, nato, clinton, american]
Results: Empirical Likelihood
[Figure: empirical likelihood on the APW, NIPS, NYT and XIE datasets (four panels), with our model labeled]
Results: Running Time
[Figure: running time in minutes on the APW, NIPS, NYT and XIE datasets (four panels), with our model labeled]
Illustrative Example: NYT Dataset
NORTHRIDGE TAUGHT A LESSON
LOS ANGELES _ School has been out at Cal State Northridge since the week before Christmas, but since you can learn something everyday, Mississippi State's women's basketball team gave a lesson. Northridge has talked about taking its game to the next level. The 21st-ranked Bulldogs _ the first nationally ranked team to play here in Northridge's Division I era _ gave a glimpse of that level in a 98-64 nonconference victory before a crowd of 165 Friday night.
Topic Modeling & Entity Association
This work was presented at SAPPHIRE NOW 2010
[Figure: pipeline components]
• Base knowledge source: Valukas Report about why Lehman Brothers failed (6 volumes)
• Text data to be monitored
• UCSC Topic Mining System: topics
• SAP Business Objects Entity Extractor: entities
• Saffron Associative Memory Base: Saffron Associative Memory creates associations among entities and topics
• Query: we would like to know who the actors are that were involved in a particular action that led to the failure of Lehman Brothers
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
– Knowledge Extraction System
– System Architecture
– Domain Knowledge
– Improving Productivity
– Performance of Service Request Recommender
• Interactive Retrieval
• Interactive Retrieval Demo
Knowledge Extraction System at Cisco
[Figure: a Text Mining System turns unstructured text from service requests in the Service Request Database into knowledge, stored in a Knowledge Database that supports applications such as retrieval. For each service request it answers three questions and sets aside irrelevant content:
• What was the problem? (Problem)
• Why did it occur? (Cause)
• How was it solved? (Solution)
Example, finding different solutions to the same problem: Document 1 and Document 2 show high similarity in Problem, high similarity in Cause, and low similarity in Solution.]
System Architecture
[Figure: system components: Service Request, Preprocessor, Bag-of-words, Expertise, Feature Generator, Domain Knowledge, Hierarchical Classifier, Labeled Paragraphs, Service Request Recommender, User. Legend: data flow of Analyzer; data flow of Recommender; data output for User]

Features from Expertise
Each entry lists the feature, followed by its class and motivation.
• Statistical features
– Length of paragraph: short paragraphs are usually irrelevant.
– Relative position of a paragraph in a service request: service requests have the hidden process “problem → cause → solution”.
– Number of “%”: error codes (relevant) begin with “%”.
• Contextual features
– Contain “Hi”, “Hello”, “my name”, or “I’m”: introduction, irrelevant
– Contain “feel free”, “to contact”, or “have a ... day”; begin with “Best” or “Thank”: salutation, irrelevant
– Telephone number, zip code, or affiliation: contact information, irrelevant
• Hint words
– Contain “problem”, “error message” or “symptom”: Problem
– Contain “suspect”, “seem”, “looks like”, “indicate”, “try”, “test”, or “check”: Troubleshooting
– Contain “recommend”, “suggest”, “replace”, “reseat”, “RMA”, or “workaround”: Solution
• Lexical features
– Number of words from domain dictionary: usually relevant
– Product name: usually relevant
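A few of the paragraph features listed above can be sketched as a small feature extractor. This is an illustrative assumption, not the Cisco system's code: the function names, the matching rules (whole-word matching for greetings, prefix matching for hint words), and the sample paragraphs with their made-up error code are hypothetical.

```python
import re

HINT_WORDS = {
    "problem": ["problem", "error message", "symptom"],
    "troubleshooting": ["suspect", "seem", "looks like", "indicate", "try", "test", "check"],
    "solution": ["recommend", "suggest", "replace", "reseat", "rma", "workaround"],
}
GREETINGS = ["hi", "hello", "my name", "i'm"]


def contains(text, phrase, whole_word=False):
    """Case-insensitive match starting at a word boundary; optionally ending at one too."""
    pattern = r"\b" + re.escape(phrase) + (r"\b" if whole_word else "")
    return re.search(pattern, text, re.IGNORECASE) is not None


def paragraph_features(paragraph, index, total):
    """Statistical, contextual and hint-word features for one paragraph."""
    features = {
        "length": len(paragraph.split()),
        "relative_position": index / max(total - 1, 1),
        "percent_count": paragraph.count("%"),
        "is_greeting": any(contains(paragraph, g, whole_word=True) for g in GREETINGS),
    }
    for label, phrases in HINT_WORDS.items():
        features["hint_" + label] = sum(contains(paragraph, p) for p in phrases)
    return features


# Hypothetical service request with a made-up error code.
paragraphs = [
    "Hi, my name is Ana.",
    "The router shows error message %EXAMPLE-1-CODE after reload.",
    "We recommend reseating the line card.",
]
features = [paragraph_features(p, i, len(paragraphs)) for i, p in enumerate(paragraphs)]
for row in features:
    print(row)
```

A classifier such as the hierarchical classifier named in the architecture would consume vectors like these; the sketch stops at feature generation.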
Domain Knowledge
Measuring similarity
• Internetworking Terms and Acronyms Dictionary (ITAD)
• Benefits: (1) the expansion of acronyms and terminology; (2) the enhancement of concept dependencies.
• Example:
Snippet from Doc1: The phone boots up and it does a DHCP [Dynamic Host Configuration Protocol. Provides a mechanism for allocating IP addresses dynamically so that addresses can be reused when hosts no longer need them] request in the native VLAN [virtual LAN]. There it gets an IP address [32-bit address assigned to hosts using TCP/IP] and an option that it needs to boot up in the VLAN 40 and that it need to go in trunking [physical and logical connection between two switches across which network traffic travels] mode.
Snippet from Doc2: Host Server with 2 interfaces [connection between two systems or devices] and one default gateway. When ping Vlan-B [virtual LAN] interface an ARP [Address Resolution Protocol. Internet protocol used to map an IP address to a MAC address] request with a source IP of Vlan-B is sent to Default Router [network layer device that uses one or more metrics to determine the optimal path along which network traffic should be forwarded. Routers forward packets from one network to another based on network layer information] on Vlan-A, but Router does not respond to ARP request.
[…]: explanation from ITAD. Blue: overlapping words between unexpanded excerpts. Red: overlapping words introduced by ITAD.
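The effect of dictionary expansion on similarity can be illustrated with a small sketch. It is a simplification under stated assumptions: Jaccard overlap on word sets stands in for whatever similarity measure the system actually uses, the tokenizer is deliberately crude, and the two dictionary entries are abridged from the ITAD explanations quoted in the snippets.

```python
def tokens(text):
    """Lowercased set of words with surrounding punctuation stripped."""
    cleaned = (w.strip(".,()[]").lower() for w in text.split())
    return {w for w in cleaned if w}


def expand(text, dictionary):
    """Append the dictionary explanation of every known term to the text."""
    extra = [dictionary[w] for w in sorted(tokens(text)) if w in dictionary]
    return " ".join([text] + extra)


def jaccard(a, b):
    """Jaccard overlap between the word sets of two texts."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Two abridged, illustrative dictionary entries.
itad = {
    "dhcp": "Dynamic Host Configuration Protocol allocating IP addresses dynamically",
    "arp": "Address Resolution Protocol used to map an IP address to a MAC address",
}
doc1 = "The phone does a DHCP request in the native VLAN"
doc2 = "Router does not respond to ARP request"

before = jaccard(doc1, doc2)
after = jaccard(expand(doc1, itad), expand(doc2, itad))
print(round(before, 3), round(after, 3))
```

Before expansion the two snippets share only generic words; after expansion they also share terms such as "protocol" and "ip", so the overlap increases.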
Improving Productivity
[Flowchart steps: Browse a service request; Relevant? (Y/N); Read and understand thoroughly; Read enough? (Y/N); Create knowledge article. Measured intervals: time to assess relevance; time to extract knowledge]
Compare the time spent by engineers reading service requests before and after using our system.
                           Time to assess relevance   Time to extract knowledge
Before using system        27 minutes                 97 minutes
After using system         11 minutes                 67 minutes
Productivity improved by   145%                       45%
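The improvement percentages follow from the before and after times if productivity is read as work completed per minute. The helper below is a small sketch of that arithmetic; the function name is an assumption, the numbers are the ones reported in the table.

```python
def productivity_gain(before_minutes, after_minutes):
    """Relative increase in throughput when a task drops from `before` to `after` minutes."""
    return before_minutes / after_minutes - 1


relevance_gain = productivity_gain(27, 11)   # 27/11 - 1
extraction_gain = productivity_gain(97, 67)  # 97/67 - 1
print(f"{relevance_gain:.0%}", f"{extraction_gain:.0%}")
```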
Performance of Service Request Recommender
Result 1: Both the deterministic and the probabilistic model achieved much better results when labeled paragraphs were used; this validates our hypothesis of the inherent diagnostic business process.
Result 2: Using domain knowledge further improves retrieval results.
Result 3: The probabilistic recommender outperformed the deterministic recommender.

Retrieval Schemes
                    Baseline               Our Method
Retrieval models    Deterministic model    Probabilistic model
Information         The whole document     The semantically labeled paragraphs
Domain knowledge    None                   Dictionary
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
– Problem
– Reinforcement Learning Formulation
– How many interaction steps needed
– How much feedback is needed
– Interactive Retrieval Using Topic Modeling
• Interactive Retrieval Demo
Interactive Retrieval
• Model the user intent to retrieve relevant documents
• Identify the trade-off between
– Retrieval accuracy (how accurate does the user require the results to be?)
– Interaction time (how much time is the user willing to spend on interaction?)
• Applied to
– Medical document retrieval (e.g., search for past patient cases with similar symptoms)
– Resume retrieval in a labor marketplace (e.g., search for Python developers who work in machine learning)
MORE IMPORTANT
LESS IMPORTANT
Problem
What is the best path to choose?
[Figure: paths from User Intent to the Set of Relevant Documents over time steps t1, t2, t3, …, tn, for three strategies: Static, Myopic and Dynamic. Dynamic approaches: Dynamic Programming, Reinforcement Learning]
Reinforcement Learning formulation of IIR
• Agent: IIR system
• Environment: User
• Intent: Best guess for user intent or need (expressed in query terms)
• Action: Ranking $R_k$
• Reward: Improvement $v(R_k) - v(R_{k-1})$ (as observed from user feedback)
• Objective: Max. sum of rewards
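The reward and objective above can be made concrete with a short sketch. The ranking values are hypothetical; the point is only that per-step rewards are differences of consecutive values, so their sum telescopes to the total improvement over the session.

```python
def rewards(values):
    """Per-step rewards v(R_k) - v(R_{k-1}) for a sequence of ranking values."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical values v(R_0), ..., v(R_3) observed over one interactive session.
values = [0.3, 0.5, 0.6, 0.8]
step_rewards = rewards(values)
total_reward = sum(step_rewards)  # equals v(R_3) - v(R_0) up to rounding
print(step_rewards, total_reward)
```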
Experiments Set-Up
• Dataset: TREC-9 OHSUMED, 348,566 medical documents
– with a list of relevance judgments
• 65 user queries
– query title: 2 − 5 words
– query description: 5 − 10 words
• Interactive sessions of 3 − 5 steps
• Relevance function is binary
• Value of results (with appropriate weights wi)
– Precision @10: percentage of relevant documents in the top-10 results
– We compare our results with pseudo-relevance feedback
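Precision @10 with binary relevance can be computed as in the sketch below. The document identifiers and relevance judgments are hypothetical and serve only to show the calculation.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked documents that are judged relevant (binary relevance)."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


# Hypothetical ranking and relevance judgments.
ranking = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d6", "d5", "d0"]
relevant = {"d2", "d4", "d5", "d11"}
p10 = precision_at_k(ranking, relevant)  # three of the top ten are relevant
print(p10)
```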
How many interaction steps needed?
How much feedback is needed?
[Figure: precision @10 (y-axis, 0.60 to 0.85) versus the number of documents on which feedback is provided per step (x-axis, 1 to 7)]
Experiments tested on the 348,566-document OHSU-MED medical dataset, TREC 2002
Interactive Retrieval with Topic Modeling
• Topics help us to narrow the search
– They add context to the query
– Some important terms that describe the user's intent may not be included in the query
– Topics are calculated a priori and added to each document as metadata
[Figure: the topic mixture of relevant docs and the topic mixture of non-relevant docs feed a combination of term and topic relevance scores. The meta-query (combination of user inputs) is updated each time the user provides feedback (clicks) or additional information to the system (query redefinition)]
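One possible way to combine term and topic relevance scores is sketched below. Everything in it is an assumption made for illustration, not the authors' scoring function: topic profiles are taken as the mean topic mixture of the documents the user marked relevant or non-relevant, topic relevance is a difference of cosine similarities, and the two scores are blended with a hypothetical `weight`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_mixture(mixtures):
    """Element-wise mean of topic mixtures (the assumed feedback profile)."""
    return [sum(column) / len(mixtures) for column in zip(*mixtures)]


def combined_score(term_score, doc_topics, relevant_profile, nonrelevant_profile, weight=0.5):
    """Blend a term-based score with a topic score derived from user feedback."""
    topic_score = cosine(doc_topics, relevant_profile) - cosine(doc_topics, nonrelevant_profile)
    return weight * term_score + (1 - weight) * topic_score


# Hypothetical topic mixtures of documents the user marked relevant / non-relevant.
relevant_profile = mean_mixture([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]])
nonrelevant_profile = mean_mixture([[0.1, 0.1, 0.8]])

# Two candidate documents with the same term score but different topic mixtures.
on_topic = combined_score(0.4, [0.7, 0.2, 0.1], relevant_profile, nonrelevant_profile)
off_topic = combined_score(0.4, [0.1, 0.2, 0.7], relevant_profile, nonrelevant_profile)
print(on_topic > off_topic)
```

In this sketch the profiles would be recomputed after every round of feedback, which mirrors the meta-query being updated each time the user clicks or redefines the query.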
Proposed Dataset
• We test our approach using the HARD TREC queries. The collection consists of:
– 851,018 news documents from the NYT, APW and XIE agencies
– Each document has an average length of 305 terms
– There are 496,779 unique terms
– We infer the topic information of the corpus using 75 topics
– For testing purposes we use m=3 interactions
– We test with 30 queries
– We compare our algorithm with mixture relevance feedback
Preliminary Results
[Figure: precision (y-axis, 0.1 to 1) versus number of interactions (x-axis, 1 to 3) for two methods: Mixture and State Based]
Outline
• Motivation
• Statistical Topic Modeling - SAP & Saffron
• Knowledge Extraction and Reuse at Cisco
• Interactive Retrieval
• Interactive Retrieval Demo
Example User Intent
• young female with fevers and increased CPK (creatine phosphokinase)
– CPK: enzyme; increased levels may indicate a heart attack or severe muscle breakdown
• neuroleptic malignant syndrome (life-threatening neurological disorder)
– Associated with CPK
– Symptoms: muscular cramps, fever, unstable blood pressure, changes in cognition, including agitation, delirium and coma
• differential diagnosis
– List symptoms
– List causes of the symptoms
– Prioritize by the most dangerous
– Treat
• treatment
Relevant Documents
• Non-relevant documents:
– Doc 1: Significance of elevated levels of CPK in febrile diseases: a prospective study. The incidence and significance of elevated serum levels of (CPK) in febrile diseases were studied prospectively in all patients admitted with fever to a department of medicine during 1 year.
– Doc 2: Metoclopramide-induced neuroleptic malignant syndrome….Symptoms of NMS include rigidity, hyperpyrexia, altered consciousness, and autonomic instability. This syndrome is generally associated with neuroleptic medications used to treat psychotic and major depressive illnesses…
• Relevant document:
– Doc 3: Neuroleptic malignant syndrome: guidelines for treatment and reinstitution of neuroleptics… Cardinal symptoms include fever, muscular rigidity, an elevated serum level of creatine phosphokinase, changes in mental status, and autonomic dysfunction…
Interactive Demo
• InteractiveDemo_MedicalData
• Sub-queries
– young female with fevers and increased CPK
– neuroleptic malignant syndrome
– differential diagnosis
– treatment