MASS COLLABORATION AND DATA MINING
Raghu Ramakrishnan
Founder and CTO, QUIQ
Professor, University of Wisconsin-Madison
Keynote Talk, KDD 2001, San Francisco
Page 2
University of Wisconsin-Madison
DATA MINING
• Is it a creative process requiring a unique combination of tools for each application?
• Or is there a set of operations that can be composed using well-understood principles to solve most target problems?
• Or perhaps there is a framework for addressing large classes of problems that allows us to systematically leverage the results of mining?
Extracting actionable intelligence from large datasets
Page 3
“MINING” APPLICATION CONTEXT
• Scalability is important.
  – But when is a 2x speed-up or scale-up important? When is 10x unimportant?
• What is the appropriate measure, model?
  – Recall, precision
  – MT for search vs. MT for content conversion
Answers to these questions come from the context of the application.
Page 4
TALK OUTLINE
• A New Approach to Customer Support
  – Mass Collaboration
• Technical challenges
  – A framework and infrastructure for P2P knowledge capture and delivery
• Role of data mining
  – Confluence of DB, IR, and mining
Page 5
TYPICAL CUSTOMER SUPPORT
Customer → Web Support → Support Center (KB)
Page 6
TRADITIONAL KNOWLEDGE MANAGEMENT
CONSUMERS pose a QUESTION; EXPERTS provide the ANSWER through the KNOWLEDGEBASE.
Knowledge created and structured by trained experts using a rigorous process.
Page 7
MASS COLLABORATION
QUESTION → SELF SERVICE (KNOWLEDGEBASE) or MASS COLLABORATION (Experts, Partners, Customers, Employees) → ANSWER
Answer added to power self-service.
People using the web to share knowledge and help each other find solutions.
Page 8
TIMELY ANSWERS
6,845 questions, 74% answered
• Answers provided in 3h: 40% (2,057)
• Answers provided in 12h: 65% (3,247)
• Answers provided in 24h: 77% (3,862)
• Answers provided in 48h: 86% (4,328)
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
77% of answers are provided within 24h
Page 9
MASS CONTRIBUTION
Users who on average provide only 2 answers provide 50% of all answers.
Of 6,718 answers (100%), the top 7% of contributing users (120) provide 50% (3,329); the remaining 50% is contributed by the mass of users, 93% (1,503).
Page 10
POWER OF KNOWLEDGE CREATION
Customer Mass Collaboration *) reduces support incidents by 85% through Self-Service *), and Knowledge Creation reduces agent cases by 64%, leaving 5% of incidents as agent cases.
*) Averages from QUIQ implementations
Page 11
TYPICAL SERVICE CHAIN
Self Service (Knowledge base, FAQ, Auto Email): 50% → Manual Email, Chat: 40% ($$) → Call Center, 2nd Tier Support: 10% ($$$$)

QUIQ SERVICE CHAIN
Self Service + Mass Collaboration (QUIQ): 80% → Manual Email, Chat (QUIQ): 15% ($$) → Call Center, 2nd Tier Support: 5% ($$$$)
Page 12
CASE STUDIES: COMPAQ
“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.”
“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq
Page 13
“Austin-based National Instruments deployed … a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. QUIQ increased customers’ participation, flattened call volume and continues to do the work of 50 support engineers.”
– David Daniels, Jupiter Media Metrix
ASP 2001 “Top Ten Support Site”
Page 14
MASS COLLABORATION
Communities + Knowledge Management + Service Workflows
(2×2 chart; axes: Interactions vs. Solutions, Few Experts vs. Many Experts; quadrants: Call Center, Support Newsgroups, Support Knowledge Base, Mass Collaboration)
Mass Collaboration: Internet-scale P2P knowledge sharing
Page 15
CORPORATE MEMORY
Untapped Knowledge in the Extended Business Community: Customers, Partners, Suppliers, Employees → Knowledgebase
Page 16
AREAS OF INTEREST
• Self-Organizing
• User Acquisition
• Incentive to Participate
• Structured User Forum
• User-to-User Exchange
• User-to-Enthusiast
• User-to-Expert
Page 17
GOALS & ISSUES
• Interactions must be structured to encourage creation of “solutions”
  – Resolve issue; escalate if necessary
  – Capture knowledge from interactions
  – Encourage participation
• Sociology
  – Privacy, security
  – Credibility, authority, history
  – Accountability, incentives
Page 18
REQUIRED CAPABILITIES
• Roles: Credibility, administration
  – Moderators, experts, editors, enthusiasts
• Groups: Privacy, security, entitlements
  – Departments, gold customers
• Workflow: QoS, validation, escalation
Page 19
TECHNICAL CHALLENGES
Page 20
SEARCHING “PEOPLE-BASES”
SEARCH → ROUTING, NOTIFICATION
“If it’s not there, find someone who knows” – and get “it” there (knowledge creation)!
Page 21
QUIQ, the “Best in Class” Support Channel
• Call Center: 100% of support incidents become agent cases
• Email Support (Automated Emails 1)): –20% support incidents, 80% agent cases
• Web Self-Service (Self-Service 2)): –42% support incidents, 68% agent cases
• Mass Collaboration (Customer Mass Collaboration, Knowledge Creation): –85% support incidents, –64% agent cases, 5% agent cases remaining
1) Source: QUIQ Client Information
2) Source: Association of Support Professionals
Page 22
SEARCH AND INDEXING
• User types in “How can I configure the IP address on my Presario?”
  – Need to find the most relevant content that is of high quality, is approved for external viewing, and that this user is entitled to see based on her roles, groups, and service levels.
• User decides to post the question because no good answer was found in the KB.
  – Search controls when experts and other users will see this new question; need to make this real-time.
  – Concurrency, recovery issues!
Page 23
SEARCH AND INDEXING
• Data is organized into tabular channels
  – Questions, responses, users, …
• Each item has several fields, e.g., a question has:
  – Author id, author status, service level, item popularity metrics, rating metrics, answer status, approval status, visibility group, update timestamp, notification timestamp, usage signature, category, relevant products, relevant problems, subject, body, responses
Which 5 items should be returned?
Page 24
RUNTIME ARCHITECTURE
Web servers → Hive Manager → Cache, Indexer, Alerts → Files/Logs and DBMS (RAID storage) → Warehouse
Real-time indexing, caching, alerts
Page 25
LEARNING: FROM ACTIVITY DATA TO KNOWLEDGE
Miner and Indexer over Files/Logs and DBMS (RAID storage) → Warehouse
Small reads, large reads/writes; periodic offline activity
Page 26
SEARCH AND INDEXING
• Question text, user attributes, system policies
• IR-style ranked output
• Search constraints:
  – Show matches; subject match twice as important
  – Show only approved answers to non-editors
  – Give preference to category Laptop
  – Give preference to recent solutions
  – Weight quality of solution
Which 5 items should be returned?
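These weighted preferences can be folded into scoring as multiplicative boosts on a base relevance. A minimal Python sketch, with all boost factors and field names as illustrative assumptions (not QUIQ's actual policy):

```python
import time

def boosted_score(item, base_relevance, now=None):
    """Apply policy boosts on top of a base TF-IDF relevance:
    prefer category Laptop, prefer recent solutions, and weight
    by solution quality. All factors here are illustrative."""
    now = now if now is not None else time.time()
    score = base_relevance
    if item.get("category") == "Laptop":
        score *= 1.5                             # category preference
    age_days = (now - item["created"]) / 86400.0
    score *= 1.0 / (1.0 + age_days / 365.0)      # recency preference
    score *= item.get("quality", 1.0)            # quality of solution
    return score
```

The hard "show only approved answers" constraint would be a yes/no filter rather than a boost; that distinction is made explicit on the hybrid DB-IR slides below.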
Page 27
VECTOR SPACE MODEL
• Documents, queries are vectors in term space
• Vector distance from the query is used to rank retrieved documents
Q = (w_{Q1}, w_{Q2}, …, w_{Qt})
D_i = (w_{i1}, w_{i2}, …, w_{it})
sim(Q, D_i) = Σ_{k=1}^{t} w_{Qk} · w_{ik}   (unnormalized)
i’th term in summation can be seen as the “relevance contribution” of term i
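The summation can be sketched directly, assuming sparse vectors are represented as dicts from term to weight (a hypothetical representation):

```python
def sim(query_vec, doc_vec):
    """Unnormalized vector-space similarity: the sum of per-term
    relevance contributions w_Qk * w_ik over shared terms."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

q = {"configure": 1.0, "ip": 1.0, "presario": 2.0}
d = {"ip": 0.5, "presario": 0.8, "laptop": 0.3}
# "configure" is absent from d, so it contributes nothing:
# sim(q, d) = 1.0*0.5 + 2.0*0.8 = 2.1
```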
Page 28
TF-IDF DOCUMENT VECTOR
w_{ik} = tf_{ik} · log(N / n_k)

T_k     = term k in document D_i
tf_{ik} = frequency of term T_k in document D_i
idf_k   = inverse document frequency of term T_k in collection C
N       = total number of documents in collection C
n_k     = number of documents in C that contain T_k
idf_k   = log(N / n_k)
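A direct transcription of the weight formula, assuming natural log (the choice of base only rescales all weights uniformly):

```python
import math

def tfidf_weight(tf_ik, N, n_k):
    """w_ik = tf_ik * log(N / n_k): term frequency in the document,
    scaled by the term's inverse document frequency."""
    return tf_ik * math.log(N / n_k)

# A term occurring 3 times in a document, in a 10,000-document
# collection where 100 documents contain the term:
w = tfidf_weight(3, 10_000, 100)   # 3 * log(100)
```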
Page 29
A HYBRID DB-IR SYSTEM
• Searches are queries with three parts:
  – Filter: DB-style yes/no criteria
  – Match: TF-IDF relevance based on a combination of fields
  – Quality: relevance “boost” based on a policy
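A minimal sketch of the three-part scoring, with a hypothetical interface (the system's actual APIs are not shown in the slides):

```python
def score(item, filter_pred, match_fn, quality_boost):
    """Three-part hybrid search score: Filter is a DB-style yes/no
    predicate, Match a TF-IDF relevance function, Quality a
    policy-driven multiplicative boost."""
    if not filter_pred(item):       # Filter: failing items are excluded
        return None
    return match_fn(item) * quality_boost(item)  # Match x Quality

# Example: only approved items, boost editor-rated content.
item = {"approved": True, "rated": True, "rel": 0.8}
s = score(item,
          filter_pred=lambda it: it["approved"],
          match_fn=lambda it: it["rel"],
          quality_boost=lambda it: 1.25 if it["rated"] else 1.0)
```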
Page 30
A HYBRID DB-IR SYSTEM
• A query is built up from atomic constraints using Boolean operators.
• Atomic constraint:
  – [ value op term, constraint-type ]
  – Terms are drawn from discrete domains and are of two types: hierarchy and scalar
  – Constraint-type is exact or approximate
Page 31
A HYBRID DB-IR SYSTEM
• Applying an atomic constraint to a set of items returns a tagged result set:
  – The result inherits the constraint-type
  – Each result item has a (TF-IDF) relevance score; 0 for exact
• Combining two tagged item sets using Boolean operators yields a tagged set:
  – The result type is exact if both inputs are exact, and approximate otherwise
  – The result contains the intersection of the input item sets if either input is exact; the union otherwise
  – Each result item is tagged with a combined relevance
Page 32
A HYBRID DB-IR SYSTEM
• The semantics of Boolean expressions over constraints is associative and commutative
• Evaluating exact and approximate constraints separately (in the DB and IR subsystems) is a special case. Additionally:
  – Uniform handling of relevance contributions of categories, popularity metrics, recency, etc.
• Absolute and relative relevance modifiers can be introduced for greater flexibility.
Page 33
CONCURRENCY, RECOVERY, PARALLELISM
• Concurrency
  – Index is updated in real time
  – Automatic partitioning and a two-step locking protocol result in very low overhead
  – Relies upon post-processing to address some anomalies
• Recovery
  – Partitioning is again the key
  – Leverages the recovery guarantees of the DBMS
  – The approach also supports efficient refresh of global statistics
• Parallelism
  – Hash-based partitioning
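Hash-based partitioning can be sketched as follows, using a deterministic CRC32 hash (the partition count and hash choice are illustrative assumptions):

```python
import zlib

NUM_PARTITIONS = 8   # illustrative

def partition_of(term, num_partitions=NUM_PARTITIONS):
    """Hash-based partitioning: each index term is owned by exactly
    one partition, so index updates touching different partitions
    can be locked, recovered, and parallelized independently."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions
```

A deterministic hash (unlike Python's salted built-in `hash`) matters here: the same term must land on the same partition across processes and restarts.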
Page 34
NOTIFICATION
• Extension of search: each user can define one or more “standing searches” and request instant or periodic notification.
  – Boolean combinations of atomic constraints.
• Major challenges:
  – Scaling with the number of standing searches: requires multiple timestamps and indexing the searches.
  – Exactly-once delivery: many subtleties center around the “notifiability” of updates!
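One way to sketch standing searches with per-search timestamps, a simplification in which the “notifiability” subtleties are deliberately not modeled:

```python
def pending_notifications(items, standing_searches):
    """Match updated items against standing searches. Each search
    keeps a last-notified timestamp; only items updated after it
    are delivered, one way to approach exactly-once delivery."""
    out = {}
    for s in standing_searches:
        hits = [it for it in items
                if it["updated"] > s["last_notified"] and s["pred"](it)]
        if hits:
            s["last_notified"] = max(it["updated"] for it in hits)
            out[s["name"]] = hits
    return out
```

Running the function twice over the same items delivers each match once: the advanced timestamp suppresses the second delivery.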
Page 35
ROLE OF DATA MINING
Page 36
DATA MINING TASKS
• There is a lot of insight to be gained by analyzing the data.– What will help the user with her problem?– Who does a given user trust?– Characteristic metrics for high-quality content.– Identify helpful content in similar, past queries.– Summarize content.– Who can answer this question?
Page 37
LEVERAGING DATA MINING
• How do we get at the data?
  – Relevant information is distributed across several sources, not just the DBMS.
  – Aggregated in a warehouse.
• How do we incorporate the insights obtained by mining into the search phase?
  – Need to constantly update info about every piece of content (Qs, As, users, …)
Page 38
LEVERAGING DATA MINING
• Three-step approach:
  – Off-line analysis to gather new insight
  – Periodically refresh indexes
  – Use insight (from KB/index) to improve search using the extended DB/IR query framework
Use mining to create useful metadata
Page 39
SOME UNIQUE TWISTS
• Identify the kinds of feedback that would be helpful in refining a search.
  – I.e., not just specific terms, but the types of concepts that would be useful discriminators (e.g., a good hierarchy of feedback concepts)
• Metrics of quality
  – Link analysis is a good example, but what are the “links” here?
• Self-tuning searches– The more the knobs, the more the choices– Next step: self-personalizing searches?
Page 40
CONCLUSIONS
Page 41
CONFLUENCES
IR search, DB queries, P2P KM … ?