MASS COLLABORATION AND DATA MINING
Raghu Ramakrishnan
Founder and CTO, QUIQ
Professor, University of Wisconsin-Madison
Keynote Talk, KDD 2001, San Francisco
Page 2
University of Wisconsin-Madison
DATA MINING
• Is it a creative process requiring a unique combination of tools for each application?
• Or is there a set of operations that can be composed using well-understood principles to solve most target problems?
• Or perhaps there is a framework for addressing large classes of problems that allows us to systematically leverage the results of mining?
Extracting actionable intelligence from large datasets
Page 3
“MINING” APPLICATION CONTEXT
• Scalability is important.
  – But when is a 2x speed-up or scale-up important? When is 10x unimportant?
• What is the appropriate measure, model?
  – Recall, precision
  – MT for search vs. MT for content conversion
Answers to these questions come from the context of the application.
Page 4
TALK OUTLINE
• A New Approach to Customer Support
  – Mass Collaboration
• Technical challenges
  – A framework and infrastructure for P2P knowledge capture and delivery
• Role of data mining
  – Confluence of DB, IR, and mining
Page 5
TYPICAL CUSTOMER SUPPORT
Customer → Web Support → Support Center (KB)
Page 6
TRADITIONAL KNOWLEDGE MANAGEMENT
CONSUMERS pose a QUESTION; EXPERTS provide the ANSWER through the KNOWLEDGEBASE.
Knowledge created and structured by trained experts using a rigorous process.
Page 7
MASS COLLABORATION
QUESTION → SELF SERVICE (KNOWLEDGEBASE) or MASS COLLABORATION (Experts, Partners, Customers, Employees) → ANSWER
Answer added to power self-service.
People using the web to share knowledge and help each other find solutions.
Page 8
TIMELY ANSWERS
6,845 questions, 74% answered
• Answers provided in 3h: 40% (2,057)
• Answers provided in 12h: 65% (3,247)
• Answers provided in 24h: 77% (3,862)
• Answers provided in 48h: 86% (4,328)
• No effort to answer each question
• No added experts
• No monetary incentives for enthusiasts
77% of answers are provided within 24h
Page 9
MASS CONTRIBUTION
Users who on average provide only 2 answers provide 50% of all answers.
Of 6,718 answers (100%), the top 7% of contributing users (120) provide 50% (3,329); the remaining 50% is contributed by the mass of users, 93% (1,503).
Page 10
POWER OF KNOWLEDGE CREATION
Customer Mass Collaboration *) reduces support incidents by 85% through Self-Service *), and Knowledge Creation reduces agent cases by 64%, leaving 5% of incidents as agent cases.
*) Averages from QUIQ implementations
Page 11
TYPICAL SERVICE CHAIN
Self Service (Knowledge base, FAQ, Auto Email): 50% → Manual Email, Chat: 40% ($$) → Call Center, 2nd Tier Support: 10% ($$$$)

QUIQ SERVICE CHAIN
Self Service + Mass Collaboration (QUIQ): 80% → Manual Email, Chat (QUIQ): 15% ($$) → Call Center, 2nd Tier Support: 5% ($$$$)
Page 12
CASE STUDIES: COMPAQ
“In newsgroups, conversations disappear and you have to ask the same question over and over again. The thing that makes the real difference is the ability for customers to collaborate and have information be persistent. That’s how we found QUIQ. It’s exactly the philosophy we’re looking for.”
“Tech support people can’t keep up with generating content and are not experts on how to effectively utilize the product … Mass Collaboration is the next step in Customer Service.”
– Steve Young, VP of Customer Care, Compaq
Page 13
“Austin-based National Instruments deployed … a Network to capture the specialized knowledge of its clients and take the burden off its costly support engineers, and is pleased with the results. QUIQ increased customers’ participation, flattened call volume and continues to do the work of 50 support engineers.”
– David Daniels, Jupiter Media Metrix
ASP 2001 “Top Ten Support Site”
Page 14
MASS COLLABORATION
Communities + Knowledge Management + Service Workflows
(2×2 chart; axes: Interactions vs. Solutions, Few Experts vs. Many Experts; quadrants: Call Center, Support Newsgroups, Support Knowledge Base, Mass Collaboration)
Mass Collaboration: Internet-scale P2P knowledge sharing
Page 15
CORPORATE MEMORY
Untapped Knowledge in the Extended Business Community: Customers, Partners, Suppliers, Employees → Knowledgebase
Page 16
AREAS OF INTEREST
• Self-Organizing
• User Acquisition
• Incentive to Participate
• Structured User Forum
• User-to-User Exchange
• User-to-Enthusiast
• User-to-Expert
Page 17
GOALS & ISSUES
• Interactions must be structured to encourage creation of “solutions”
  – Resolve issue; escalate if necessary
  – Capture knowledge from interactions
  – Encourage participation
• Sociology
  – Privacy, security
  – Credibility, authority, history
  – Accountability, incentives
Page 18
REQUIRED CAPABILITIES
• Roles: Credibility, administration
  – Moderators, experts, editors, enthusiasts
• Groups: Privacy, security, entitlements
  – Departments, gold customers
• Workflow: QoS, validation, escalation
Page 19
TECHNICAL CHALLENGES
Page 20
SEARCHING “PEOPLE-BASES”
SEARCH → ROUTING, NOTIFICATION
“If it’s not there, find someone who knows” – and get “it” there (knowledge creation)!
Page 21
QUIQ, the “Best in Class” Support Channel
• Call Center: 100% of support incidents become agent cases
• Email Support (Automated Emails 1)): –20% support incidents, 80% agent cases
• Web Self-Service (Self-Service 2)): –42% support incidents, 68% agent cases
• Mass Collaboration (Customer Mass Collaboration, Knowledge Creation): –85% support incidents, –64% agent cases, 5% agent cases remaining
1) Source: QUIQ Client Information
2) Source: Association of Support Professionals
Page 22
SEARCH AND INDEXING
• User types in “How can I configure the IP address on my Presario?”
  – Need to find the most relevant content that is of high quality, is approved for external viewing, and that this user is entitled to see based on her roles, groups, and service levels.
• User decides to post the question because no good answer was found in the KB.
  – Search controls when experts and other users will see this new question; need to make this real-time.
  – Concurrency, recovery issues!
Page 23
SEARCH AND INDEXING
• Data is organized into tabular channels
  – Questions, responses, users, …
• Each item has several fields, e.g., a question has:
  – Author id, author status, service level, item popularity metrics, rating metrics, answer status, approval status, visibility group, update timestamp, notification timestamp, usage signature, category, relevant products, relevant problems, subject, body, responses
Which 5 items should be returned?
Page 24
RUNTIME ARCHITECTURE
Web servers → Hive Manager → Cache, Indexer, Alerts → Files/Logs and DBMS (RAID storage) → Warehouse
Real-time indexing, caching, alerts
Page 25
LEARNING: FROM ACTIVITY DATA TO KNOWLEDGE
Miner and Indexer over Files/Logs and DBMS (RAID storage) → Warehouse
Small reads, large reads/writes; periodic offline activity
Page 26
SEARCH AND INDEXING
• Question text, user attributes, system policies
• IR-style ranked output
• Search constraints:
  – Show matches; subject match twice as important
  – Show only approved answers to non-editors
  – Give preference to category Laptop
  – Give preference to recent solutions
  – Weight quality of solution
Which 5 items should be returned?
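These weighted preferences can be folded into scoring as multiplicative boosts on a base relevance. A minimal Python sketch, with all boost factors and field names as illustrative assumptions (not QUIQ's actual policy):

```python
import time

def boosted_score(item, base_relevance, now=None):
    """Apply policy boosts on top of a base TF-IDF relevance:
    prefer category Laptop, prefer recent solutions, and weight
    by solution quality. All factors here are illustrative."""
    now = now if now is not None else time.time()
    score = base_relevance
    if item.get("category") == "Laptop":
        score *= 1.5                             # category preference
    age_days = (now - item["created"]) / 86400.0
    score *= 1.0 / (1.0 + age_days / 365.0)      # recency preference
    score *= item.get("quality", 1.0)            # quality of solution
    return score
```

The hard "show only approved answers" constraint would be a yes/no filter rather than a boost; that distinction is made explicit on the hybrid DB-IR slides below.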
Page 27
VECTOR SPACE MODEL
• Documents, queries are vectors in term space
• Vector distance from the query is used to rank retrieved documents
Q = (w_{Q1}, w_{Q2}, …, w_{Qt})
D_i = (w_{i1}, w_{i2}, …, w_{it})
sim(Q, D_i) = Σ_{k=1}^{t} w_{Qk} · w_{ik}   (unnormalized)
i’th term in summation can be seen as the “relevance contribution” of term i
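The summation can be sketched directly, assuming sparse vectors are represented as dicts from term to weight (a hypothetical representation):

```python
def sim(query_vec, doc_vec):
    """Unnormalized vector-space similarity: the sum of per-term
    relevance contributions w_Qk * w_ik over shared terms."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

q = {"configure": 1.0, "ip": 1.0, "presario": 2.0}
d = {"ip": 0.5, "presario": 0.8, "laptop": 0.3}
# "configure" is absent from d, so it contributes nothing:
# sim(q, d) = 1.0*0.5 + 2.0*0.8 = 2.1
```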
Page 28
TF-IDF DOCUMENT VECTOR
w_{ik} = tf_{ik} · log(N / n_k)

T_k     = term k in document D_i
tf_{ik} = frequency of term T_k in document D_i
idf_k   = inverse document frequency of term T_k in collection C
N       = total number of documents in collection C
n_k     = number of documents in C that contain T_k
idf_k   = log(N / n_k)
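A direct transcription of the weight formula, assuming natural log (the choice of base only rescales all weights uniformly):

```python
import math

def tfidf_weight(tf_ik, N, n_k):
    """w_ik = tf_ik * log(N / n_k): term frequency in the document,
    scaled by the term's inverse document frequency."""
    return tf_ik * math.log(N / n_k)

# A term occurring 3 times in a document, in a 10,000-document
# collection where 100 documents contain the term:
w = tfidf_weight(3, 10_000, 100)   # 3 * log(100)
```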
Page 29
A HYBRID DB-IR SYSTEM
• Searches are queries with three parts:
  – Filter: DB-style yes/no criteria
  – Match: TF-IDF relevance based on a combination of fields
  – Quality: relevance “boost” based on a policy
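A minimal sketch of the three-part scoring, with a hypothetical interface (the system's actual APIs are not shown in the slides):

```python
def score(item, filter_pred, match_fn, quality_boost):
    """Three-part hybrid search score: Filter is a DB-style yes/no
    predicate, Match a TF-IDF relevance function, Quality a
    policy-driven multiplicative boost."""
    if not filter_pred(item):       # Filter: failing items are excluded
        return None
    return match_fn(item) * quality_boost(item)  # Match x Quality

# Example: only approved items, boost editor-rated content.
item = {"approved": True, "rated": True, "rel": 0.8}
s = score(item,
          filter_pred=lambda it: it["approved"],
          match_fn=lambda it: it["rel"],
          quality_boost=lambda it: 1.25 if it["rated"] else 1.0)
```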
Page 30
A HYBRID DB-IR SYSTEM
• A query is built up from atomic constraints using Boolean operators.
• Atomic constraint:
  – [ value op term, constraint-type ]
  – Terms are drawn from discrete domains and are of two types: hierarchy and scalar
  – Constraint-type is exact or approximate
Page 31
A HYBRID DB-IR SYSTEM
• Applying an atomic constraint to a set of items returns a tagged result set:
  – The result inherits the constraint-type
  – Each result item has a (TF-IDF) relevance score; 0 for exact
• Combining two tagged item sets using Boolean operators yields a tagged set:
  – The result type is exact if both inputs are exact, and approximate otherwise
  – The result contains the intersection of the input item sets if either input is exact; the union otherwise
  – Each result item is tagged with a combined relevance
Page 32
A HYBRID DB-IR SYSTEM
• The semantics of Boolean expressions over constraints is associative and commutative
• Evaluating exact and approximate constraints separately (in the DB and IR subsystems) is a special case. Additionally:
  – Uniform handling of relevance contributions of categories, popularity metrics, recency, etc.
• Absolute and relative relevance modifiers can be introduced for greater flexibility.
Page 33
CONCURRENCY, RECOVERY, PARALLELISM
• Concurrency
  – Index is updated in real time
  – Automatic partitioning and a two-step locking protocol result in very low overhead
  – Relies upon post-processing to address some anomalies
• Recovery
  – Partitioning is again the key
  – Leverages the recovery guarantees of the DBMS
  – The approach also supports efficient refresh of global statistics
• Parallelism
  – Hash-based partitioning
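Hash-based partitioning can be sketched as follows, using a deterministic CRC32 hash (the partition count and hash choice are illustrative assumptions):

```python
import zlib

NUM_PARTITIONS = 8   # illustrative

def partition_of(term, num_partitions=NUM_PARTITIONS):
    """Hash-based partitioning: each index term is owned by exactly
    one partition, so index updates touching different partitions
    can be locked, recovered, and parallelized independently."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions
```

A deterministic hash (unlike Python's salted built-in `hash`) matters here: the same term must land on the same partition across processes and restarts.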
Page 34
NOTIFICATION
• Extension of search: each user can define one or more “standing searches” and request instant or periodic notification.
  – Boolean combinations of atomic constraints.
• Major challenges:
  – Scaling with the number of standing searches: requires multiple timestamps and indexing the searches.
  – Exactly-once delivery: many subtleties center around the “notifiability” of updates!
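One way to sketch standing searches with per-search timestamps, a simplification in which the “notifiability” subtleties are deliberately not modeled:

```python
def pending_notifications(items, standing_searches):
    """Match updated items against standing searches. Each search
    keeps a last-notified timestamp; only items updated after it
    are delivered, one way to approach exactly-once delivery."""
    out = {}
    for s in standing_searches:
        hits = [it for it in items
                if it["updated"] > s["last_notified"] and s["pred"](it)]
        if hits:
            s["last_notified"] = max(it["updated"] for it in hits)
            out[s["name"]] = hits
    return out
```

Running the function twice over the same items delivers each match once: the advanced timestamp suppresses the second delivery.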
Page 35
ROLE OF DATA MINING
Page 36
DATA MINING TASKS
• There is a lot of insight to be gained by analyzing the data.– What will help the user with her problem?– Who does a given user trust?– Characteristic metrics for high-quality content.– Identify helpful content in similar, past queries.– Summarize content.– Who can answer this question?
Page 37
LEVERAGING DATA MINING
• How do we get at the data?
  – Relevant information is distributed across several sources, not just the DBMS.
  – Aggregated in a warehouse.
• How do we incorporate the insights obtained by mining into the search phase?
  – Need to constantly update info about every piece of content (Qs, As, users, …)
Page 38
LEVERAGING DATA MINING
• Three-step approach:
  – Off-line analysis to gather new insight
  – Periodically refresh indexes
  – Use insight (from KB/index) to improve search using the extended DB/IR query framework
Use mining to create useful metadata
Page 39
SOME UNIQUE TWISTS
• Identify the kinds of feedback that would be helpful in refining a search.
  – I.e., not just specific terms, but the types of concepts that would be useful discriminators (e.g., a good hierarchy of feedback concepts)
• Metrics of quality
  – Link analysis is a good example, but what are the “links” here?
• Self-tuning searches– The more the knobs, the more the choices– Next step: self-personalizing searches?
Page 40
CONCLUSIONS
Page 41
CONFLUENCES
IR search, DB queries, P2P KM … ?